Performance Tuning
Thread pools, cache sizes, batch sizes, and hardware recommendations for a busy Mothership. How to profile, where to look, and what each knob actually changes.
Overview
Aura's Mothership is, at its core, a streaming system. Peers push function-level updates, the server appends intents, resolves function identities, updates zone state, and fans changes out to other interested peers. At small scale, everything fits comfortably on a single modest instance; at several hundred concurrent peers, the shape of the work — many small writes, fan-out reads, a hot identity cache — starts to reward careful tuning.
This page describes how to read the telemetry, which knob affects which metric, and the hardware shapes we have seen work. The goal is to give a platform team enough of a mental model to tune without guessing.
Default sizing
Out of the box, the Mothership targets a deployment comfortably handling 50 concurrent peers with headroom. Defaults:
```toml
[server]
workers = 8                  # request handler threads
accept_backlog = 1024

[database]
pool_size = 16
statement_timeout_ms = 10000

[sync]
batch_size = 128
flush_interval_ms = 500
max_in_flight = 1024

[cache]
identity_cache_mb = 256
fragment_cache_mb = 1024

[runtime]
tokio_worker_threads = 0     # 0 = auto, one per physical core
```
These are conservative. They work well for teams up to about 75 engineers. Beyond that, expect to tune at least pool_size, identity_cache_mb, and max_in_flight.
Hardware recommendations
The recommendations below come from production deployments we help operate. They are starting points, not hard requirements; your workload shape will pull you in one direction or another.
| Peers (concurrent) | CPU (physical cores) | RAM | Postgres shape | Network |
| --- | --- | --- | --- | --- |
| Up to 50 | 4 | 8 GB | Single instance, 2 vCPU, 4 GB RAM | 1 Gbit/s |
| 50–150 | 8 | 16 GB | Single instance, 4 vCPU, 16 GB RAM | 1 Gbit/s |
| 150–300 | 16 | 32 GB | Primary + read replica, 8 vCPU each | 10 Gbit/s |
| 300–500 | 32 | 64 GB | Primary + 2 read replicas, 16 vCPU each | 10 Gbit/s |
| 500+ | Horizontal scale recommended | | Managed HA Postgres, 32 vCPU+ | 10 Gbit/s |
Concurrent peers is the right axis, not engineer count. An engineer who is pushing intensively counts as one peer; an engineer at a standup counts as less than one. Multiply engineer count by 0.4 as a rule of thumb for "typical concurrent."
Agent fleets shift the math. A fleet of 200 autonomous agents that push every few seconds during active hours behaves like 200 engineers actively pushing — allocate hardware accordingly.
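As a back-of-envelope check, the sizing arithmetic above can be expressed as a small helper. The function name is illustrative; the 0.4 multiplier is the rule of thumb from the text, and actively pushing agents are counted one-for-one per the paragraph above.

```python
def estimate_concurrent_peers(engineers: int, active_agents: int = 0) -> int:
    """Rule of thumb from the text: typical concurrency is about
    0.4x engineer count; actively pushing agents count one-for-one."""
    return round(engineers * 0.4) + active_agents

# 150 engineers plus a 200-agent fleet: 60 + 200 = 260 concurrent peers,
# which falls in the 150-300 row of the hardware table.
```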
The four knobs that matter most
If you tune nothing else, tune these four. They account for the large majority of performance issues we see.
1. Postgres pool size
Every Mothership worker that serves a request checks out a database connection. If the pool is undersized, workers queue waiting for a connection, latency rises, and the aura_postgres_pool_wait_seconds metric becomes non-zero.
Rule of thumb: pool_size = max(16, workers * 2). Beyond a pool size of about 100, you are likely bottlenecked on Postgres itself, not on the pool; scale Postgres first.
```toml
[database]
pool_size = 64
statement_timeout_ms = 15000
```
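The rule of thumb above can be sketched as a function. The cap at 100 is our reading of the guidance that beyond that point Postgres, not the pool, is the bottleneck:

```python
def recommended_pool_size(workers: int) -> int:
    """pool_size = max(16, workers * 2), per the rule of thumb in the text.
    Capped at 100: past that, scale Postgres instead of the pool."""
    return min(max(16, workers * 2), 100)
```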
2. Identity cache size
Function identity resolution is the hottest path in the Mothership. The cache holds recently-resolved identities so that repeated pushes touching the same functions do not re-scan history.
The correct size is "large enough that the working set fits." You can tell by watching aura_identity_cache_hit_ratio: a sustained ratio below 0.95 means the cache is too small. A well-provisioned deployment sees ratios above 0.99.
```toml
[cache]
identity_cache_mb = 2048
```
Above a few gigabytes, returns diminish. At that point a bigger cache will not help; a Postgres read replica will.
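If you collect raw hit and miss counters rather than the ratio itself, the undersizing check is a one-liner. The counter-based signature here is an assumption about your telemetry pipeline; the 0.95 threshold comes from the text:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    # Treat an idle cache (no lookups yet) as fully warm.
    total = hits + misses
    return hits / total if total else 1.0

def cache_undersized(hits: int, misses: int) -> bool:
    # Text's threshold: a sustained ratio below 0.95 means too small.
    return cache_hit_ratio(hits, misses) < 0.95
```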
3. Sync batch size
Peers send function-level updates one at a time, but the Mothership batches them before writing to Postgres and pushing to downstream peers. Small batches mean more round trips and higher per-commit overhead; large batches mean higher tail latency for the last update in a batch.
Defaults are tuned for interactive feel. For agent-heavy workloads where latency matters less than throughput, increase the batch:
```toml
[sync]
batch_size = 512
flush_interval_ms = 250
max_in_flight = 4096
```
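The flush policy these two settings control can be pictured as a toy batcher: flush when batch_size updates accumulate or when flush_interval elapses, whichever comes first. This is an illustrative sketch, not the Mothership's actual implementation:

```python
import time

class Batcher:
    """Toy sketch of size-or-interval batching (illustrative only)."""

    def __init__(self, batch_size=512, flush_interval=0.25, flush_fn=print):
        self.batch_size = batch_size
        self.flush_interval = flush_interval  # seconds
        self.flush_fn = flush_fn
        self.pending = []
        self.last_flush = time.monotonic()

    def push(self, update):
        self.pending.append(update)
        now = time.monotonic()
        # Flush on whichever trigger fires first: size or elapsed time.
        if (len(self.pending) >= self.batch_size
                or now - self.last_flush >= self.flush_interval):
            self.flush(now)

    def flush(self, now=None):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending = []
        self.last_flush = now if now is not None else time.monotonic()
```

Larger batch_size amortizes per-commit overhead; a shorter flush_interval bounds how long the last update in a batch waits.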
4. Tokio worker threads
The async runtime defaults to one worker per physical core. On hyperthreaded systems running exclusively the Mothership, leaving the default is correct. On mixed workloads (Mothership co-resident with other services), cap it:
```toml
[runtime]
tokio_worker_threads = 8
```
Profiling a busy Mothership
The Mothership ships a set of metrics chosen for the questions platform teams actually ask. The ones to alert on:
- aura_request_duration_seconds — P50, P95, P99 of overall request latency. The top-level SLO signal.
- aura_postgres_pool_wait_seconds — any sustained non-zero value points at pool exhaustion.
- aura_identity_cache_hit_ratio — below 0.95 means the cache is undersized or the working set is unusual.
- aura_sync_queue_depth — sustained growth means downstream cannot keep up.
- aura_fragment_store_latency_seconds — object-store P99 above 500 ms typically means throttling.
- aura_intent_log_append_latency_seconds — normally the lowest-latency path; spikes here are alarming.
The built-in tracing endpoint emits OpenTelemetry traces for every request. For one-off investigations, enable verbose tracing on a single workload:
```shell
curl -X POST https://aura.internal.example.com/debug/tracing \
  -H "Authorization: Bearer $ADMIN" \
  -d '{"level":"debug","duration_seconds":60}'
```
The endpoint enables trace sampling at 100% for the specified window, then reverts automatically.
Common bottleneck patterns
Pattern: latency spikes correlated with large merges
Symptom: aura_request_duration_seconds P99 spikes every time a big merge lands. aura_postgres_pool_wait_seconds also spikes.
Diagnosis: the merge touches many functions, blows out the identity cache, causes a burst of cache-miss queries that exhaust the pool.
Fix: (1) Increase identity_cache_mb so the working set after the merge still fits. (2) Increase pool_size by 50%. (3) If the merge is predictable, warm the cache beforehand: aura admin cache warm --branch main.
Pattern: sync queue depth grows during agent storms
Symptom: when an agent fleet starts a parallel refactor, aura_sync_queue_depth grows monotonically. Eventually max_in_flight is reached and peers see backpressure.
Diagnosis: fan-out to downstream peers cannot keep up with push rate.
Fix: raise max_in_flight and batch_size. If the receiving peers are themselves the bottleneck, enable sync.compression = "zstd" to reduce bandwidth.
Pattern: Postgres CPU at 100%
Symptom: Postgres CPU saturates, overall latency rises, the Mothership's own CPU is modest.
Diagnosis: the database is the bottleneck, not the Mothership.
Fix: (1) Check slow query logs. The usual suspect is a zone-lookup query that has stopped using its index after a Postgres version bump; REINDEX fixes it. (2) Add a read replica and point identity-resolution reads at it (database.read_replica_url). (3) For the largest deployments, partition the intent log by repo.
Pattern: fragment store slow
Symptom: aura_fragment_store_latency_seconds P99 crosses 500 ms during working hours.
Diagnosis: object-store throttling. S3's per-prefix throughput caps are the usual culprit on larger deployments.
Fix: (1) Enable prefix sharding: fragment_store.prefix_sharding = 64. This spreads writes across prefixes and evades per-prefix throttles. (2) Increase fragment_cache_mb so hot reads do not touch the object store.
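Prefix sharding can be pictured as hashing each fragment id into one of N key prefixes so that no single object-store prefix absorbs all the write throughput. The key layout below is illustrative; the actual scheme behind fragment_store.prefix_sharding is not documented here:

```python
import hashlib

def sharded_key(fragment_id: str, shards: int = 64) -> str:
    """Deterministically map a fragment to one of `shards` key prefixes.
    Illustrative layout, not the Mothership's real key scheme."""
    digest = hashlib.sha256(fragment_id.encode()).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"fragments/{shard:02x}/{fragment_id}"
```

Because the shard is derived from the fragment id, reads and writes for the same fragment always hit the same prefix, while the fleet's traffic spreads across all 64.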
Pattern: cold start after restart
Symptom: right after a deploy, latencies are elevated for 5–15 minutes, then settle.
Diagnosis: caches are cold.
Fix: warm them explicitly during deploy, or add readiness checks that delay traffic routing until a warm threshold is reached:
```toml
[readiness]
require_cache_warm = true
warm_threshold_ratio = 0.85
```
Horizontal scaling
The Mothership is stateless. Scaling out is straightforward: add replicas behind the load balancer. The only state that needs coordination is the intent-log append sequence, which is serialized through Postgres; adding replicas does not change that property.
A typical horizontal layout for 300+ peers:
- 4 Mothership replicas, 8 physical cores and 32 GB RAM each.
- HA Postgres with one primary and two read replicas.
- Managed object storage with prefix sharding enabled.
- L7 load balancer with sticky sessions on peer identity (helps cache locality).
Sticky sessions are not required for correctness — any replica can serve any peer — but they improve identity-cache hit ratios by keeping a given peer's traffic on a consistent replica.
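One way a balancer can implement this kind of stickiness is rendezvous (highest-random-weight) hashing, which pins each peer to one replica and only remaps a peer when its own replica disappears. A sketch of the technique, not a description of any particular load balancer:

```python
import hashlib

def pick_replica(peer_id: str, replicas: list[str]) -> str:
    """Rendezvous hashing: score every (peer, replica) pair and pick the
    highest. Removing a replica only remaps the peers it was serving."""
    def weight(replica: str) -> bytes:
        return hashlib.sha256(f"{peer_id}:{replica}".encode()).digest()
    return max(replicas, key=weight)
```

This preserves identity-cache locality across replica additions and removals far better than naive `hash(peer) % n`, which reshuffles nearly every peer when n changes.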
Batch workloads and off-hours jobs
Migration imports, bulk re-indexing, and long-running audit exports should not contend with interactive traffic. Three options:
- Rate-limit them: the CLI honors AURA_MAX_QPS for bulk operations.
- Route them to a dedicated replica with --mothership bulk.aura.internal.example.com. A separate replica sized for throughput, not latency, absorbs the load.
- Schedule them for off-peak via your orchestrator.
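A rate limit like AURA_MAX_QPS is typically enforced with a token bucket. This sketch shows the general technique; the CLI's actual algorithm is not documented here:

```python
import time

class TokenBucket:
    """Token-bucket QPS limiter (illustrative sketch)."""

    def __init__(self, max_qps: float):
        self.rate = max_qps        # tokens refilled per second
        self.tokens = max_qps      # start with a full bucket
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bulk client would call try_acquire before each request and sleep briefly when it returns False.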
The monitoring dashboard
Naridon publishes a reference Grafana dashboard covering every metric above, with pre-configured alerts. It is included in the self-hosted deployment reference manifests and available as a standalone JSON at github.com/Naridon-Inc/aura/contrib/grafana.
Panels to pin at the top:
- Request latency P50/P95/P99 (global SLO).
- Postgres pool wait (most common tuning signal).
- Identity cache hit ratio (silent killer when it drifts).
- Sync queue depth (agent-storm canary).
- Fragment store P99 (object-store throttling canary).
- Intent-log append latency (should never spike; alert if it does).
When to scale Postgres
Postgres is the ceiling on single-Mothership throughput. You have hit it when:
- aura_postgres_pool_wait_seconds stays non-zero even with the pool at 128.
- The Postgres instance is CPU-bound.
- Query latency on the identity-resolution path exceeds 50 ms P95.
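The three signals can be folded into a single predicate for a dashboard or runbook. The pool-wait and query-latency thresholds come from the list above; the 90% CPU cutoff for "CPU-bound" is an assumption:

```python
def postgres_is_ceiling(pool_wait_s: float, pool_size: int,
                        pg_cpu_util: float, identity_p95_ms: float) -> bool:
    """True when any of the three scale-Postgres signals fires.
    0.9 CPU utilization as 'CPU-bound' is an assumed threshold."""
    return ((pool_wait_s > 0 and pool_size >= 128)
            or pg_cpu_util >= 0.9
            or identity_p95_ms > 50)
```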
The remediation ladder, in order:
1. Add a read replica and point cache-miss reads at it.
2. Upgrade the Postgres instance size (more vCPU, more RAM).
3. Partition the intent log by repository.
4. Engage Naridon professional services for custom shape analysis.