Performance Tuning
Thread pools, cache sizes, batch sizes, and hardware recommendations for a busy Mothership. How to profile, where to look, and what each knob actually changes.
Overview
Aura's Mothership is, at its core, a streaming system. Peers push function-level updates, the server appends intents, resolves function identities, updates zone state, and fans changes out to other interested peers. At small scale, everything fits comfortably on a single modest instance; at several hundred concurrent peers, the shape of the work — many small writes, fan-out reads, a hot identity cache — starts to reward careful tuning.
This page describes how to read the telemetry, which knob affects which metric, and the hardware shapes we have seen work. The goal is to give a platform team enough of a mental model to tune without guessing.
Default sizing
Out of the box, the Mothership targets a deployment comfortably handling 50 concurrent peers with headroom. Defaults:
```toml
[server]
workers = 8                  # request handler threads
accept_backlog = 1024

[database]
pool_size = 16
statement_timeout_ms = 10000

[sync]
batch_size = 128
flush_interval_ms = 500
max_in_flight = 1024

[cache]
identity_cache_mb = 256
fragment_cache_mb = 1024

[runtime]
tokio_worker_threads = 0     # 0 = auto, one per physical core
```
These are conservative. They work well for teams up to about 75 engineers. Beyond that, expect to tune at least pool_size, identity_cache_mb, and max_in_flight.
Hardware recommendations
The recommendations below come from production deployments we help operate. They are starting points, not hard requirements; your workload shape will pull you in one direction or another.
| Peers (concurrent) | CPU (physical cores) | RAM | Postgres shape | Network |
| --- | --- | --- | --- | --- |
| Up to 50 | 4 | 8 GB | Single instance, 2 vCPU, 4 GB RAM | 1 Gbit/s |
| 50–150 | 8 | 16 GB | Single instance, 4 vCPU, 16 GB RAM | 1 Gbit/s |
| 150–300 | 16 | 32 GB | Primary + read replica, 8 vCPU each | 10 Gbit/s |
| 300–500 | 32 | 64 GB | Primary + 2 read replicas, 16 vCPU each | 10 Gbit/s |
| 500+ | Horizontal scale recommended | | Managed HA Postgres, 32 vCPU+ | 10 Gbit/s |
Concurrent peers is the right axis, not engineer count. An engineer who is pushing intensively counts as one peer; an engineer at a standup counts as less than one. Multiply engineer count by 0.4 as a rule of thumb for "typical concurrent."
Agent fleets shift the math. A fleet of 200 autonomous agents that push every few seconds during active hours behaves like 200 engineers actively pushing — allocate hardware accordingly.
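As a back-of-envelope check, the sizing arithmetic above can be expressed as a small helper. The function name is illustrative; the 0.4 multiplier is the rule of thumb from the text, and actively pushing agents are counted one-for-one per the paragraph above.

```python
def estimate_concurrent_peers(engineers: int, active_agents: int = 0) -> int:
    """Rule of thumb from the text: typical concurrency is about
    0.4x engineer count; actively pushing agents count one-for-one."""
    return round(engineers * 0.4) + active_agents

# 150 engineers plus a 200-agent fleet: 60 + 200 = 260 concurrent peers,
# which falls in the 150-300 row of the hardware table.
```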
The four knobs that matter most
If you tune nothing else, tune these four. They account for the large majority of performance issues we see.
1. Postgres pool size
Every Mothership worker that serves a request checks out a database connection. If the pool is undersized, workers queue waiting for a connection, latency rises, and the aura_postgres_pool_wait_seconds metric becomes non-zero.
Rule of thumb: pool_size = max(16, workers * 2). Beyond a pool size of about 100, you are likely bottlenecked on Postgres itself, not on the pool; scale Postgres first.
```toml
[database]
pool_size = 64
statement_timeout_ms = 15000
```
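The rule of thumb above can be sketched as a function. The cap at 100 is our reading of the guidance that beyond that point Postgres, not the pool, is the bottleneck:

```python
def recommended_pool_size(workers: int) -> int:
    """pool_size = max(16, workers * 2), per the rule of thumb in the text.
    Capped at 100: past that, scale Postgres instead of the pool."""
    return min(max(16, workers * 2), 100)
```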
2. Identity cache size
Function identity resolution is the hottest path in the Mothership. The cache holds recently-resolved identities so that repeated pushes touching the same functions do not re-scan history.
The correct size is "large enough that the working set fits." You can tell by watching aura_identity_cache_hit_ratio: a sustained ratio below 0.95 means the cache is too small. A well-provisioned deployment sees ratios above 0.99.
```toml
[cache]
identity_cache_mb = 2048
```
Above a few gigabytes, returns diminish. At that point a bigger cache will not help; a Postgres read replica will.
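If you collect raw hit and miss counters rather than the ratio itself, the undersizing check is a one-liner. The counter-based signature here is an assumption about your telemetry pipeline; the 0.95 threshold comes from the text:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    # Treat an idle cache (no lookups yet) as fully warm.
    total = hits + misses
    return hits / total if total else 1.0

def cache_undersized(hits: int, misses: int) -> bool:
    # Text's threshold: a sustained ratio below 0.95 means too small.
    return cache_hit_ratio(hits, misses) < 0.95
```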
3. Sync batch size
Peers send function-level updates one at a time, but the Mothership batches them before writing to Postgres and pushing to downstream peers. Small batches mean more round trips and higher per-commit overhead; large batches mean higher tail latency for the last update in a batch.
Defaults are tuned for interactive feel. For agent-heavy workloads where latency matters less than throughput, increase the batch:
```toml
[sync]
batch_size = 512
flush_interval_ms = 250
max_in_flight = 4096
```
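The flush policy these two settings control can be pictured as a toy batcher: flush when batch_size updates accumulate or when flush_interval elapses, whichever comes first. This is an illustrative sketch, not the Mothership's actual implementation:

```python
import time

class Batcher:
    """Toy sketch of size-or-interval batching (illustrative only)."""

    def __init__(self, batch_size=512, flush_interval=0.25, flush_fn=print):
        self.batch_size = batch_size
        self.flush_interval = flush_interval  # seconds
        self.flush_fn = flush_fn
        self.pending = []
        self.last_flush = time.monotonic()

    def push(self, update):
        self.pending.append(update)
        now = time.monotonic()
        # Flush on whichever trigger fires first: size or elapsed time.
        if (len(self.pending) >= self.batch_size
                or now - self.last_flush >= self.flush_interval):
            self.flush(now)

    def flush(self, now=None):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending = []
        self.last_flush = now if now is not None else time.monotonic()
```

Larger batch_size amortizes per-commit overhead; a shorter flush_interval bounds how long the last update in a batch waits.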
4. Tokio worker threads
The async runtime defaults to one worker per physical core. On hyperthreaded systems running exclusively the Mothership, leaving the default is correct. On mixed workloads (Mothership co-resident with other services), cap it:
```toml
[runtime]
tokio_worker_threads = 8
```
Profiling a busy Mothership
The Mothership ships a set of metrics chosen for the questions platform teams actually ask. The ones to alert on:
- aura_request_duration_seconds — P50, P95, P99 of overall request latency. The top-level SLO signal.
- aura_postgres_pool_wait_seconds — any sustained non-zero value points at pool exhaustion.
- aura_identity_cache_hit_ratio — below 0.95 means the cache is undersized or the working set is unusual.
- aura_sync_queue_depth — sustained growth means downstream cannot keep up.
- aura_fragment_store_latency_seconds — object-store P99 above 500 ms typically means throttling.
- aura_intent_log_append_latency_seconds — normally the lowest-latency path; spikes here are alarming.
The built-in tracing endpoint emits OpenTelemetry traces for every request. For one-off investigations, enable verbose tracing on a single workload:
```shell
curl -X POST https://aura.internal.example.com/debug/tracing \
  -H "Authorization: Bearer $ADMIN" \
  -d '{"level":"debug","duration_seconds":60}'
```
The endpoint enables trace sampling at 100% for the specified window, then reverts automatically.
Common bottleneck patterns
Pattern: latency spikes correlated with large merges
Symptom: aura_request_duration_seconds P99 spikes every time a big merge lands. aura_postgres_pool_wait_seconds also spikes.
Diagnosis: the merge touches many functions, blows out the identity cache, causes a burst of cache-miss queries that exhaust the pool.
Fix: (1) Increase identity_cache_mb so the working set after the merge still fits. (2) Increase pool_size by 50%. (3) If the merge is predictable, warm the cache beforehand: aura admin cache warm --branch main.
Pattern: sync queue depth grows during agent storms
Symptom: when an agent fleet starts a parallel refactor, aura_sync_queue_depth grows monotonically. Eventually max_in_flight is reached and peers see backpressure.
Diagnosis: fan-out to downstream peers cannot keep up with push rate.
Fix: raise max_in_flight and batch_size. If the receiving peers are themselves the bottleneck, enable sync.compression = "zstd" to reduce bandwidth.
Pattern: Postgres CPU at 100%
Symptom: Postgres CPU saturates, overall latency rises, the Mothership's own CPU is modest.
Diagnosis: the database is the bottleneck, not the Mothership.
Fix: (1) Check slow query logs. The usual suspect is a zone-lookup query that has stopped using its index after a Postgres version bump; REINDEX fixes it. (2) Add a read replica and point identity-resolution reads at it (database.read_replica_url). (3) For the largest deployments, partition the intent log by repo.
Pattern: fragment store slow
Symptom: aura_fragment_store_latency_seconds P99 crosses 500 ms during working hours.
Diagnosis: object-store throttling. S3's per-prefix throughput caps are the usual culprit on larger deployments.
Fix: (1) Enable prefix sharding: fragment_store.prefix_sharding = 64. This spreads writes across prefixes and evades per-prefix throttles. (2) Increase fragment_cache_mb so hot reads do not touch the object store.
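Prefix sharding can be pictured as hashing each fragment id into one of N key prefixes so that no single object-store prefix absorbs all the write throughput. The key layout below is illustrative; the actual scheme behind fragment_store.prefix_sharding is not documented here:

```python
import hashlib

def sharded_key(fragment_id: str, shards: int = 64) -> str:
    """Deterministically map a fragment to one of `shards` key prefixes.
    Illustrative layout, not the Mothership's real key scheme."""
    digest = hashlib.sha256(fragment_id.encode()).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"fragments/{shard:02x}/{fragment_id}"
```

Because the shard is derived from the fragment id, reads and writes for the same fragment always hit the same prefix, while the fleet's traffic spreads across all 64.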
Pattern: cold start after restart
Symptom: right after a deploy, latencies are elevated for 5–15 minutes, then settle.
Diagnosis: caches are cold.
Fix: warm them explicitly during deploy, or add readiness checks that delay traffic routing until a warm threshold is reached:
```toml
[readiness]
require_cache_warm = true
warm_threshold_ratio = 0.85
```
Horizontal scaling
The Mothership is stateless. Scaling out is straightforward: add replicas behind the load balancer. The only state that needs coordination is the intent-log append sequence, which is serialized through Postgres; adding replicas does not change that property.
A typical horizontal layout for 300+ peers:
- 4 Mothership replicas, 8 physical cores and 32 GB RAM each.
- HA Postgres with one primary and two read replicas.
- Managed object storage with prefix sharding enabled.
- L7 load balancer with sticky sessions on peer identity (helps cache locality).
Sticky sessions are not required for correctness — any replica can serve any peer — but they improve identity-cache hit ratios by keeping a given peer's traffic on a consistent replica.
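One way a balancer can implement this kind of stickiness is rendezvous (highest-random-weight) hashing, which pins each peer to one replica and only remaps a peer when its own replica disappears. A sketch of the technique, not a description of any particular load balancer:

```python
import hashlib

def pick_replica(peer_id: str, replicas: list[str]) -> str:
    """Rendezvous hashing: score every (peer, replica) pair and pick the
    highest. Removing a replica only remaps the peers it was serving."""
    def weight(replica: str) -> bytes:
        return hashlib.sha256(f"{peer_id}:{replica}".encode()).digest()
    return max(replicas, key=weight)
```

This preserves identity-cache locality across replica additions and removals far better than naive `hash(peer) % n`, which reshuffles nearly every peer when n changes.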
Batch workloads and off-hours jobs
Migration imports, bulk re-indexing, and long-running audit exports should not contend with interactive traffic. Three options:
- Rate-limit them: the CLI honors AURA_MAX_QPS for bulk operations.
- Route them to a dedicated replica with --mothership bulk.aura.internal.example.com. A separate replica sized for throughput, not latency, absorbs the load.
- Schedule them for off-peak via your orchestrator.
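A rate limit like AURA_MAX_QPS is typically enforced with a token bucket. This sketch shows the general technique; the CLI's actual algorithm is not documented here:

```python
import time

class TokenBucket:
    """Token-bucket QPS limiter (illustrative sketch)."""

    def __init__(self, max_qps: float):
        self.rate = max_qps        # tokens refilled per second
        self.tokens = max_qps      # start with a full bucket
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bulk client would call try_acquire before each request and sleep briefly when it returns False.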
The monitoring dashboard
Naridon publishes a reference Grafana dashboard covering every metric above, with pre-configured alerts. It is included in the self-hosted deployment reference manifests and available as a standalone JSON at github.com/Naridon-Inc/aura/contrib/grafana.
Panels to pin at the top:
- Request latency P50/P95/P99 (global SLO).
- Postgres pool wait (most common tuning signal).
- Identity cache hit ratio (silent killer when it drifts).
- Sync queue depth (agent-storm canary).
- Fragment store P99 (object-store throttling canary).
- Intent-log append latency (should never spike; alert if it does).
When to scale Postgres
Postgres is the ceiling on single-Mothership throughput. You have hit it when:
- aura_postgres_pool_wait_seconds stays non-zero even with the pool at 128.
- The Postgres instance is CPU-bound.
- Query latency on the identity-resolution path exceeds 50 ms P95.
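The three signals can be folded into a single predicate for a dashboard or runbook. The pool-wait and query-latency thresholds come from the list above; the 90% CPU cutoff for "CPU-bound" is an assumption:

```python
def postgres_is_ceiling(pool_wait_s: float, pool_size: int,
                        pg_cpu_util: float, identity_p95_ms: float) -> bool:
    """True when any of the three scale-Postgres signals fires.
    0.9 CPU utilization as 'CPU-bound' is an assumed threshold."""
    return ((pool_wait_s > 0 and pool_size >= 128)
            or pg_cpu_util >= 0.9
            or identity_p95_ms > 50)
```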
The remediation ladder, in order:
1. Add a read replica and point cache-miss reads at it.
2. Upgrade the Postgres instance size (more vCPU, more RAM).
3. Partition the intent log by repository.
4. Engage Naridon professional services for custom shape analysis.