# Scaling Mothership

_From a 5-person startup to a 500-person engineering org. What breaks first and how to tune around it._

## Overview

Mothership is designed to scale vertically on a single box, not horizontally across a cluster. This is a deliberate choice. A single well-provisioned Mothership host comfortably handles hundreds of concurrent peers. When you outgrow that — or when geography forces your hand — you run multiple Motherships in a [mesh topology](/team-topology#mesh) rather than sharding one logical Mothership across machines.

This page covers hardware sizing at three reference points (5, 50, 500 peers), the specific knobs that matter, and how to tell when you are about to hit a limit.

## Reference Sizes

### 5 peers: any machine

At this scale you have essentially no problem. A Raspberry Pi 4 (4 GB RAM), a spare laptop, or the cheapest cloud VM all work.

- **CPU**: 1 core is enough.
- **RAM**: 1 GB resident, 2 GB working.
- **Disk**: 10 GB SSD. The WAL and semantic index together will be well under 1 GB.
- **Network**: any residential uplink. Traffic is a few KB/s steady-state.

Expected behavior: almost everything is sub-millisecond. You will not notice Mothership is running.

### 50 peers: a small VM

This is the comfortable middle. Most teams live here.

- **CPU**: 2 cores.
- **RAM**: 4 GB.
- **Disk**: 50 GB SSD. Budget for a year of history and the WAL.
- **Network**: 100 Mbps symmetric is more than enough.

Mothership's memory footprint at 50 connected peers is around 500 MB resident plus whatever the semantic index needs. CPU sits in single-digit percent most of the time, spiking when large batches of pushes arrive.

Typical deployments: a DigitalOcean $24/month droplet, a Hetzner CX22, a dedicated Mac mini, a small corporate VM.

### 500 peers: a real server

This is the upper end we have tested. Above this, move to a mesh.

- **CPU**: 8 cores.
- **RAM**: 16 GB.
- **Disk**: 200 GB NVMe SSD. Spinning disks will bottleneck the WAL.
- **Network**: 1 Gbps.
  Not usually saturated, but you want headroom for fanout bursts.

At 500 peers, the Mothership manages roughly 500 persistent TLS connections, processes 50–200 events per second steady-state, and fans each event out to all connected subscribers.

Operationally, at this scale you want:

- **No** load balancer or frontend proxy. Mothership handles the connections itself.
- Real monitoring: Prometheus/Datadog on `/metrics` (see [persistent daemon](/persistent-daemon)).
- Tuned OS limits (`ulimit -n 65536`, kernel TCP buffers).
- Nightly backups of the Mothership's data directory.

## What Scales How

Not every dimension scales linearly with peer count.

| Resource | Scales with | Notes |
|---|---|---|
| Active TCP connections | Peer count | One control connection per peer. |
| CPU (steady state) | Event rate | Event rate is a function of team activity, not team size. |
| CPU (fanout bursts) | Peers × event rate | Each event must fan out to every subscribed peer. |
| RAM | Peer count × session state + index size | Session state per peer is ~1 MB. Index grows with codebase, not team. |
| Disk | History retention | WAL grows at the event rate. Old segments compact. |
| Network out | Peers × event size | Dominant when big function bodies get pushed. |

The practical upshot: a quiet team of 500 is cheaper to host than a loud team of 50. Monitor event rate, not just peer count.

## Thread Pool Tuning

Mothership uses a handful of thread pools for different classes of work. Defaults are conservative; at scale you will want to raise them.
```toml
[threads]
# accept and multiplex incoming connections
io_workers = 4
# apply WAL events, compute fanout
sync_workers = 8
# respond to HTTP health/metrics
http_workers = 2
# background tasks: compaction, key rotation, audit flush
maintenance_workers = 2
```

Reasonable defaults at each reference size:

| Peers | io | sync | http | maintenance |
|---|---|---|---|---|
| 5 | 2 | 2 | 1 | 1 |
| 50 | 4 | 4 | 2 | 2 |
| 500 | 8 | 16 | 2 | 4 |

If `aura mothership status` shows a sustained `sync backlog > 0`, raise `sync_workers`. If CPU is pinned and the backlog is still growing, you're past this Mothership's capacity and it's time to federate.

## Cache Sizing

The Mothership holds several caches. The semantic index is the most memory-hungry.

```toml
[cache]
# Hot function bodies kept in memory
function_body_cache_mb = 512
# Recently accessed intent log entries
intent_cache_entries = 50000
# Peer session metadata
peer_session_cache = 2000
# TLS session tickets
tls_ticket_cache = 4000
```

At 500 peers on a 16 GB host, allocating 4 GB to `function_body_cache_mb` is reasonable. The cache is pure performance — a miss falls back to disk — so bigger is better until you crowd out the OS page cache.

Rule of thumb: leave at least 2 GB for the OS page cache and kernel buffers. Beyond that, give the rest to Mothership.

## WAL Tuning

The WAL is the hottest write path. Tuning:

```toml
[wal]
# Size of each segment file before rolling
segment_size = "64MB"
# fsync policy
fsync = "batch"  # or "always", "never"
fsync_interval_ms = 10
# Retention
max_age = "180d"
max_size = "20GB"
```

**`fsync` modes**:

- `always`: every write is fsynced before acknowledging. Safest. Slowest. Use for small teams on good SSDs.
- `batch`: fsync every `fsync_interval_ms`. Default. Good balance.
- `never`: rely on OS flush. Fastest. The last 10–30 seconds of events are vulnerable to an OS crash. Only use this if you have a strong UPS and a replicated Mothership.

Our default is `batch` at 10 ms.
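The interval arithmetic behind `batch` mode is worth making concrete. A minimal sketch, assuming the 50–200 events/s steady-state figure from the 500-peer sizing above; the helper function is illustrative, not part of Mothership:

```python
# Back-of-the-envelope for batch fsync: at a fixed interval, fsyncs per
# second are capped regardless of event rate, and each fsync covers
# whatever events arrived during the interval.
def batch_fsync_profile(events_per_sec: float, interval_ms: float):
    fsyncs_per_sec = 1000.0 / interval_ms
    events_per_fsync = events_per_sec / fsyncs_per_sec
    # Worst-case window of unflushed events on an OS crash is roughly
    # one interval (an assumption, consistent with the mode list above)
    loss_window_ms = interval_ms
    return fsyncs_per_sec, events_per_fsync, loss_window_ms

# Default 10 ms interval at the 500-peer high end of 200 events/s:
fsyncs, batch, window = batch_fsync_profile(200, 10)
print(fsyncs, batch, window)  # 100.0 2.0 10
```

Note that raising `fsync_interval_ms` lowers disk pressure linearly but widens the crash-loss window by the same factor.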
At a 10 ms interval, `batch` mode produces about 100 fsyncs per second, well within the performance envelope of any modern NVMe drive.

## OS-Level Limits

On Linux, several defaults are too conservative for a 500-peer Mothership.

### File descriptors

Each peer connection consumes at least one FD, plus the WAL segments, log files, and metrics scrape connections. 65536 is a safe number:

```text
# /etc/security/limits.d/aura.conf
aura soft nofile 65536
aura hard nofile 65536
```

Or in the systemd unit:

```text
LimitNOFILE=65536
```

### TCP buffers and backlog

```text
# /etc/sysctl.d/99-aura.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
```

### Ephemeral port range

If the Mothership initiates outbound connections (federation), raise the ephemeral range:

```text
net.ipv4.ip_local_port_range = 10000 65000
```

Apply with `sudo sysctl -p /etc/sysctl.d/99-aura.conf`.

## Detecting Capacity Problems

Signs you are approaching the ceiling:

**Sync backlog growing.** Check `aura mothership status`:

```text
sync backlog: 1247 events
```

A non-zero number that keeps growing means sync workers can't keep up. Tune `sync_workers` or federate.

**Fanout latency rising.** Metric: `aura_fanout_latency_seconds`. If the p95 is above 500 ms, peers are experiencing delay between a teammate's push and their own impact alert.

**Connection reset rate spiking.** Metric: `aura_peer_disconnects_total`. If this is rising, your OS is likely dropping connections because of FD exhaustion or backlog overflow.

**Memory growth.** Watch `RES` in `top`. If steady-state RAM keeps climbing without a plateau, a cache may be growing unboundedly — set explicit caps in `[cache]` so it at least plateaus.

## When to Federate

If any of the following is true, set up a second Mothership:

- You are consistently above 70% CPU on an 8-core host.
- Fanout p95 is above 500 ms and raising thread pools doesn't help.
- You have a meaningful population of developers more than 100 ms round-trip from the Mothership.
- You need higher availability than a single host can provide.

Federation is covered in [team topology](/team-topology). A rough rule: **every region or every 300 peers gets its own Mothership.**

## Backup and Disaster Recovery

At scale, the Mothership holds the canonical team state — even though every peer has a replica, the Mothership is the most convenient place to recover from. Back it up.

What to back up:

- `~/.config/aura/mothership/` (keys, config, TLS material)
- The data directory (`/var/lib/aura` by convention, or wherever `[storage] path` points)
- WAL segments (same directory)

Recommended schedule: daily incremental snapshots with 14-day retention, plus a weekly full backup with 90-day retention.

Aura's data directory is safe to snapshot while running — the WAL is append-only, and segment files are either active or sealed. A point-in-time snapshot captures a consistent state.

Recovery from backup:

```bash
sudo systemctl stop aura-mothership
sudo rsync -av backup/aura-mothership/ /var/lib/aura/
sudo systemctl start aura-mothership
```

On start, Mothership replays the WAL, reconciles with peers as they reconnect, and returns to service. Expect a few minutes of reconcile activity.

## Example: Configuring for 300 Peers

Concrete numbers for a team of 300 active developers on one Mothership:

```toml
[mothership]
bind = "0.0.0.0"
port = 7777

[threads]
io_workers = 6
sync_workers = 12
http_workers = 2
maintenance_workers = 3

[cache]
function_body_cache_mb = 2048
intent_cache_entries = 100000
peer_session_cache = 1200

[wal]
segment_size = "128MB"
fsync = "batch"
fsync_interval_ms = 10
max_size = "30GB"

[limits]
max_peers = 500
max_concurrent_pushes = 64
```

On a 4-core, 8 GB VM with NVMe, this configuration runs at roughly 15% CPU and 3.5 GB RAM steady-state, with headroom for bursts. A 3-core, 6 GB allocation would also work, but with less margin.
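The sizing rules scattered through this page can be folded into a quick RAM estimator. A sketch under stated assumptions: ~1 MB of session state per peer (from the scaling table), a ~500 MB resident baseline (borrowed from the 50-peer section), and at least 2 GB reserved for the OS page cache; the index size is whatever your codebase needs, and the helper functions themselves are illustrative, not part of Aura:

```python
# Rough RAM budget for a single Mothership host, in MB.
def ram_budget_mb(peers: int, function_body_cache_mb: int,
                  index_mb: int, base_overhead_mb: int = 500) -> int:
    session_mb = peers * 1  # ~1 MB of session state per connected peer
    return base_overhead_mb + session_mb + function_body_cache_mb + index_mb

def fits(host_ram_gb: int, budget_mb: int, os_reserve_gb: int = 2) -> bool:
    # Leave at least os_reserve_gb for the OS page cache and kernel buffers
    return budget_mb <= (host_ram_gb - os_reserve_gb) * 1024

# The 300-peer example above, assuming a hypothetical 512 MB semantic index:
budget = ram_budget_mb(peers=300, function_body_cache_mb=2048, index_mb=512)
print(budget, fits(8, budget))  # 3360 True
```

The 3360 MB result lines up with the ~3.5 GB steady-state figure quoted for the 300-peer example; treat it as a floor, not a guarantee.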
## Beyond 500

We have not tested teams above 500 peers on a single Mothership. At that scale, federation is not optional; it is the right architecture. Two Motherships of 250 peers each, federated, behave better than one Mothership of 500. Latency inside each region drops, failure domains shrink, and you have a more graceful path to further growth.

If you are seriously considering a single-Mothership deployment above 500 peers, we would like to hear from you — either you have a use case we can learn from, or you should be federating. Most likely the latter.

## Next Steps

- [Federate Motherships in a mesh](/team-topology#mesh)
- [Set up monitoring and health checks](/persistent-daemon)
- [Diagnose specific failure modes](/mothership-troubleshooting)