Scaling Mothership

From a 5-person startup to a 500-person engineering org. What breaks first and how to tune around it.

Overview

Mothership is designed to scale vertically on a single box, not horizontally across a cluster. This is a deliberate choice. A single well-provisioned Mothership host comfortably handles hundreds of concurrent peers. When you outgrow that — or when geography forces your hand — you run multiple Motherships in a mesh topology rather than sharding one logical Mothership across machines.

This page covers hardware sizing at three reference points (5, 50, 500 peers), the specific knobs that matter, and how to tell when you are about to hit a limit.

Reference Sizes

5 peers: any machine

At this scale you have essentially no problem. A Raspberry Pi 4 (4 GB RAM), a spare laptop, or the cheapest cloud VM all work.

  • CPU: 1 core is enough.
  • RAM: 1 GB resident, 2 GB working.
  • Disk: 10 GB SSD. The WAL and semantic index together will be well under 1 GB.
  • Network: any residential uplink. Traffic is a few KB/s steady-state.

Expected behavior: almost everything is sub-millisecond. You will not notice Mothership is running.

50 peers: a small VM

This is the comfortable middle. Most teams live here.

  • CPU: 2 cores.
  • RAM: 4 GB.
  • Disk: 50 GB SSD. Budget for a year of history and the WAL.
  • Network: 100 Mbps symmetric is more than enough.

Mothership's memory footprint at 50 connected peers is around 500 MB resident plus whatever the semantic index needs. CPU sits in single-digit percent most of the time, spiking when large batches of pushes arrive.

Typical deployments: a DigitalOcean $24/month droplet, a Hetzner CX22, a dedicated Mac mini, a small corporate VM.

500 peers: a real server

This is the upper end we have tested. Above this, move to a mesh.

  • CPU: 8 cores.
  • RAM: 16 GB.
  • Disk: 200 GB NVMe SSD. Spinning disks will bottleneck the WAL.
  • Network: 1 Gbps. Not usually saturated, but you want headroom for fanout bursts.

At 500 peers, the Mothership manages roughly 500 persistent TLS connections, processes 50–200 events per second steady-state, and fans each event out to all connected subscribers.
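
The fanout burst math is worth spelling out. A back-of-envelope sketch using the top of the stated event range (numbers from this section, nothing else assumed):

```shell
# Fanout message rate at the 500-peer reference point:
# every event goes to every connected subscriber.
peers=500
events_per_sec=200                      # top of the stated 50-200/s range
msgs_per_sec=$((peers * events_per_sec))
echo "$msgs_per_sec fanout messages/sec"   # prints: 100000 fanout messages/sec
```

That outbound burst rate, not the inbound event rate, is why the 1 Gbps headroom matters.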

Operationally, at this scale you want:

  • No load balancer or frontend proxy. Mothership handles the connections itself.
  • Real monitoring. Prometheus/Datadog on /metrics (see persistent daemon).
  • Tuned OS limits (ulimit -n 65536, kernel TCP buffers).
  • Nightly backups of the Mothership's data directory.

What Scales How

Not every dimension scales linearly with peer count.

| Resource | Scales with | Notes |
|---|---|---|
| Active TCP connections | Peer count | One control connection per peer. |
| CPU (steady state) | Event rate | Event rate is a function of team activity, not team size. |
| CPU (fanout bursts) | Peers × event rate | Each event must fan out to every subscribed peer. |
| RAM | Peer count × session state + index size | Session state per peer is ~1 MB. Index grows with codebase, not team. |
| Disk | History retention | WAL grows at the event rate. Old segments compact. |
| Network out | Peers × event size | Dominant when big function bodies get pushed. |

The practical upshot: a quiet team of 500 is cheaper to host than a loud team of 50. Monitor event rate, not just peer count.
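
The RAM row of the table can be turned into a rough estimate. A sketch for 500 peers; the ~1 MB/peer figure is from the table, but the 1 GB index allowance and the omission of fixed process overhead are assumptions for illustration:

```shell
# Back-of-envelope RAM estimate from the scaling table.
peers=500
session_mb_per_peer=1       # ~1 MB session state per peer (from the table)
index_mb=1024               # assumed semantic index size; yours depends on the codebase
total_mb=$((peers * session_mb_per_peer + index_mb))
echo "estimated resident RAM: ${total_mb} MB"   # prints: estimated resident RAM: 1524 MB
```

Rerun the arithmetic with your own index size; the index term usually dominates on large codebases.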

Thread Pool Tuning

Mothership uses a handful of thread pools for different classes of work. Defaults are conservative; at scale you will want to raise them.

[threads]
# accept and multiplex incoming connections
io_workers = 4

# apply WAL events, compute fanout
sync_workers = 8

# respond to HTTP health/metrics
http_workers = 2

# background tasks: compaction, key rotation, audit flush
maintenance_workers = 2

Reasonable defaults at each reference size:

| Peers | io | sync | http | maintenance |
|---|---|---|---|---|
| 5 | 2 | 2 | 1 | 1 |
| 50 | 4 | 4 | 2 | 2 |
| 500 | 8 | 16 | 2 | 4 |

If aura mothership status shows a sustained sync backlog greater than zero, raise sync_workers. If CPU is pinned and the backlog is still growing, you're past this Mothership's capacity and it's time to federate.
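
The table above can be folded into a small helper for picking sync_workers from peer count. The thresholds between tested reference points (e.g. 300 peers mapping to 12 workers, matching the 300-peer example later on this page) are interpolated assumptions, not tested values:

```shell
# Rough sync_workers suggestion from the reference-size table.
# Intermediate thresholds are interpolations, not tested configurations.
suggest_sync_workers() {
  peers=$1
  if   [ "$peers" -le 5 ];   then echo 2
  elif [ "$peers" -le 50 ];  then echo 4
  elif [ "$peers" -le 300 ]; then echo 12
  else                            echo 16
  fi
}
suggest_sync_workers 300    # prints: 12
```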

Cache Sizing

The Mothership holds several caches. The semantic index is the most memory-hungry.

[cache]
# Hot function bodies kept in memory
function_body_cache_mb = 512

# Recently accessed intent log entries
intent_cache_entries = 50000

# Peer session metadata
peer_session_cache = 2000

# TLS session tickets
tls_ticket_cache = 4000

At 500 peers on a 16 GB host, allocating 4 GB to function_body_cache_mb is reasonable. The cache is pure performance — a miss falls back to disk — so bigger is better until you crowd out the OS page cache.

Rule of thumb: leave at least 2 GB for the OS page cache and kernel buffers. Beyond that, give the rest to Mothership.
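
Applying the rule of thumb to the 16 GB reference host gives an upper bound on the cache budget. The 512 MB fixed-overhead figure below is an assumption for illustration; the per-peer and OS-reserve numbers come from this page:

```shell
# Upper bound on function_body_cache_mb for a 16 GB, 500-peer host.
total_mb=16384
os_reserve_mb=2048                  # page cache + kernel buffers (rule of thumb above)
session_mb=$((500 * 1))             # ~1 MB per peer
base_mb=512                         # assumed fixed process overhead
cache_budget_mb=$((total_mb - os_reserve_mb - session_mb - base_mb))
echo "function_body_cache_mb ceiling: ${cache_budget_mb}"
```

The suggested 4 GB setting sits well under this ceiling; the gap is headroom for the semantic index and fanout bursts.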

WAL Tuning

The WAL is the hottest write path. Tuning:

[wal]
# Size of each segment file before rolling
segment_size = "64MB"

# fsync policy
fsync = "batch"    # or "always", "never"
fsync_interval_ms = 10

# Retention
max_age = "180d"
max_size = "20GB"

fsync modes:

  • always: every write is fsynced before acknowledging. Safest. Slowest. Use for small teams on good SSDs.
  • batch: fsync every fsync_interval_ms. Default. Good balance.
  • never: rely on the OS flush. Fastest. The last 10–30 seconds of events are vulnerable to an OS crash. Only use this with a strong UPS and a replicated Mothership.

Our default is batch at 10ms. On NVMe SSDs this produces about 100 fsyncs per second, which is well within the performance envelope of any modern NVMe.
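
The 100-fsyncs-per-second figure falls directly out of the interval:

```shell
# fsync rate implied by batch mode at a given interval
interval_ms=10
fsyncs_per_sec=$((1000 / interval_ms))
echo "${fsyncs_per_sec} fsyncs/sec"     # prints: 100 fsyncs/sec
```

The same interval bounds the data at risk: under batch, at most the last fsync_interval_ms of acknowledged events can be lost to a power failure.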

OS-Level Limits

On Linux, several defaults are too conservative for a 500-peer Mothership.

File descriptors

Each peer connection consumes at least one FD, plus WAL segments, log files, and metrics scrape connections on top. 65536 is a safe number:

# /etc/security/limits.d/aura.conf
aura soft nofile 65536
aura hard nofile 65536

Or in the systemd unit:

LimitNOFILE=65536
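
Before trusting the raised limit, verify it actually took effect. A quick sketch; run it as the aura user (or in the service's environment) so you see the limit child processes inherit:

```shell
# Sanity-check the effective nofile limit for the current shell and its children.
current=$(ulimit -n)
if [ "$current" -ge 65536 ] 2>/dev/null; then
  echo "nofile limit ok ($current)"
else
  echo "nofile limit too low ($current); check limits.d or LimitNOFILE"
fi
```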

TCP buffers and backlog

# /etc/sysctl.d/99-aura.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Ephemeral port range

If the Mothership initiates outbound connections (federation), raise the ephemeral range:

net.ipv4.ip_local_port_range = 10000 65000

Apply with sudo sysctl -p /etc/sysctl.d/99-aura.conf.

Detecting Capacity Problems

Signs you are approaching the ceiling:

Sync backlog growing. Check aura mothership status:

sync backlog: 1247 events

A non-zero number that keeps growing means sync workers can't keep up. Tune sync_workers or federate.

Fanout latency rising. Metric: aura_fanout_latency_seconds. If the p95 is above 500ms, peers are experiencing delay between their teammate's push and their own impact alert.

Connection reset rate spiking. Metric: aura_peer_disconnects_total. If this is rising, your OS is likely dropping connections because of FD exhaustion or backlog overflow.

Memory growth. Watch RES in top. If steady-state RAM keeps climbing without plateauing, a cache may be growing without bound; set explicit caps in [cache] so usage at least plateaus.
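
The metric names above come from this page, but the exact exposition shape (summary quantile vs histogram buckets) is an assumption; the sample payload below is fabricated for illustration. A sketch of pulling the fanout p95 out of a /metrics scrape:

```shell
# Extract the fanout p95 from a saved /metrics scrape.
# Sample payload is illustrative; adjust the grep to your exporter's output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aura_fanout_latency_seconds{quantile="0.95"} 0.42
aura_peer_disconnects_total 17
EOF
p95=$(grep 'quantile="0.95"' "$sample" | awk '{print $2}')
echo "fanout p95: ${p95}s"          # prints: fanout p95: 0.42s
```

Wire the same extraction into your alerting so a p95 above 0.5 pages someone before peers notice.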

When to Federate

If any of the following is true, set up a second Mothership:

  • You are consistently above 70% CPU on an 8-core host.
  • Fanout p95 is above 500ms and raising thread pool sizes doesn't help.
  • You have a meaningful population of developers more than 100ms round-trip from the Mothership.
  • You need higher availability than a single host can provide.

Federation is covered in team topology. A rough rule: every region or every 300 peers gets its own Mothership.

Backup and Disaster Recovery

At scale, the Mothership holds the canonical team state — even though every peer has a replica, the Mothership is the most convenient place to recover from. Back it up.

What to back up:

  • ~/.config/aura/mothership/ (keys, config, TLS material)
  • Data directory (/var/lib/aura by convention, or wherever [storage] path points)
  • WAL segments (same directory)

Recommended schedule: daily incremental snapshots with 14-day retention, weekly full with 90-day retention. Aura's data directory is safe to snapshot while running — the WAL is append-only and segment files are either active or sealed. A point-in-time snapshot captures a consistent state.
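
A minimal sketch of the daily snapshot plus the 14-day prune. Demo paths via mktemp so it runs anywhere; in production, src is /var/lib/aura (or your [storage] path) and dest is real backup storage, ideally with hard-linked incrementals (e.g. rsync --link-dest):

```shell
# Daily snapshot with 14-day retention (demo paths; swap in real ones).
src=$(mktemp -d); dest=$(mktemp -d)
printf 'sealed segment\n' > "$src/segment-0001.wal"   # stand-in for real data
today=$(date +%F)
cp -a "$src" "$dest/$today"        # safe while running: segments are active or sealed
ln -sfn "$today" "$dest/latest"    # stable pointer for recovery scripts
# prune snapshots older than the 14-day retention window
find "$dest" -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
ls "$dest/$today"
```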

Recovery from backup:

sudo systemctl stop aura-mothership
sudo rsync -av backup/aura-mothership/ /var/lib/aura/
sudo systemctl start aura-mothership

On start, Mothership replays the WAL, reconciles with peers as they reconnect, and returns to service. Expect a few minutes of reconcile activity.

Example: Configuring for 300 Peers

Concrete numbers for a team of 300 active developers on one Mothership:

[mothership]
bind = "0.0.0.0"
port = 7777

[threads]
io_workers = 6
sync_workers = 12
http_workers = 2
maintenance_workers = 3

[cache]
function_body_cache_mb = 2048
intent_cache_entries = 100000
peer_session_cache = 1200

[wal]
segment_size = "128MB"
fsync = "batch"
fsync_interval_ms = 10
max_size = "30GB"

[limits]
max_peers = 500
max_concurrent_pushes = 64

On a 4-core, 8 GB VM with NVMe, this configuration runs at roughly 15% CPU and 3.5 GB RAM steady-state, with headroom for bursts. A 3-core, 6 GB allocation would also work but with less margin.

Beyond 500

We have not tested teams above 500 peers on a single Mothership. At that scale, federation is not optional; it is the right architecture. Two Motherships of 250 peers each, federated, behave better than one Mothership of 500. Latency inside each region drops, failure domains shrink, and you have a more graceful path to further growth.

If you are seriously considering a single-Mothership deployment above 500 peers, we would like to hear from you — either you have a use case we can learn from, or you should be federating. Most likely the latter.

Next Steps