Backup & Recovery

Mothership snapshots, WAL replay, disaster recovery, and tested restoration time. Built for the day you actually need it.

Overview

Backup strategies are judged on restore, not on capture. Any system can produce a backup file. The question that matters is: when the primary storage is gone, how long does it take to stand a new Mothership back up, with intent log integrity intact, with the fragment store consistent, with developers able to push again?

For a reasonably provisioned deployment, the tested answer is under thirty minutes for a catastrophic-loss scenario, assuming your backups are reachable. This page documents the backup machinery, the recovery procedure, and the drill we recommend running quarterly.

What needs to be backed up

A Mothership deployment has three stateful components. Everything else is stateless and rebuildable from configuration.

  1. PostgreSQL catalog — function identities, zone ownership, intent log metadata, permissions state, team membership.
  2. Fragment store — compressed AST fragments referenced by the catalog.
  3. Configuration — config.toml, permissions.toml, TLS certificate material, KMS key references.

The Mothership binary itself, its container images, and the tree-sitter grammars are reproducible from the release bundle and do not need to be backed up.

Backup strategy

The recommended strategy combines three mechanisms, each with a different recovery-point objective (RPO) and recovery-time objective (RTO):

| Mechanism | RPO | RTO | Purpose |
| --- | --- | --- | --- |
| Postgres continuous WAL archiving | < 1 minute | 15–30 minutes | Catastrophic loss recovery |
| Mothership logical snapshots (daily) | 24 hours | 5–10 minutes | Point-in-time restore for human error |
| Fragment store versioning (S3) | < 1 minute | Negligible | Accidental fragment deletion |

The three work together: WAL archiving provides the tightest RPO, daily logical snapshots provide fast restore for everyday mistakes, and S3 object versioning protects fragments from operator error on the object store.

Postgres WAL archiving

Configure continuous WAL archiving to an object store. Any well-tested tool works — pgBackRest, wal-g, or a managed Postgres service's built-in PITR. For self-hosted Postgres:

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 60

Confirm the archiver is keeping up. Note that the replay-LSN functions apply to standbys and return NULL on a primary; on the archiving primary, check pg_stat_archiver directly:

SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time
FROM pg_stat_archiver;

A growing failed_count, or a last_archived_time that trails the clock by more than a few archive_timeout intervals, is an alert-worthy signal.
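When you do compare WAL positions directly, remember that a pg_lsn value is two hex words, `high/low`, encoding a 64-bit byte offset, so the gap between any two positions (for example, the current write position versus the end of the last archived segment) is plain subtraction. A small sketch of that arithmetic (the LSN values below are made up):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Parse a pg_lsn string 'X/Y' (hex high word / hex low word) to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def wal_gap_bytes(newer: str, older: str) -> int:
    """Bytes of WAL between two positions, e.g. written-but-not-yet-archived."""
    return lsn_to_bytes(newer) - lsn_to_bytes(older)

SEGMENT = 16 * 1024 * 1024  # default WAL segment size

gap = wal_gap_bytes("16/B374D848", "16/B1374848")
print(gap // SEGMENT)  # 2 full segments behind
```

Express alert thresholds in segments rather than raw bytes, so the rule survives a change of workload.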

Mothership logical snapshots

The Mothership can produce a logical snapshot — a single archive that captures the catalog, a manifest of fragment references, and the current configuration. Snapshots are taken while the server is running; they are transactionally consistent and do not block writes.

aura-mothership snapshot create \
  --output s3://aura-backups-prod/snapshots/ \
  --label "nightly-$(date -u +%Y%m%d)"

Schedule this from cron, your orchestrator, or a Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: aura-snapshot-nightly
  namespace: aura
spec:
  schedule: "17 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: ghcr.io/naridon-inc/aura-mothership:0.14.1
              command: ["aura-mothership", "snapshot", "create",
                        "--output", "s3://aura-backups-prod/snapshots/"]

Snapshots are content-addressed; repeated snapshots of unchanged data share storage. A daily snapshot of a typical deployment adds a few hundred megabytes per day.

Fragment store versioning

Turn on object-store versioning and a sensible lifecycle policy.

{
  "Rules": [
    {
      "ID": "aura-fragments-retain",
      "Status": "Enabled",
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}

Versioning costs very little for the Aura fragment shape (most objects are written once and never overwritten) and eliminates an entire category of operator-error incidents.

Verifying backups

A backup you have not restored is not a backup. The snapshot command produces a manifest that the verifier can cross-check against:

aura-mothership snapshot verify \
  --snapshot s3://aura-backups-prod/snapshots/nightly-20260420.tar \
  --fragments s3://aura-fragments-prod

The verifier reports:

snapshot metadata signature: OK
catalog row count: 2,418,012 (expected 2,418,012)
fragment manifest entries: 14,812,339
fragments present: 14,812,339 / 14,812,339
hash-chain verification: valid
result: PASS

Run this weekly in CI. Alert if it ever fails.
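The fragments-present check above reduces to a set comparison between the snapshot's manifest and an object-store listing. A sketch of that logic (hashes and field names are illustrative):

```python
def cross_check(manifest: set[str], store_keys: set[str]) -> dict:
    """Compare a snapshot's fragment manifest against an object-store listing."""
    missing = manifest - store_keys  # referenced by the catalog, absent from the store
    return {
        "entries": len(manifest),
        "present": len(manifest - missing),
        "missing": sorted(missing),
        "result": "PASS" if not missing else "FAIL",
    }

report = cross_check({"a1f3", "9bc0", "77de"}, {"a1f3", "9bc0", "77de", "0e11"})
print(report["result"])  # PASS: extra store objects are fine, missing ones are not
```

The asymmetry is deliberate: orphaned objects in the store waste a little space, but a manifest entry with no object means a restore would come up short.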

Disaster recovery procedure

The following procedure restores a Mothership from catastrophic loss. It assumes backups are reachable and configuration is available in source control.

Step 1: provision replacement infrastructure

Stand up a new Kubernetes namespace, or a new VM, per Self-Hosted Deployment. Postgres can be a fresh instance; the fragment store points at the same bucket (the data is still there; only the catalog was lost).

Step 2: restore Postgres

Restore from the most recent physical base backup, then roll WAL forward to the latest archived segment. Note that WAL can only be replayed on top of a physical base backup (taken with wal-g backup-push); if only the logical snapshot is available, restore its embedded pg_dump instead and accept the snapshot's RPO:

# Option A: physical base backup plus WAL replay (tightest RPO).
wal-g backup-fetch /var/lib/postgresql/data LATEST
echo "restore_command = 'wal-g wal-fetch %f %p'" \
  >> /var/lib/postgresql/data/postgresql.conf
touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start

# Option B: logical restore from the snapshot's included pg_dump.
aura-mothership snapshot extract \
  --snapshot s3://aura-backups-prod/snapshots/nightly-20260420.tar \
  --extract-pg-dump /tmp/aura.dump

pg_restore -d postgres://aura@newdb.internal/aura /tmp/aura.dump

Step 3: reconcile fragment store

The restored catalog references fragments by content hash. If versioning preserved every fragment, this step is a no-op. If a fragment is missing, Aura marks the affected functions as "byte-only, fragment lost" rather than failing — preserving service while flagging the loss:

aura-mothership fragments reconcile --report /tmp/reconcile.json

The report lists any unrecoverable fragments. In practice, with versioning enabled, we have never seen a non-empty report in a real recovery.

Step 4: restart the Mothership

aura-mothership serve --config /etc/aura/config.toml

Confirm health:

curl -fsS https://aura.internal.example.com/healthz

Step 5: verify intent chain

aura audit verify --repo monorepo --full

The chain must verify end to end. If it does not, stop and investigate — a chain break is always an incident, never a routine recovery artifact.

Step 6: notify peers

Once the Mothership is healthy, CLIs will reconnect automatically. Users who were mid-push during the incident retry; their local queues drain. There is no data loss on the CLI side, because the CLI's commit is only considered durable once the Mothership has acknowledged it.
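The no-data-loss property on the CLI side follows from one discipline: a commit stays in the local queue until the Mothership acknowledges it. A sketch of that rule (the queue API here is illustrative, not Aura's actual client code):

```python
from collections import deque

class PushQueue:
    """Commits stay queued locally until the server acknowledges them."""
    def __init__(self):
        self.pending = deque()

    def commit(self, change: str):
        self.pending.append(change)  # recorded locally before any network I/O

    def drain(self, send) -> int:
        """Retry pending commits in order; drop each only on acknowledgement."""
        acked = 0
        while self.pending:
            if not send(self.pending[0]):  # server down: keep it, retry later
                break
            self.pending.popleft()
            acked += 1
        return acked

q = PushQueue()
q.commit("fn:parse_config")
q.commit("fn:render")
q.drain(lambda c: False)  # Mothership down mid-incident: nothing is lost
print(len(q.pending))     # 2
q.drain(lambda c: True)   # after recovery, the queue drains
print(len(q.pending))     # 0
```

Because nothing is dropped without an acknowledgement, a crash anywhere in the window leaves the commit either queued locally or durable on the server, never lost in between.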

Tested recovery time

We run this drill quarterly against production-shaped deployments. Typical results:

| Catalog size | Fragment count | Wall-clock RTO |
| --- | --- | --- |
| 5 GB | 2 M | 6 minutes |
| 30 GB | 12 M | 14 minutes |
| 120 GB | 45 M | 27 minutes |
| 400 GB | 180 M | 58 minutes |

The dominant term is pg_restore. WAL replay, fragment reconciliation, and mothership startup each contribute a few minutes at most.

Point-in-time restore

For recovering from a human error — a mis-run migration, an accidental deletion of a team's zone, a compromised admin account — the procedure is the same, but the WAL replay target is a specific timestamp instead of "latest":

wal-g backup-fetch /var/lib/postgresql/data LATEST

cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-04-20 14:31:00+00'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start

After restore, verify the intent chain, then notify affected teams that any work committed after the target timestamp has been rolled back. In practice, this is rare — the intent log makes it easy to identify and reverse a single bad change without a full PITR.

Backup storage security

Backups are as sensitive as the primary store. Treat them as such:

  • Encrypt at rest with a KMS key separate from the primary fragment key. Compromise of the primary key must not compromise the backups.
  • Restrict access — the IAM principal that runs the snapshot job has write-only access; restore requires a separate, MFA-gated principal.
  • Cross-region replicate backups to a second region, with strict write-once semantics (S3 Object Lock, or equivalent).
  • Air-gap a copy if your threat model includes insider risk at the cloud provider; rotate it monthly.
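The write-only posture for the snapshot principal can be expressed as an IAM policy along these lines (a sketch under AWS IAM assumptions, using this page's bucket name; not a complete hardening guide):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SnapshotWriteOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::aura-backups-prod/snapshots/*"
    },
    {
      "Sid": "NoReadbackNoDelete",
      "Effect": "Deny",
      "Action": ["s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::aura-backups-prod/*"
    }
  ]
}
```

With this shape, a compromised snapshot job can add objects but cannot read or destroy existing backups; restores run under the separate, MFA-gated principal.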

The recovery drill

A drill is a full recovery into a scratch environment, on a scheduled cadence, with the result recorded in the intent log.

aura-mothership recovery-drill \
  --snapshot s3://aura-backups-prod/snapshots/nightly-20260420.tar \
  --target scratch-namespace-4218 \
  --log-result

The drill stands up a temporary Mothership, restores into it, runs chain verification, runs a standard set of synthetic peer operations, and records pass/fail as a signed intent record. Auditors love this artifact because it proves recovery was actually exercised, not merely documented.

Recommended cadence:

  • Weekly: snapshot verify (no restore).
  • Monthly: partial restore into scratch environment.
  • Quarterly: full DR drill with timed RTO, result logged.
  • Annually: cross-region failover exercise if you operate multi-region.

What is not covered by backup

Worth stating explicitly: backups cover Mothership state, not the Git repository itself. Your Git hosting (GitHub, GitLab, self-hosted) has its own backup posture. Aura's shadow branches are reproducible from the Git repository at any point, so loss of shadow-branch state alone does not require restore from backup — it requires a re-import, which is well-trodden (see Migration from Git).

Restore from fragment-store corruption

A subtler failure mode than catalog loss: a fragment-store bug, operator error, or silent storage corruption damages a subset of fragments while leaving the catalog intact. Symptoms are specific — some functions begin serving byte-level diffs where they previously served AST-level — and the fix differs from full disaster recovery.

The reconciler handles it:

aura-mothership fragments reconcile --deep --repair-from s3-version-prior-to 2026-04-20

With versioning enabled on the fragment store, --repair-from pulls prior object versions for fragments that fail integrity check, up to the specified cutoff. For fragments with no prior version available, the tool marks them "recompute on next push" — the next time a peer pushes the affected function, Aura recomputes the fragment from source.
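The repair loop described above amounts to: for each fragment whose current bytes fail the hash check, walk back through prior object versions until one matches the expected content hash; if none does, mark the fragment for recompute. A toy model of that loop (not the reconciler's actual code):

```python
import hashlib

def sha(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def repair(versions: dict[str, list[bytes]]) -> dict[str, str]:
    """Map each fragment's expected content hash to a repair outcome.

    versions[expected] lists the object's stored versions, newest first.
    """
    outcome = {}
    for expected, history in versions.items():
        for candidate in history:           # newest first
            if sha(candidate) == expected:  # first intact copy wins
                outcome[expected] = "intact" if candidate is history[0] else "repaired"
                break
        else:                               # no version passes the integrity check
            outcome[expected] = "recompute on next push"
    return outcome

good = b"fn body"
print(repair({sha(good): [b"\x00corrupted", good]})[sha(good)])  # repaired
```

Because fragments are content-addressed, "intact" is decidable from the object alone; no catalog round-trip is needed to tell a good version from a corrupted one.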

Cross-region failover

Customers running the multi-region federation shape treat a full-region loss as the worst-case DR scenario. The federation replicates function identities and intent records between regions opportunistically, so a surviving region can be promoted to primary for a repository that was previously pinned to the lost region.

The failover procedure:

aura-mothership federation promote \
  --repo monorepo-eu \
  --new-primary us-east \
  --reason "eu-central region lost"

The promoted Mothership begins accepting pushes for the affected repository within a minute. Peers reconfigure automatically if they have a list of endpoints in their CLI config; otherwise, the CLI is pointed at the new primary via a single environment variable.
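Automatic reconfiguration depends on that endpoint list being in place before the incident. It might look like the following in the CLI config (the key names here are illustrative assumptions, not documented Aura configuration; check your CLI version's reference):

```toml
# ~/.config/aura/config.toml -- key names are illustrative
[mothership]
endpoints = [
  "https://aura-eu.internal.example.com",  # pinned primary
  "https://aura-us.internal.example.com",  # failover candidate
]
```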

Failover has data-sovereignty implications worth planning for ahead of time: if a repository was pinned to the EU and is failed over to the US, code that was previously inside the EU boundary is now outside it. Contracts and compliance programs should contemplate this scenario and the customer should have a documented decision about whether to fail over or to accept temporary unavailability. See Data Sovereignty in the EU.

See Also