WAL Architecture

Durability before bandwidth, before features, before cleverness. If the laptop crashes mid-push, nothing is lost.

Live Sync is backed by a Write-Ahead Log (WAL). Every outbound push, every inbound apply, every conflict detection, and every resolution is first appended to an on-disk log before any network action or AST mutation is performed. If the process crashes, the kernel panics, or the Mothership disappears mid-flight, replaying the WAL on the next start reproduces the correct state. This page explains the log's on-disk format, the recovery semantics, the retention policy, and what happens under each class of failure. If you remember one thing: the WAL is the source of truth for Live Sync; the network is an optimization on top of it.

Overview

Real-time systems that promise "don't lose my work" have two choices. They can treat the network as authoritative and hope nothing goes wrong locally — the Google Docs model. Or they can treat local disk as authoritative and treat the network as a replication target. Aura picks the second because source code is not like a Google Doc. If Google Docs drops a character during a crash, you retype it. If Aura drops a function during a crash, you may not notice for a week, and by then you are debugging a phantom bug in production.

The WAL is a simple, append-only, crash-safe log in .aura/live/wal/. It is written with O_APPEND + fsync at configurable points, and it is never rewritten in place. Compaction happens by writing a new segment and atomically renaming. This is the design that database engines have been using since the 1970s, and it is the same design Aura uses because it works.

  .aura/live/wal/
    segment-000042.log   <- active, appending
    segment-000041.log   <- sealed
    segment-000040.log   <- sealed
    index.json           <- segment manifest, watermarks

How It Works

Every event the Live Sync engine cares about becomes a WAL record before it takes effect. Outbound:

    user saves file
        │
    scan detects fn body change
        │
    append WAL record: OUTBOUND_PENDING { aura_id, body, hash }   ◀── fsync here
        │
    send push to Mothership
        │
    on ack:     append WAL record: OUTBOUND_ACKED { record_ref }
    on failure: retry with backoff, still using the WAL record

Inbound:

    Mothership pushes inbound body
        │
    append WAL record: INBOUND_RECEIVED { aura_id, body, hash, author } ◀── fsync here
        │
    check for conflict (base vs local vs remote)
        │
        ├── clean apply:
        │     apply AST change
        │     append WAL record: INBOUND_APPLIED { record_ref }
        │
        └── conflict:
              append WAL record: CONFLICT_RAISED { record_ref }
              stage sidecar; wait for user

Nothing happens to your working copy until the receive is durable on disk. Nothing leaves the machine until the intent is durable on disk. This eliminates an entire class of "it almost pushed" bugs.
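The ordering invariant — durable append first, side effect second — can be sketched in a few lines of Rust. `wal_append_then` is a hypothetical helper for illustration, not the engine's actual API:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append a record durably, then run the side effect. The caller's action
/// (network send, AST mutation) only happens after the fsync returns.
fn wal_append_then<F: FnOnce()>(wal_path: &Path, record: &[u8], act: F) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(wal_path)?;
    f.write_all(record)?;
    f.sync_all()?; // fsync: the record is durable before anything else happens
    act();         // only now: send the push / mutate the working copy
    Ok(())
}
```

If the process dies between the fsync and `act()`, recovery sees the record with no terminal follow-up and redoes the action — which is exactly the replay semantics described below.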

Record format

Records are a tagged binary format. Simplified in Rust:

enum WalRecord {
    OutboundPending { id: u64, aura_id: [u8;32], hash: [u8;32], body_zstd: Vec<u8>, ts_ms: u64 },
    OutboundAcked   { id: u64, pending_ref: u64, ts_ms: u64 },
    OutboundFailed  { id: u64, pending_ref: u64, reason: String, ts_ms: u64 },
    InboundReceived { id: u64, aura_id: [u8;32], hash: [u8;32], body_zstd: Vec<u8>, author: PeerId, ts_ms: u64 },
    InboundApplied  { id: u64, recv_ref: u64, ts_ms: u64 },
    ConflictRaised  { id: u64, recv_ref: u64, local_hash: [u8;32], ts_ms: u64 },
    ConflictResolved{ id: u64, conflict_ref: u64, resolution: Resolution, ts_ms: u64 },
    Checkpoint      { id: u64, segment_watermarks: Vec<(u32,u64)>, ts_ms: u64 },
}

Each record has a monotonic id. Later records can reference earlier ones by id, which makes the log self-describing without requiring a separate state snapshot.
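A minimal sketch of length-prefixed framing with a monotonic id, assuming a simplified layout (`[len: u32 LE][id: u64 LE][payload]` — the real tagged binary format is richer than this):

```rust
/// Encode one record: 4-byte length prefix, 8-byte monotonic id, raw payload.
/// Illustrative layout only, not the engine's actual wire format.
fn encode_record(id: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + payload.len());
    out.extend_from_slice(&((8 + payload.len()) as u32).to_le_bytes());
    out.extend_from_slice(&id.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decode one record, returning (id, payload, bytes_consumed).
/// Returns None on a truncated tail — the "torn write" case recovery stops at.
fn decode_record(buf: &[u8]) -> Option<(u64, &[u8], usize)> {
    if buf.len() < 4 { return None; }
    let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    if len < 8 || buf.len() < 4 + len { return None; }
    let id = u64::from_le_bytes(buf[4..12].try_into().unwrap());
    Some((id, &buf[12..4 + len], 4 + len))
}
```

A decoder that returns `None` on a short read is what lets recovery salvage the intact prefix of a segment and discard a half-written tail.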

Crash safety

The interesting cases are the ones where something fails mid-step.

Crash after append, before network send

On start, the recovery pass scans from the last Checkpoint. It finds OutboundPending with no matching OutboundAcked or OutboundFailed. Those pushes are re-sent. The Mothership idempotently dedupes by (aura_id, hash), so a re-send after a successful-but-unacknowledged prior send produces no duplicate.
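The scan can be sketched like this, with a simplified `Rec` enum standing in for the real record types:

```rust
use std::collections::HashSet;

/// Simplified record view for the recovery scan (hypothetical shape).
enum Rec {
    OutboundPending { id: u64 },
    OutboundAcked { pending_ref: u64 },
    OutboundFailed { pending_ref: u64 },
}

/// Replay from the last checkpoint: any pending record with no terminal
/// follow-up (acked or failed) must be re-sent.
fn unresolved_pending(log: &[Rec]) -> Vec<u64> {
    let mut pending: Vec<u64> = Vec::new();
    let mut resolved: HashSet<u64> = HashSet::new();
    for r in log {
        match r {
            Rec::OutboundPending { id } => pending.push(*id),
            Rec::OutboundAcked { pending_ref } | Rec::OutboundFailed { pending_ref } => {
                resolved.insert(*pending_ref);
            }
        }
    }
    pending.into_iter().filter(|id| !resolved.contains(id)).collect()
}
```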

Crash after receive, before apply

Recovery finds InboundReceived with no matching InboundApplied or ConflictRaised. It re-runs the apply logic. Because the apply is deterministic (base hash + incoming hash + current working copy hash), the outcome is the same as if the crash had not happened.

Crash during apply

AST mutations are file-atomic: Aura writes the new file to a temp path, fsyncs, then renames. If the crash is before the rename, the old file is intact. If after, the new file is intact. The InboundApplied record is the tie-breaker — on recovery, if the record is present, the apply is considered done; if not, it is re-run.
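The temp-write, fsync, rename sequence is a few lines with std only — a sketch; a production version would also fsync the parent directory on Unix so the rename itself is durable:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// File-atomic write: temp file beside the target, fsync, then rename.
/// A crash before the rename leaves the old file intact; after it, the new one.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?;           // contents durable before the rename
    fs::rename(&tmp, path)?; // atomic replacement on POSIX filesystems
    Ok(())
}
```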

Mothership disappears

The local WAL keeps growing with OutboundPending records that never get acked. Retries use exponential backoff. On reconnect, the backlog drains. During the outage, Live Sync degrades cleanly: local edits still write to the WAL, and the user sees mothership: offline in status. Nothing is lost.
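Capped exponential backoff can be sketched as below; the base and cap values here are illustrative, not Aura's actual defaults:

```rust
/// Retry delay for the nth attempt: doubles each time, capped.
/// BASE_MS and CAP_MS are assumed values for the sketch.
fn backoff_ms(attempt: u32) -> u64 {
    const BASE_MS: u64 = 500;
    const CAP_MS: u64 = 60_000;
    BASE_MS.saturating_mul(1u64 << attempt.min(20)).min(CAP_MS)
}
```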

Disk full

The OUTBOUND_PENDING append fails. Live Sync returns an error to the save pipeline — but importantly, your edit to the working copy is still on disk through the normal file save path; only the sync record failed. On next tick, the scan will re-detect the dirty function and try again. No work is lost; only the push is delayed.

Retention

The WAL is not infinite. Default retention:

[live.wal]
# Keep at least this much, regardless of age.
min_segments = 8

# Keep sealed segments for at least this long.
min_age_days = 14

# Never exceed this total size on disk.
max_total_mb = 512

Compaction runs periodically. A compaction pass:

  1. Finds the oldest sealed segment whose records are all either *Acked or *Applied — i.e. nothing still in flight.
  2. Writes a summary record (current per-function hashes) to the active segment.
  3. Removes the old segment.

After compaction, the log still has enough history to answer "what was the last hash we agreed on for function X" without growing forever.
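The "nothing still in flight" check from step 1 can be sketched as a simple set test; `in_flight_ids` would come from the same scan recovery performs:

```rust
/// A sealed segment is compactable only when every record in it is terminal —
/// nothing in it is still awaiting an ack or an apply.
fn segment_compactable(segment_record_ids: &[u64], in_flight_ids: &[u64]) -> bool {
    segment_record_ids.iter().all(|id| !in_flight_ids.contains(id))
}
```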

Inspection

aura live wal status
aura live wal tail --n 50
aura live wal show <record_id>
aura live wal verify          # check segment checksums

Sample output:

active segment:   segment-000042.log  (12 MB, 4,201 records)
sealed segments:  7
last checkpoint:  42s ago
oldest pending:   0 outbound, 0 inbound
integrity:        OK (all segment checksums match)

Gotcha: Do not edit files in .aura/live/wal/ by hand. The checksums will not re-sign themselves. If you think the WAL is corrupt, run aura live wal verify and, if it reports damage, aura live wal recover, which salvages the intact prefix and discards the rest.

Data retention and privacy

The WAL contains function bodies in compressed form. That includes any secrets those bodies contain if you have not enabled masking — see Live Sync privacy. The WAL lives under .aura/live/wal/ which is in the default .gitignore, so it will not leak into your Git history, but it is on your disk until compaction.

To purge older history aggressively:

aura live wal compact --max-age-days 3

Config

[live.wal]
dir = ".aura/live/wal"
fsync = "every_record"       # or "every_tick", "every_5s"
segment_size_mb = 16
min_segments = 8
min_age_days = 14
max_total_mb = 512

fsync = "every_record" is the safest and the default. The cost is one fsync per record — microseconds on NVMe, milliseconds on spinning rust. For laptops on battery, every_tick is a reasonable compromise.
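The three modes map onto a small decision function; a sketch, with the enum and parameter names assumed:

```rust
use std::time::Duration;

/// Fsync policy matching the three config values above (names assumed).
enum FsyncPolicy {
    EveryRecord,
    EveryTick,
    Every5s,
}

/// Decide whether this append must be followed by an fsync.
/// `tick_boundary` and `since_last_sync` are supplied by the caller.
fn must_fsync(policy: &FsyncPolicy, tick_boundary: bool, since_last_sync: Duration) -> bool {
    match policy {
        FsyncPolicy::EveryRecord => true,
        FsyncPolicy::EveryTick => tick_boundary,
        FsyncPolicy::Every5s => since_last_sync >= Duration::from_secs(5),
    }
}
```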

Troubleshooting

WAL grew past max_total_mb. Compaction is falling behind, usually because there are long-pending outbound records (Mothership unreachable). Fix connectivity, let retries drain, and compaction will catch up.

Recovery on start is slow. The log tail since the last checkpoint is large. Lower checkpoint_every_records or checkpoint_every_ms so checkpoints are written more often and the replay window stays short.

aura live wal verify reports a bad checksum. Something wrote to the segment outside Aura. Salvage with aura live wal recover and investigate for disk problems.

Checkpoints

Every checkpoint_every_records records (default 1000) or checkpoint_every_ms milliseconds (default 60000), Aura writes a Checkpoint record that summarizes the current per-function hash state. On recovery, the scan starts from the last checkpoint instead of segment zero. This keeps recovery time bounded.
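The dual threshold can be sketched as a single predicate, with constants mirroring the defaults quoted above:

```rust
/// Fire a checkpoint when either threshold is crossed, whichever comes first.
fn should_checkpoint(records_since: u64, ms_since: u64) -> bool {
    const EVERY_RECORDS: u64 = 1000; // default checkpoint_every_records
    const EVERY_MS: u64 = 60_000;    // default checkpoint_every_ms
    records_since >= EVERY_RECORDS || ms_since >= EVERY_MS
}
```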

A checkpoint record is small — a map of aura_id -> current_hash with segment watermarks. On a 100k-function repo with 30% warm (recently edited), a checkpoint is about 1 MB. It is written to the active segment, so it compacts away with the rest.

Why not use SQLite?

A reasonable question: Aura already uses SQLite for some of its state; why a custom WAL?

Three reasons:

  1. Fsync behavior. SQLite's journal mode and our needs overlap only partially. We want append-only, no in-place updates, and a single fsync per record. Emulating that in SQLite means fighting its engine.
  2. Recovery model. Our recovery is "replay from last checkpoint." SQLite's WAL is a different concept — it is not meant to be read back by the application.
  3. Auditability. The Live Sync WAL is a user-facing artifact. Users should be able to do the equivalent of tail -f on it. A SQLite database is opaque to most tools.

Plus, the WAL is small, append-only, and easy to reason about. It is ~1200 lines of Rust. Worth the carry.

Concurrent writers

Only one process writes to the WAL at a time. This is enforced by an exclusive file lock on .aura/live/wal/lock. If two aura processes try to start Live Sync concurrently, the second exits with a clear error. Read-only operations (aura live wal status, aura live wal tail) use a shared lock and coexist with the writer.

If a process dies holding the lock, the lock is OS-released. On next start, Aura runs recovery as normal.

See Also