# Mothership Troubleshooting _Port conflicts, firewall black holes, TLS failures, JWT expiry. What breaks in the wild and how to fix it._ ## Overview Most Mothership problems fall into four buckets: the port, the network path, the certificate, or the token. This page is organized by symptom — what you see when something is wrong — and walks through diagnosis and fix. It is not exhaustive. It covers the issues we have seen most often across production deployments. Before filing anything upstream, always run: ```bash aura mothership status aura doctor ``` `aura doctor` performs a battery of self-checks and prints a diagnosis. A lot of the problems below it will catch automatically. ## Symptom: "address already in use" on start Mothership failed to bind because something else is holding the port. ```bash aura mothership start ``` ```text error: failed to bind 0.0.0.0:7777 caused by: address already in use ``` ### Diagnosis ```bash sudo lsof -i :7777 # or sudo ss -tlnp | grep 7777 ``` You will usually see either a previous Mothership process that didn't clean up, or an unrelated service that grabbed the port. ### Fix If it is a stale Mothership: ```bash aura mothership stop # or, if the CLI can't find it sudo kill $(pgrep -f 'aura mothership') ``` If it is a different service, pick a different port: ```bash aura mothership start --port 8443 ``` If you are on Linux and the port is below 1024, you need either root or the `CAP_NET_BIND_SERVICE` capability on the binary: ```bash sudo setcap cap_net_bind_service=+ep $(which aura) ``` ## Symptom: peers can't reach the Mothership `aura mothership ping` from a peer times out or refuses. ```text ping mothership.acme.internal:7777 tcp connect: timeout ``` ### Diagnosis The three causes, in order of likelihood: 1. **Firewall blocking.** The Mothership is listening but packets never arrive. 2. **Bind address is wrong.** Mothership bound to `127.0.0.1` when it should have bound to `0.0.0.0` or a specific LAN IP. 3. **DNS resolving to the wrong address.** Peer reaches _an_ IP, but not the Mothership's. Check from the Mothership host: ```bash aura mothership status ``` Look at `listening`. If it says `127.0.0.1:7777`, that's your problem. ```bash sudo ss -tlnp | grep aura ``` If that shows `LISTEN 0 128 0.0.0.0:7777`, Mothership is listening on all interfaces. The problem is network-side. From a peer, try a basic connectivity check: ```bash nc -zv mothership.acme.internal 7777 ``` `refused` means something is responding. `timeout` means packets are being black-holed, usually a firewall. ### Fix **Firewall** (common culprits): `ufw` on Ubuntu, `firewalld` on RHEL/Fedora, `iptables` rules, cloud security groups. ```bash # Ubuntu sudo ufw allow 7777/tcp # RHEL/Fedora sudo firewall-cmd --permanent --add-port=7777/tcp sudo firewall-cmd --reload # AWS security group: add inbound TCP 7777 from the CIDR of your team ``` **Wrong bind**: fix the config file or `--bind` flag. See [starting mothership](/starting-mothership). **DNS**: run `dig mothership.acme.internal` from the peer. Verify it returns the expected IP. If you're using internal DNS + VPN, make sure DNS is routed through the VPN. ## Symptom: "tls handshake failed" Peer connects, TCP works, TLS fails. ```text error: tls handshake failed caused by: certificate verify failed: unable to get local issuer certificate ``` ### Diagnosis This almost always means fingerprint pinning detected a mismatch. Either: 1. The Mothership's cert was rotated and the peer still has the old fingerprint from its join token. 2. Something in the network is terminating TLS (corporate proxy, inspection appliance). 3. The peer is talking to the wrong host. Check the fingerprint on the Mothership: ```bash aura mothership tls info ``` Compare to the fingerprint the peer expects. On the peer, look at the stored Mothership cert: ```bash aura mothership peer-config --show-pinned ``` If they differ, that's the issue. ### Fix If the Mothership cert was rotated intentionally, the peer needs a new join token (which will pin the new fingerprint) or an explicit re-pin: ```bash aura mothership repin --url https://mothership.acme.internal:7777 ``` `repin` requires the user to visually confirm the new fingerprint, which they should have received out of band. If something is TLS-terminating in the middle (common in heavily managed corporate networks): there is no good workaround. The whole point of fingerprint pinning is to reject this. Get your team's traffic onto a segment that does not inspect TLS, or use a mesh VPN like Tailscale that tunnels below the inspection layer. See [TLS and JWT](/tls-and-jwt) for the underlying model. ## Symptom: "certificate has expired" Self-signed certs are valid for 365 days. If nothing rotated them, they expire. ```text error: tls: certificate has expired ``` ### Fix On the Mothership host: ```bash aura mothership tls rotate ``` This generates a new certificate. Existing peers within the grace window (default 24h) accept both old and new fingerprints. Peers who joined long ago may need to re-pin: ```bash aura mothership repin --url https://mothership.acme.internal:7777 ``` For production, prefer Let's Encrypt or an internal CA with automated renewal. See [TLS and JWT](/tls-and-jwt). > **Security callout.** A Mothership with an expired cert is a monitoring failure. Alert on `aura_tls_cert_not_after_seconds < 30 * 86400` (30 days out) so you rotate before the outage. ## Symptom: "token expired" on join Self-explanatory. The JWT's `exp` claim has passed. ### Fix Issue a new token: ```bash aura mothership token issue --for alice@acme.com ``` Default expiry is 24 hours. Tell your team to run `join` within that window, or issue longer-lived tokens for offshore time zones. ## Symptom: "token already used" Most join tokens are one-shot. A reused token is rejected. ### Fix Issue a fresh token. If you genuinely need a multi-use token (bulk provisioning), issue one explicitly: ```bash aura mothership token issue --for team-onboarding --uses 20 --expires-in 72h ``` Track it carefully. Multi-use tokens are a larger blast radius. See [join token security](/join-token-security). ## Symptom: "invalid signature" The token was signed by a different Mothership, or the Mothership's JWT key was rotated. ### Diagnosis ```bash aura mothership token decode ``` Check the `iss` claim — it identifies the issuing Mothership. If it doesn't match the host you're trying to join, the user has the wrong token. If `iss` matches but verification fails, the Mothership's JWT key was rotated after this token was issued. Check: ```bash aura mothership audit --filter key_rotation --since 7d ``` ### Fix Re-issue from the current Mothership: ```bash aura mothership token issue --for alice@acme.com ``` ## Symptom: "clock skew too large" JWT `exp` / `iat` validation is strict. Clocks more than 60 seconds apart fail. ```text error: token not yet valid (iat is in the future) ``` ### Diagnosis ```bash # on peer date -u # on mothership date -u ``` Difference greater than a minute? NTP is broken somewhere. ### Fix ```bash # Linux sudo timedatectl set-ntp true sudo systemctl restart systemd-timesyncd # macOS sudo sntp -sS time.apple.com ``` Wait a minute, try again. If the problem persists, your NTP server is unreachable — check firewall rules for UDP 123. ## Symptom: peers see each other as offline despite being online Control connection is up but peer-to-peer isn't discovering. ### Diagnosis On a peer: ```bash aura mothership peers --detail ``` ```text peer_id subject status last_seen direct_connects peer_0a1b2c3d alice@acme online 3s 3/3 peer_1b2c3d4e bob@acme online 8s 0/3 ``` `0/3` direct connections to Bob means every attempt to reach Bob directly failed. Three direct connect attempts = three network hints. All failing usually means Bob's announced addresses aren't reachable. ### Fix This is almost always a NAT or VPN issue. Two paths forward: 1. **Put everyone on a mesh VPN** like Tailscale. Announced addresses become reachable. See [p2p architecture](/p2p-architecture). 2. **Live with relay.** Direct connections are an optimization. Mothership will fall back to relaying through itself; nothing breaks, it just uses more Mothership bandwidth. If you use Tailscale but direct is still failing, check that the Tailscale IP is being announced: ```bash aura mothership peer-config --show-announced ``` You should see a `100.x.x.x` address. If not, restart `aura sync` after Tailscale is up. ## Symptom: WAL growing without bound The WAL keeps getting bigger even when online. ### Diagnosis ```bash aura wal status ``` ```text WAL: unacked events: 15234 ``` Thousands of unacked events while online means events are being written faster than the Mothership is accepting them, or the Mothership is silently failing to ack. On the Mothership: ```bash aura mothership status ``` Check `sync backlog`. A growing backlog means the Mothership is overloaded. See [scaling](/scaling-mothership). ### Fix - If the Mothership is CPU-bound: raise `sync_workers` (see [scaling](/scaling-mothership)). - If the Mothership is network-bound: look at `iftop` / `nload` on the host. - If the issue is one noisy peer: find it with `aura mothership peers --detail | sort -k events_per_min`, and coordinate with them to reduce push rate. ## Symptom: "peer revoked" on reconnect Peer was revoked by an admin. Expected behavior. ### Fix If revocation was intentional: nothing to do. If it was accidental: a Mothership admin can un-revoke: ```bash aura mothership peer unrevoke peer_0a1b2c3d ``` And the peer will reconnect on its next retry. ## Symptom: "protocol version mismatch" Peer binary and Mothership binary are from incompatible Aura versions. ```text error: protocol version mismatch (peer: v1, mothership: v2) ``` ### Fix Update the older of the two. Mothership protocol is stable within major versions; a version bump of Mothership may require peers to upgrade to a compatible point release: ```bash # on peer aura self-update ``` Coordinate Mothership upgrades with a heads-up to the team: "Rolling Mothership to v0.13.0 at 4pm; you may see a brief disconnect and a prompt to `aura self-update`." ## Symptom: health endpoint returns 503 Mothership is up but `/healthz` returns `503`. ### Diagnosis ```bash curl -k https://mship:7777/healthz ``` ```text {"status":"degraded","reason":"wal_behind"} ``` Common reasons: - `wal_behind`: disk I/O can't keep up with write rate. Get faster disks. - `tls_cert_expiring_soon`: cert expires within 7 days. Rotate. - `federation_partner_unreachable`: a federated Mothership is down. Reach out. ### Fix Each reason has a different fix. `aura doctor` on the host expands the diagnosis. ## Symptom: "aura doctor" shows a warning you don't understand `aura doctor` outputs compact diagnostic codes. The full explanation for each code is in the output: ```bash aura doctor --explain DR023 ``` Covers remediation steps, when the check was added, and links to the relevant documentation. ## When to Open an Issue After you have: - Read this page and [aura doctor](/aura-doctor) output. - Checked Mothership logs for the specific error. - Verified your version is current. - Confirmed the issue is not network- or clock-related. File an issue on the Aura repository with: - Your Aura version (`aura --version`). - The exact error text. - Redacted output from `aura mothership status` and `aura doctor`. - A rough sketch of your topology (number of peers, Motherships, network). Do not include raw tokens, private keys, or TLS material in issue reports. ## Next Steps - [Understand TLS and JWT in depth](/tls-and-jwt) - [Tune for scale](/scaling-mothership) - [Run a persistent, monitored Mothership](/persistent-daemon)