Mothership Troubleshooting
Port conflicts, firewall black holes, TLS failures, JWT expiry. What breaks in the wild and how to fix it.
Overview
Most Mothership problems fall into four buckets: the port, the network path, the certificate, or the token. This page is organized by symptom — what you see when something is wrong — and walks through diagnosis and fix. It is not exhaustive. It covers the issues we have seen most often across production deployments.
Before filing anything upstream, always run:
aura mothership status
aura doctor
aura doctor performs a battery of self-checks and prints a diagnosis. It catches many of the problems below automatically.
Symptom: "address already in use" on start
Mothership failed to bind because something else is holding the port.
aura mothership start
error: failed to bind 0.0.0.0:7777
caused by: address already in use
Diagnosis
sudo lsof -i :7777
# or
sudo ss -tlnp | grep 7777
You will usually see either a previous Mothership process that didn't clean up, or an unrelated service that grabbed the port.
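A plain grep 7777 also matches ports such as 17777. The sketch below is one way to match the port exactly. It assumes the usual ss -tln column layout (state in column 1, local address in column 4); listeners_on_port is a hypothetical helper name, not part of aura.

```shell
# Print the local address of every TCP listener bound to exactly the given port.
# Reads `ss -tln` (or `ss -tlnp`) output on stdin; returns non-zero if none match.
listeners_on_port() (
  port=$1
  awk -v suffix=":$port" '
    # Compare the tail of the Local Address:Port column with ":<port>".
    $1 == "LISTEN" && substr($4, length($4) - length(suffix) + 1) == suffix {
      print $4
      found = 1
    }
    END { exit !found }
  '
)

# Usage (on the Mothership host):
#   sudo ss -tlnp | listeners_on_port 7777
```

An empty result with a non-zero exit status means nothing is listening on that port, so the bind failure has another cause.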
Fix
If it is a stale Mothership:
aura mothership stop
# or, if the CLI can't find it
sudo kill $(pgrep -f 'aura mothership')
If it is a different service, pick a different port:
aura mothership start --port 8443
If you are on Linux and the port is below 1024, you need either root or the CAP_NET_BIND_SERVICE capability on the binary:
sudo setcap cap_net_bind_service=+ep $(which aura)
Symptom: peers can't reach the Mothership
aura mothership ping from a peer times out or refuses.
ping mothership.acme.internal:7777
tcp connect: timeout
Diagnosis
The three causes, in order of likelihood:
- Firewall blocking. The Mothership is listening but packets never arrive.
- Bind address is wrong. Mothership bound to 127.0.0.1 when it should have bound to 0.0.0.0 or a specific LAN IP.
- DNS resolving to the wrong address. Peer reaches an IP, but not the Mothership's.
Check from the Mothership host:
aura mothership status
Look at listening. If it says 127.0.0.1:7777, that's your problem.
sudo ss -tlnp | grep aura
If that shows LISTEN 0 128 0.0.0.0:7777, Mothership is listening on all interfaces. The problem is network-side.
From a peer, try a basic connectivity check:
nc -zv mothership.acme.internal 7777
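refused means the packets reached a host, but nothing accepted the connection on that port (or a firewall actively rejected it). timeout means packets are being black-holed, usually by a firewall that silently drops them.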
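The refused-versus-timeout distinction can be scripted for use across several peers. This is a minimal sketch, assuming GNU coreutils timeout (which exits with 124 when it kills the command) and an nc that supports -z. The helper names are hypothetical.

```shell
# Map the exit status of a timed connect attempt to a label.
#   0     -> open
#   124   -> timeout (status used by GNU `timeout` when the time limit is hit)
#   other -> refused (or another immediate failure, such as a DNS error)
classify_probe() {
  case $1 in
    0)   echo open ;;
    124) echo timeout ;;
    *)   echo refused ;;
  esac
}

# Attempt a TCP connect to host $1, port $2, giving up after 5 seconds.
probe() (
  status=0
  timeout 5 nc -z "$1" "$2" >/dev/null 2>&1 || status=$?
  classify_probe "$status"
)

# Usage (from a peer):
#   probe mothership.acme.internal 7777
```

The "refused" label is deliberately coarse. If it appears, rerun nc -zv by hand to see the exact message.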
Fix
Firewall (common culprits): ufw on Ubuntu, firewalld on RHEL/Fedora, iptables rules, cloud security groups.
# Ubuntu
sudo ufw allow 7777/tcp
# RHEL/Fedora
sudo firewall-cmd --permanent --add-port=7777/tcp
sudo firewall-cmd --reload
# AWS security group: add inbound TCP 7777 from the CIDR of your team
Wrong bind: fix the config file or --bind flag. See starting mothership.
DNS: run dig mothership.acme.internal from the peer. Verify it returns the expected IP. If you're using internal DNS + VPN, make sure DNS is routed through the VPN.
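The DNS check can be reduced to a pass/fail test. The sketch below reads dig +short output and looks for the address you expect. The IP 10.0.0.12 is a placeholder, and resolves_to is a hypothetical helper.

```shell
# Succeed if the expected IP appears as a whole line in `dig +short` output on stdin.
# -F treats the address as a literal string, -x requires the entire line to match.
resolves_to() {
  grep -qxF -- "$1"
}

# Usage (from the peer; replace 10.0.0.12 with the Mothership's real address):
#   dig +short mothership.acme.internal | resolves_to 10.0.0.12 && echo "DNS ok"
```

Whole-line matching matters here: a substring match would accept 10.0.0.120 when you expect 10.0.0.12.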
Symptom: "tls handshake failed"
Peer connects, TCP works, TLS fails.
error: tls handshake failed
caused by: certificate verify failed: unable to get local issuer certificate
Diagnosis
This almost always means fingerprint pinning detected a mismatch. Either:
- The Mothership's cert was rotated and the peer still has the old fingerprint from its join token.
- Something in the network is terminating TLS (corporate proxy, inspection appliance).
- The peer is talking to the wrong host.
Check the fingerprint on the Mothership:
aura mothership tls info
Compare to the fingerprint the peer expects. On the peer, look at the stored Mothership cert:
aura mothership peer-config --show-pinned
If they differ, that's the issue.
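Fingerprints are easy to misread when one tool prints AB:CD:... and another prints abcd.... The sketch below normalizes both forms before comparing. It assumes the fingerprints are hex strings, optionally colon-separated and optionally prefixed with a label ending in "=" (as in openssl x509 -fingerprint output). The exact format aura prints is not specified here, so treat the helpers as illustrative.

```shell
# Normalize a hex fingerprint: drop any "label=" prefix, colons, and spaces; lowercase it.
normalize_fp() {
  printf '%s\n' "$1" | sed 's/^.*=//' | tr -d ': \n' | tr 'A-F' 'a-f'
}

# Succeed if two fingerprints are equal after normalization.
same_fp() {
  [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}

# Usage: paste the two values you are comparing.
#   same_fp "$pinned" "$served" && echo match || echo MISMATCH
#
# To see the SHA-256 fingerprint of the certificate the network path actually
# serves to this peer (useful for spotting a TLS-terminating middlebox):
#   openssl s_client -connect mothership.acme.internal:7777 </dev/null 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha256
```

Whether aura pins the SHA-256 of the certificate is an assumption. If the openssl value never matches either side, compare only the two values reported by aura.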
Fix
If the Mothership cert was rotated intentionally, the peer needs a new join token (which will pin the new fingerprint) or an explicit re-pin:
aura mothership repin --url https://mothership.acme.internal:7777
repin requires the user to visually confirm the new fingerprint, which they should have received out of band.
If something is TLS-terminating in the middle (common in heavily managed corporate networks): there is no good workaround. The whole point of fingerprint pinning is to reject this. Get your team's traffic onto a segment that does not inspect TLS, or use a mesh VPN like Tailscale that tunnels below the inspection layer.
See TLS and JWT for the underlying model.
Symptom: "certificate has expired"
Self-signed certs are valid for 365 days. If nothing rotated them, they expire.
error: tls: certificate has expired
Fix
On the Mothership host:
aura mothership tls rotate
This generates a new certificate. Existing peers within the grace window (default 24h) accept both old and new fingerprints. Peers who joined long ago may need to re-pin:
aura mothership repin --url https://mothership.acme.internal:7777
For production, prefer Let's Encrypt or an internal CA with automated renewal. See TLS and JWT.
Security callout. A Mothership with an expired cert is a monitoring failure. Alert on aura_tls_cert_not_after_seconds < 30 * 86400 (30 days out) so you rotate before the outage.
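If you do not have metrics-based alerting yet, the same 30-day threshold can be checked from a cron job. This is a minimal sketch of the arithmetic, assuming you can obtain the certificate's expiry as Unix epoch seconds. The helper names are hypothetical.

```shell
# Whole days from $2 (now) until $1 (certificate notAfter), both in epoch seconds.
# Shell integer division truncates toward zero.
days_until_expiry() {
  echo $(( ($1 - $2) / 86400 ))
}

# Succeed when the certificate expires in fewer than $3 days (default 30).
expires_within() {
  [ $(( $1 - $2 )) -lt $(( ${3:-30} * 86400 )) ]
}

# Usage, given a not-after timestamp in $not_after:
#   expires_within "$not_after" "$(date -u +%s)" && echo "rotate soon"
#
# For a PEM file on disk, openssl can do the same check directly
# (the exit status is non-zero if the cert expires within 2592000 s = 30 days):
#   openssl x509 -in cert.pem -noout -checkend 2592000
```

The cert.pem path is a placeholder. Where Mothership stores its certificate is not covered on this page.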
Symptom: "token expired" on join
Self-explanatory. The JWT's exp claim has passed.
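To confirm when a token expired on a machine without the aura CLI, the payload can be decoded with standard tools. This sketch assumes a compact three-segment JWT and a base64 that accepts -d (GNU coreutils). It does not verify the signature, and the helper names are hypothetical. Treat the token as a secret while handling it.

```shell
# Print the decoded payload (second dot-separated segment) of a JWT.
jwt_payload() (
  # Convert base64url to standard base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it so base64 accepts the input.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
)

# Print the numeric exp claim (Unix epoch seconds) from a JWT.
jwt_exp() {
  jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   [ "$(jwt_exp "$TOKEN")" -lt "$(date -u +%s)" ] && echo "token expired"
```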
Fix
Issue a new token:
aura mothership token issue --for alice@acme.com
Default expiry is 24 hours. Tell your team to run join within that window, or issue longer-lived tokens for teammates in distant time zones.
Symptom: "token already used"
Most join tokens are one-shot. A reused token is rejected.
Fix
Issue a fresh token. If you genuinely need a multi-use token (bulk provisioning), issue one explicitly:
aura mothership token issue --for team-onboarding --uses 20 --expires-in 72h
Track it carefully. Multi-use tokens are a larger blast radius. See join token security.
Symptom: "invalid signature"
The token was signed by a different Mothership, or the Mothership's JWT key was rotated.
Diagnosis
aura mothership token decode <token>
Check the iss claim — it identifies the issuing Mothership. If it doesn't match the host you're trying to join, the user has the wrong token.
If iss matches but verification fails, the Mothership's JWT key was rotated after this token was issued. Check:
aura mothership audit --filter key_rotation --since 7d
Fix
Re-issue from the current Mothership:
aura mothership token issue --for alice@acme.com
Symptom: "clock skew too large"
JWT exp / iat validation is strict. Clocks more than 60 seconds apart fail.
error: token not yet valid (iat is in the future)
Diagnosis
# on peer
date -u
# on mothership
date -u
Difference greater than a minute? NTP is broken somewhere.
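Comparing two date -u printouts by eye is error-prone. The sketch below compares epoch seconds against the 60-second limit. It assumes you can fetch the Mothership's clock somehow (SSH is used in the usage line as an example only). The helper names are hypothetical.

```shell
# Absolute difference, in seconds, between two Unix timestamps.
clock_skew() (
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  echo "$d"
)

# Succeed when the skew exceeds the limit in $3 (default 60 seconds).
skew_too_large() {
  [ "$(clock_skew "$1" "$2")" -gt "${3:-60}" ]
}

# Usage (from a peer with SSH access to the Mothership host):
#   local_ts=$(date -u +%s)
#   remote_ts=$(ssh mothership.acme.internal date -u +%s)
#   skew_too_large "$local_ts" "$remote_ts" && echo "fix NTP first"
```

SSH round-trip time is included in the measurement, so a result of a second or two is not meaningful. A result near or above 60 is.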
Fix
# Linux
sudo timedatectl set-ntp true
sudo systemctl restart systemd-timesyncd
# macOS
sudo sntp -sS time.apple.com
Wait a minute, then try again. If the problem persists, your NTP server is probably unreachable; check firewall rules for outbound UDP 123.
Symptom: peers see each other as offline despite being online
Control connection is up but peer-to-peer isn't discovering.
Diagnosis
On a peer:
aura mothership peers --detail
peer_id        subject     status  last_seen  direct_connects
peer_0a1b2c3d  alice@acme  online  3s         3/3
peer_1b2c3d4e  bob@acme    online  8s         0/3
0/3 direct connections to Bob means every attempt to reach Bob directly failed. Three direct connect attempts = three network hints. All failing usually means Bob's announced addresses aren't reachable.
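With many peers, the 0/N rows are easier to pick out with a filter. This sketch assumes the five whitespace-separated columns shown in the sample output above, with direct_connects last. unreachable_peers is a hypothetical helper.

```shell
# Print "peer_id subject" for every peer whose direct_connects column is 0/N.
# Reads `aura mothership peers --detail` output on stdin and skips the header row.
unreachable_peers() {
  awk 'NR > 1 && $5 ~ /^0\/[0-9]+$/ { print $1, $2 }'
}

# Usage (on a peer):
#   aura mothership peers --detail | unreachable_peers
```

If the same peer shows up in this list from every other peer, the problem is most likely on that peer's side of the network.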
Fix
This is almost always a NAT or VPN issue. Two paths forward:
- Put everyone on a mesh VPN like Tailscale. Announced addresses become reachable. See p2p architecture.
- Live with relay. Direct connections are an optimization. Mothership will fall back to relaying through itself; nothing breaks, it just uses more Mothership bandwidth.
If you use Tailscale but direct is still failing, check that the Tailscale IP is being announced:
aura mothership peer-config --show-announced
You should see a 100.x.x.x address. If not, restart aura sync after Tailscale is up.
Symptom: WAL growing without bound
The WAL keeps getting bigger even when online.
Diagnosis
aura wal status
WAL:
unacked events: 15234
Thousands of unacked events while online means events are being written faster than the Mothership is accepting them, or the Mothership is silently failing to ack.
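A single reading does not show whether the backlog is growing or draining. The sketch below extracts the count so that two samples can be compared. It assumes the "unacked events: N" line format shown above. The helper names are hypothetical.

```shell
# Extract the unacked event count from `aura wal status` output on stdin.
unacked_events() {
  awk -F: '/unacked events/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Succeed when the second sample ($2) is larger than the first ($1).
backlog_growing() {
  [ "$2" -gt "$1" ]
}

# Usage: take two samples a minute apart.
#   a=$(aura wal status | unacked_events)
#   sleep 60
#   b=$(aura wal status | unacked_events)
#   backlog_growing "$a" "$b" && echo "WAL still growing: $a -> $b"
```

A large but shrinking count usually means the peer is catching up after an outage, which needs no action.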
On the Mothership:
aura mothership status
Check sync backlog. A growing backlog means the Mothership is overloaded. See scaling.
Fix
- If the Mothership is CPU-bound: raise sync_workers (see scaling).
- If the Mothership is network-bound: look at iftop / nload on the host.
- If the issue is one noisy peer: find it by sorting aura mothership peers --detail on its events_per_min column (sort -k takes a numeric field position, not a column name, so use sort -k N -n -r where N is that column's position), and coordinate with them to reduce push rate.
Symptom: "peer revoked" on reconnect
Peer was revoked by an admin. Expected behavior.
Fix
If revocation was intentional: nothing to do.
If it was accidental: a Mothership admin can un-revoke:
aura mothership peer unrevoke peer_0a1b2c3d
And the peer will reconnect on its next retry.
Symptom: "protocol version mismatch"
Peer binary and Mothership binary are from incompatible Aura versions.
error: protocol version mismatch (peer: v1, mothership: v2)
Fix
Update the older of the two. Mothership protocol is stable within major versions; a version bump of Mothership may require peers to upgrade to a compatible point release:
# on peer
aura self-update
Coordinate Mothership upgrades with a heads-up to the team: "Rolling Mothership to v0.13.0 at 4pm; you may see a brief disconnect and a prompt to aura self-update."
Symptom: health endpoint returns 503
Mothership is up but /healthz returns 503.
Diagnosis
curl -k https://mothership.acme.internal:7777/healthz
{"status":"degraded","reason":"wal_behind"}
Common reasons:
- wal_behind: disk I/O can't keep up with write rate. Get faster disks.
- tls_cert_expiring_soon: cert expires within 7 days. Rotate.
- federation_partner_unreachable: a federated Mothership is down. Reach out.
Fix
Each reason has a different fix. aura doctor on the host expands the diagnosis.
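For a health-check script, the reason field can be pulled out of the response and mapped to the next step. This is a sketch that assumes the flat JSON shape shown above and covers only the three reasons listed on this page. The helper names are hypothetical.

```shell
# Extract the "reason" value from a /healthz JSON body on stdin.
health_reason() {
  sed -n 's/.*"reason"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Map a reason code to the remediation described on this page.
suggest_fix() {
  case $1 in
    wal_behind)                     echo "get faster disks" ;;
    tls_cert_expiring_soon)         echo "run: aura mothership tls rotate" ;;
    federation_partner_unreachable) echo "contact the federated Mothership's operator" ;;
    *)                              echo "unknown reason: run aura doctor on the host" ;;
  esac
}

# Usage:
#   suggest_fix "$(curl -sk https://mothership.acme.internal:7777/healthz | health_reason)"
```

The sed extraction is sufficient for this flat response. If the body grows nested fields, a real JSON parser such as jq is the safer choice.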
Symptom: "aura doctor" shows a warning you don't understand
aura doctor outputs compact diagnostic codes. To see the full explanation for a code, pass it to --explain:
aura doctor --explain DR023
The explanation covers remediation steps, when the check was added, and links to the relevant documentation.
When to Open an Issue
After you have:
- Read this page and aura doctor output.
- Checked Mothership logs for the specific error.
- Verified your version is current.
- Confirmed the issue is not network- or clock-related.
File an issue on the Aura repository with:
- Your Aura version (aura --version).
- The exact error text.
- Redacted output from aura mothership status and aura doctor.
- A rough sketch of your topology (number of peers, Motherships, network).
Do not include raw tokens, private keys, or TLS material in issue reports.