Mothership Troubleshooting

Port conflicts, firewall black holes, TLS failures, JWT expiry. What breaks in the wild and how to fix it.

Overview

Most Mothership problems fall into four buckets: the port, the network path, the certificate, or the token. This page is organized by symptom — what you see when something is wrong — and walks through diagnosis and fix. It is not exhaustive. It covers the issues we have seen most often across production deployments.

Before filing anything upstream, always run:

aura mothership status
aura doctor

aura doctor performs a battery of self-checks and prints a diagnosis. It catches many of the problems below automatically.

Symptom: "address already in use" on start

Mothership failed to bind because something else is holding the port.

aura mothership start

error: failed to bind 0.0.0.0:7777
  caused by: address already in use

Diagnosis

sudo lsof -i :7777
# or
sudo ss -tlnp | grep 7777

You will usually see either a previous Mothership process that didn't clean up, or an unrelated service that grabbed the port.

Fix

If it is a stale Mothership:

aura mothership stop
# or, if the CLI can't find it
sudo kill $(pgrep -f 'aura mothership')

If it is a different service, pick a different port:

aura mothership start --port 8443

If you are on Linux and the port is below 1024, you need either root or the CAP_NET_BIND_SERVICE capability on the binary:

sudo setcap cap_net_bind_service=+ep $(which aura)

Symptom: peers can't reach the Mothership

aura mothership ping from a peer times out or is refused.

ping mothership.acme.internal:7777
  tcp connect:    timeout

Diagnosis

The three causes, in order of likelihood:

Firewall blocking. The Mothership is listening but packets never arrive.
Bind address is wrong. Mothership bound to 127.0.0.1 when it should have bound to 0.0.0.0 or a specific LAN IP.
DNS resolving to the wrong address. Peer reaches an IP, but not the Mothership's.

Check from the Mothership host:

aura mothership status

Look at listening. If it says 127.0.0.1:7777, that's your problem.

sudo ss -tlnp | grep aura

If that shows LISTEN 0 128 0.0.0.0:7777, Mothership is listening on all interfaces. The problem is network-side.

From a peer, try a basic connectivity check:

nc -zv mothership.acme.internal 7777

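refused means the host answered but nothing accepted the connection on that port: Mothership is not running, or it is bound to a different address or port. timeout means packets are being black-holed, usually by a firewall.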
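The refused/timeout distinction is easy to script when you need to check many peers. The following is a minimal sketch, not part of Aura: classify_connect is a hypothetical helper, and the message patterns are assumptions, since the exact wording differs between netcat implementations.

```shell
# Hypothetical helper: map the text nc prints to a likely cause.
# The patterns are assumptions; adjust them to what your nc prints.
classify_connect() {
  case "$1" in
    *refused*)                      echo "host-up-port-closed" ;;
    *"timed out"*|*timeout*)        echo "black-holed" ;;
    *succeeded*|*open*|*Connected*) echo "reachable" ;;
    *)                              echo "unknown" ;;
  esac
}

# Usage (-w 5 caps the wait at five seconds; nc reports on stderr):
#   classify_connect "$(nc -zv -w 5 mothership.acme.internal 7777 2>&1)"
```

A result of host-up-port-closed points at the bind address or a stopped Mothership; black-holed points at a firewall or security group.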

Fix

Firewall (common culprits): ufw on Ubuntu, firewalld on RHEL/Fedora, iptables rules, cloud security groups.

# Ubuntu
sudo ufw allow 7777/tcp

# RHEL/Fedora
sudo firewall-cmd --permanent --add-port=7777/tcp
sudo firewall-cmd --reload

# AWS security group: add inbound TCP 7777 from the CIDR of your team

Wrong bind: fix the config file or --bind flag. See starting mothership.

DNS: run dig mothership.acme.internal from the peer. Verify it returns the expected IP. If you're using internal DNS + VPN, make sure DNS is routed through the VPN.

Symptom: "tls handshake failed"

Peer connects, TCP works, TLS fails.

error: tls handshake failed
  caused by: certificate verify failed: unable to get local issuer certificate

Diagnosis

This almost always means fingerprint pinning detected a mismatch. Either:

The Mothership's cert was rotated and the peer still has the old fingerprint from its join token.
Something in the network is terminating TLS (corporate proxy, inspection appliance).
The peer is talking to the wrong host.

Check the fingerprint on the Mothership:

aura mothership tls info

Compare to the fingerprint the peer expects. On the peer, look at the stored Mothership cert:

aura mothership peer-config --show-pinned

If they differ, that's the issue.
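Fingerprints are often printed with different separators or letter case, which makes comparing them by eye error-prone. The sketch below normalizes two values before comparing them. It assumes the fingerprints are hex digests. The output format of the aura commands is not specified here, and fp_normalize and fp_equal are hypothetical helper names.

```shell
# Normalize a hex fingerprint: drop colons and whitespace, lowercase it.
fp_normalize() {
  printf '%s' "$1" | tr -d ': \t\n' | tr 'A-F' 'a-f'
}

# Succeeds (exit 0) when two fingerprints match after normalization.
fp_equal() {
  [ "$(fp_normalize "$1")" = "$(fp_normalize "$2")" ]
}

# Independent cross-check with OpenSSL: print the SHA-256 fingerprint of
# the certificate the peer actually receives on the wire. The value
# follows the "=" in the output; strip the prefix before comparing.
#   openssl s_client -connect mothership.acme.internal:7777 </dev/null 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha256
```

If the OpenSSL fingerprint taken from the peer's network differs from the one the Mothership reports, something between the two is terminating TLS or the peer is reaching a different host.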

Fix

If the Mothership cert was rotated intentionally, the peer needs a new join token (which will pin the new fingerprint) or an explicit re-pin:

aura mothership repin --url https://mothership.acme.internal:7777

repin requires the user to visually confirm the new fingerprint, which they should have received out of band.

If something is TLS-terminating in the middle (common in heavily managed corporate networks): there is no good workaround. The whole point of fingerprint pinning is to reject this. Get your team's traffic onto a segment that does not inspect TLS, or use a mesh VPN like Tailscale that tunnels below the inspection layer.

See TLS and JWT for the underlying model.

Symptom: "certificate has expired"

Self-signed certs are valid for 365 days. If nothing rotated them, they expire.

error: tls: certificate has expired

Fix

On the Mothership host:

aura mothership tls rotate

This generates a new certificate. Existing peers within the grace window (default 24h) accept both old and new fingerprints. Peers who joined long ago may need to re-pin:

aura mothership repin --url https://mothership.acme.internal:7777

For production, prefer Let's Encrypt or an internal CA with automated renewal. See TLS and JWT.

Security callout. A Mothership with an expired cert is a monitoring failure. Alert on aura_tls_cert_not_after_seconds < 30 * 86400 (30 days out) so you rotate before the outage.
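The same 30-day threshold can be checked from a cron job if you have no metrics pipeline. This is a minimal sketch of the arithmetic only. The helper names are hypothetical, and the usage lines assume GNU date and a certificate file whose path you will need to substitute.

```shell
# Whole days remaining until a certificate's notAfter time.
# Both arguments are Unix timestamps in seconds. Integer division
# truncates toward zero, so 0 covers anything within a day of expiry
# on either side; negative values mean expired by a day or more.
cert_days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Example policy: succeed when fewer than 30 days remain.
cert_needs_rotation() {
  [ "$(cert_days_left "$1" "$2")" -lt 30 ]
}

# Usage (GNU date; cert.pem is a placeholder path):
#   end=$(openssl x509 -noout -enddate -in cert.pem | cut -d= -f2)
#   cert_needs_rotation "$(date -u -d "$end" +%s)" "$(date -u +%s)" \
#     && echo "rotate soon"
```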

Symptom: "token expired" on join

Self-explanatory. The JWT's exp claim has passed.

Fix

Issue a new token:

aura mothership token issue --for alice@acme.com

Default expiry is 24 hours. Tell your team to run join within that window, or issue longer-lived tokens for teammates in distant time zones.

Symptom: "token already used"

Most join tokens are one-shot. A reused token is rejected.

Fix

Issue a fresh token. If you genuinely need a multi-use token (bulk provisioning), issue one explicitly:

aura mothership token issue --for team-onboarding --uses 20 --expires-in 72h

Track it carefully. Multi-use tokens are a larger blast radius. See join token security.

Symptom: "invalid signature"

The token was signed by a different Mothership, or the Mothership's JWT key was rotated.

Diagnosis

aura mothership token decode <token>

Check the iss claim — it identifies the issuing Mothership. If it doesn't match the host you're trying to join, the user has the wrong token.
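If the aura CLI is not available where the token is, the claims can still be read with standard tools, because a JWT is three base64url segments separated by dots. The sketch below is a generic decoder, not Aura's implementation. jwt_payload is a hypothetical helper, and it only decodes: it does not verify the signature.

```shell
# Print the payload (second segment) of a JWT as JSON.
# This only decodes; it does NOT verify the signature.
# Assumes a base64 that accepts -d (GNU coreutils, recent macOS).
jwt_payload() {
  local seg
  # Take the middle segment and map base64url characters to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops trailing '=' padding; restore it for base64.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
}

# Usage:
#   jwt_payload "$TOKEN"
```

Run this locally rather than pasting a token into a web decoder; an unexpired join token is a credential.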

If iss matches but verification fails, the Mothership's JWT key was rotated after this token was issued. Check:

aura mothership audit --filter key_rotation --since 7d

Fix

Re-issue from the current Mothership:

aura mothership token issue --for alice@acme.com

Symptom: "clock skew too large"

JWT exp / iat validation is strict. Clocks more than 60 seconds apart fail.

error: token not yet valid (iat is in the future)

Diagnosis

# on peer
date -u
# on mothership
date -u

Difference greater than a minute? NTP is broken somewhere.
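Comparing two date -u outputs by eye works, but the check is simple to automate. A minimal sketch follows, assuming you can collect a Unix timestamp from each host. clock_skew and skew_ok are hypothetical helpers, and the 60-second default mirrors the threshold stated above.

```shell
# Absolute difference between two Unix timestamps, in seconds.
clock_skew() {
  local d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then
    d=$(( -d ))
  fi
  echo "$d"
}

# Succeeds when the skew is within the limit (default 60 seconds).
skew_ok() {
  [ "$(clock_skew "$1" "$2")" -le "${3:-60}" ]
}

# Usage, assuming SSH access to the Mothership host:
#   skew_ok "$(date -u +%s)" "$(ssh mothership.acme.internal date -u +%s)" \
#     || echo "clock skew exceeds 60s"
```

The SSH round trip adds a little latency to the remote reading, so treat results within a few seconds of the limit as inconclusive.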

Fix

# Linux
sudo timedatectl set-ntp true
sudo systemctl restart systemd-timesyncd

# macOS
sudo sntp -sS time.apple.com

Wait a minute, then try again. If the problem persists, your NTP server may be unreachable; check firewall rules for UDP 123.

Symptom: peers see each other as offline despite being online

The control connection is up, but peers are failing to connect to each other directly.

Diagnosis

On a peer:

aura mothership peers --detail

peer_id          subject          status    last_seen  direct_connects
peer_0a1b2c3d    alice@acme       online    3s         3/3
peer_1b2c3d4e    bob@acme         online    8s         0/3

0/3 direct connections to Bob means every attempt to reach Bob directly failed. Each attempt corresponds to one network hint, so three attempts means three hints were tried. When all of them fail, Bob's announced addresses usually aren't reachable.
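On a larger team it helps to list only the peers with no working direct connection. The sketch below filters the table for them. It assumes the column layout shown above, with a header row and direct_connects as the last field. peers_without_direct is a hypothetical helper.

```shell
# Print the peer_id of every peer whose direct_connects column reads 0/N.
# Assumes the layout shown above: header row first, direct_connects last.
peers_without_direct() {
  awk 'NR > 1 && NF > 0 {
    split($NF, parts, "/")
    if (parts[1] == 0 && parts[2] > 0) print $1
  }'
}

# Usage:
#   aura mothership peers --detail | peers_without_direct
```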

Fix

This is almost always a NAT or VPN issue. Two paths forward:

Put everyone on a mesh VPN like Tailscale. Announced addresses become reachable. See p2p architecture.
Live with relay. Direct connections are an optimization. Mothership will fall back to relaying through itself; nothing breaks, it just uses more Mothership bandwidth.

If you use Tailscale but direct is still failing, check that the Tailscale IP is being announced:

aura mothership peer-config --show-announced

You should see a 100.x.x.x address. If not, restart aura sync after Tailscale is up.
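Not every address starting with 100 is a Tailscale address: Tailscale assigns IPv4 addresses from the carrier-grade NAT range 100.64.0.0/10, which spans 100.64.0.0 through 100.127.255.255. A small sketch for checking an announced address against that range follows. in_cgnat_range is a hypothetical helper, and it inspects only the first two octets.

```shell
# Succeeds when an IPv4 address falls in 100.64.0.0/10, the
# carrier-grade NAT range that Tailscale assigns addresses from.
in_cgnat_range() {
  local first second
  first=${1%%.*}          # text before the first dot
  second=${1#*.}          # drop the first octet
  second=${second%%.*}    # keep only the second octet
  [ "$first" = "100" ] \
    && [ "$second" -ge 64 ] 2>/dev/null \
    && [ "$second" -le 127 ]
}

# Usage:
#   in_cgnat_range 100.101.102.103 && echo "looks like a Tailscale address"
```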

Symptom: WAL growing without bound

The WAL keeps growing even while the peer is online.

Diagnosis

aura wal status

WAL:
  unacked events: 15234

Thousands of unacked events while online means events are being written faster than the Mothership is accepting them, or the Mothership is silently failing to ack.
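A single reading does not show whether the backlog is growing or draining, so it is worth sampling twice. The sketch below extracts the count and compares two samples. It assumes the "unacked events: N" line format shown above, and wal_unacked and wal_growing are hypothetical helpers.

```shell
# Extract the unacked event count from `aura wal status` output.
# Assumes the "unacked events: N" line format shown above.
wal_unacked() {
  awk -F': *' '/unacked events/ { print $2; exit }'
}

# Compare two samples; succeeds when the second is larger than the first.
wal_growing() {
  [ "$2" -gt "$1" ]
}

# Usage: sample twice, a minute apart.
#   before=$(aura wal status | wal_unacked)
#   sleep 60
#   after=$(aura wal status | wal_unacked)
#   wal_growing "$before" "$after" && echo "backlog still growing"
```

A large but shrinking count usually means the peer is catching up after a disconnect and needs no action.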

On the Mothership:

aura mothership status

Check sync backlog. A growing backlog means the Mothership is overloaded. See scaling.

Fix

If the Mothership is CPU-bound: raise sync_workers (see scaling).
If the Mothership is network-bound: look at iftop / nload on the host.
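If the issue is one noisy peer: find it by sorting the output of aura mothership peers --detail on its events_per_min column, then coordinate with them to reduce push rate. Note that sort -k takes a field number rather than a column name, so use sort -k N -nr, where N is the position of events_per_min in your output.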

Symptom: "peer revoked" on reconnect

Peer was revoked by an admin. Expected behavior.

Fix

If revocation was intentional: nothing to do.

If it was accidental: a Mothership admin can un-revoke:

aura mothership peer unrevoke peer_0a1b2c3d

The peer will reconnect on its next retry.

Symptom: "protocol version mismatch"

Peer binary and Mothership binary are from incompatible Aura versions.

error: protocol version mismatch (peer: v1, mothership: v2)

Fix

Update the older of the two. Mothership protocol is stable within major versions; a version bump of Mothership may require peers to upgrade to a compatible point release:

# on peer
aura self-update

Coordinate Mothership upgrades with a heads-up to the team: "Rolling Mothership to v0.13.0 at 4pm; you may see a brief disconnect and a prompt to aura self-update."

Symptom: health endpoint returns 503

Mothership is up but /healthz returns 503.

Diagnosis

curl -k https://mship:7777/healthz

{"status":"degraded","reason":"wal_behind"}

Common reasons:

wal_behind: disk I/O can't keep up with write rate. Get faster disks.
tls_cert_expiring_soon: cert expires within 7 days. Rotate.
federation_partner_unreachable: a federated Mothership is down. Reach out.

Fix

Each reason has a different fix. aura doctor on the host expands the diagnosis.

Symptom: "aura doctor" shows a warning you don't understand

aura doctor prints compact diagnostic codes. To see the full explanation for a code, pass it to --explain:

aura doctor --explain DR023

The explanation covers remediation steps, when the check was added, and links to the relevant documentation.

When to Open an Issue

Open an issue only after you have:

Read this page and aura doctor output.
Checked Mothership logs for the specific error.
Verified your version is current.
Confirmed the issue is not network- or clock-related.

File an issue on the Aura repository with:

Your Aura version (aura --version).
The exact error text.
Redacted output from aura mothership status and aura doctor.
A rough sketch of your topology (number of peers, Motherships, network).

Do not include raw tokens, private keys, or TLS material in issue reports.
