# Mothership Troubleshooting

_Port conflicts, firewall black holes, TLS failures, JWT expiry. What breaks in the wild and how to fix it._

## Overview

Most Mothership problems fall into four buckets: the port, the network path, the certificate, or the token. This page is organized by symptom — what you see when something is wrong — and walks through diagnosis and fix. It is not exhaustive. It covers the issues we have seen most often across production deployments.

Before filing anything upstream, always run:

```bash
aura mothership status
aura doctor
```

`aura doctor` performs a battery of self-checks and prints a diagnosis. A lot of the problems below it will catch automatically.

## Symptom: "address already in use" on start

Mothership failed to bind because something else is holding the port.

```bash
aura mothership start
```

```text
error: failed to bind 0.0.0.0:7777
  caused by: address already in use
```

### Diagnosis

```bash
sudo lsof -i :7777
# or
sudo ss -tlnp | grep 7777
```

You will usually see either a previous Mothership process that didn't clean up, or an unrelated service that grabbed the port.

### Fix

If it is a stale Mothership:

```bash
aura mothership stop
# or, if the CLI can't find it
sudo kill $(pgrep -f 'aura mothership')
```

If it is a different service, pick a different port:

```bash
aura mothership start --port 8443
```

If you are on Linux and the port is below 1024, you need either root or the `CAP_NET_BIND_SERVICE` capability on the binary:

```bash
sudo setcap cap_net_bind_service=+ep $(which aura)
```

## Symptom: peers can't reach the Mothership

`aura mothership ping` from a peer times out or refuses.

```text
ping mothership.acme.internal:7777
  tcp connect:    timeout
```

### Diagnosis

The three causes, in order of likelihood:

1. **Firewall blocking.** The Mothership is listening but packets never arrive.
2. **Bind address is wrong.** Mothership bound to `127.0.0.1` when it should have bound to `0.0.0.0` or a specific LAN IP.
3. **DNS resolving to the wrong address.** Peer reaches _an_ IP, but not the Mothership's.

Check from the Mothership host:

```bash
aura mothership status
```

Look at `listening`. If it says `127.0.0.1:7777`, that's your problem.

```bash
sudo ss -tlnp | grep aura
```

If that shows `LISTEN 0 128 0.0.0.0:7777`, Mothership is listening on all interfaces. The problem is network-side.

From a peer, try a basic connectivity check:

```bash
nc -zv mothership.acme.internal 7777
```

`refused` means something is responding. `timeout` means packets are being black-holed, usually a firewall.

### Fix

**Firewall** (common culprits): `ufw` on Ubuntu, `firewalld` on RHEL/Fedora, `iptables` rules, cloud security groups.

```bash
# Ubuntu
sudo ufw allow 7777/tcp

# RHEL/Fedora
sudo firewall-cmd --permanent --add-port=7777/tcp
sudo firewall-cmd --reload

# AWS security group: add inbound TCP 7777 from the CIDR of your team
```

**Wrong bind**: fix the config file or `--bind` flag. See [starting mothership](/starting-mothership).

**DNS**: run `dig mothership.acme.internal` from the peer. Verify it returns the expected IP. If you're using internal DNS + VPN, make sure DNS is routed through the VPN.

## Symptom: "tls handshake failed"

Peer connects, TCP works, TLS fails.

```text
error: tls handshake failed
  caused by: certificate verify failed: unable to get local issuer certificate
```

### Diagnosis

This almost always means fingerprint pinning detected a mismatch. Either:

1. The Mothership's cert was rotated and the peer still has the old fingerprint from its join token.
2. Something in the network is terminating TLS (corporate proxy, inspection appliance).
3. The peer is talking to the wrong host.

Check the fingerprint on the Mothership:

```bash
aura mothership tls info
```

Compare to the fingerprint the peer expects. On the peer, look at the stored Mothership cert:

```bash
aura mothership peer-config --show-pinned
```

If they differ, that's the issue.

### Fix

If the Mothership cert was rotated intentionally, the peer needs a new join token (which will pin the new fingerprint) or an explicit re-pin:

```bash
aura mothership repin --url https://mothership.acme.internal:7777
```

`repin` requires the user to visually confirm the new fingerprint, which they should have received out of band.

If something is TLS-terminating in the middle (common in heavily managed corporate networks): there is no good workaround. The whole point of fingerprint pinning is to reject this. Get your team's traffic onto a segment that does not inspect TLS, or use a mesh VPN like Tailscale that tunnels below the inspection layer.

See [TLS and JWT](/tls-and-jwt) for the underlying model.

## Symptom: "certificate has expired"

Self-signed certs are valid for 365 days. If nothing rotated them, they expire.

```text
error: tls: certificate has expired
```

### Fix

On the Mothership host:

```bash
aura mothership tls rotate
```

This generates a new certificate. Existing peers within the grace window (default 24h) accept both old and new fingerprints. Peers who joined long ago may need to re-pin:

```bash
aura mothership repin --url https://mothership.acme.internal:7777
```

For production, prefer Let's Encrypt or an internal CA with automated renewal. See [TLS and JWT](/tls-and-jwt).

> **Security callout.** A Mothership with an expired cert is a monitoring failure. Alert on `aura_tls_cert_not_after_seconds < 30 * 86400` (30 days out) so you rotate before the outage.

## Symptom: "token expired" on join

Self-explanatory. The JWT's `exp` claim has passed.

### Fix

Issue a new token:

```bash
aura mothership token issue --for alice@acme.com
```

Default expiry is 24 hours. Tell your team to run `join` within that window, or issue longer-lived tokens for offshore time zones.

## Symptom: "token already used"

Most join tokens are one-shot. A reused token is rejected.

### Fix

Issue a fresh token. If you genuinely need a multi-use token (bulk provisioning), issue one explicitly:

```bash
aura mothership token issue --for team-onboarding --uses 20 --expires-in 72h
```

Track it carefully. Multi-use tokens are a larger blast radius. See [join token security](/join-token-security).

## Symptom: "invalid signature"

The token was signed by a different Mothership, or the Mothership's JWT key was rotated.

### Diagnosis

```bash
aura mothership token decode <token>
```

Check the `iss` claim — it identifies the issuing Mothership. If it doesn't match the host you're trying to join, the user has the wrong token.

If `iss` matches but verification fails, the Mothership's JWT key was rotated after this token was issued. Check:

```bash
aura mothership audit --filter key_rotation --since 7d
```

### Fix

Re-issue from the current Mothership:

```bash
aura mothership token issue --for alice@acme.com
```

## Symptom: "clock skew too large"

JWT `exp` / `iat` validation is strict. Clocks more than 60 seconds apart fail.

```text
error: token not yet valid (iat is in the future)
```

### Diagnosis

```bash
# on peer
date -u
# on mothership
date -u
```

Difference greater than a minute? NTP is broken somewhere.

### Fix

```bash
# Linux
sudo timedatectl set-ntp true
sudo systemctl restart systemd-timesyncd

# macOS
sudo sntp -sS time.apple.com
```

Wait a minute, try again. If the problem persists, your NTP server is unreachable — check firewall rules for UDP 123.

## Symptom: peers see each other as offline despite being online

Control connection is up but peer-to-peer isn't discovering.

### Diagnosis

On a peer:

```bash
aura mothership peers --detail
```

```text
peer_id          subject          status    last_seen  direct_connects
peer_0a1b2c3d    alice@acme       online    3s         3/3
peer_1b2c3d4e    bob@acme         online    8s         0/3
```

`0/3` direct connections to Bob means every attempt to reach Bob directly failed. Three direct connect attempts = three network hints. All failing usually means Bob's announced addresses aren't reachable.

### Fix

This is almost always a NAT or VPN issue. Two paths forward:

1. **Put everyone on a mesh VPN** like Tailscale. Announced addresses become reachable. See [p2p architecture](/p2p-architecture).
2. **Live with relay.** Direct connections are an optimization. Mothership will fall back to relaying through itself; nothing breaks, it just uses more Mothership bandwidth.

If you use Tailscale but direct is still failing, check that the Tailscale IP is being announced:

```bash
aura mothership peer-config --show-announced
```

You should see a `100.x.x.x` address. If not, restart `aura sync` after Tailscale is up.

## Symptom: WAL growing without bound

The WAL keeps getting bigger even when online.

### Diagnosis

```bash
aura wal status
```

```text
WAL:
  unacked events: 15234
```

Thousands of unacked events while online means events are being written faster than the Mothership is accepting them, or the Mothership is silently failing to ack.

On the Mothership:

```bash
aura mothership status
```

Check `sync backlog`. A growing backlog means the Mothership is overloaded. See [scaling](/scaling-mothership).

### Fix

- If the Mothership is CPU-bound: raise `sync_workers` (see [scaling](/scaling-mothership)).
- If the Mothership is network-bound: look at `iftop` / `nload` on the host.
- If the issue is one noisy peer: find it with `aura mothership peers --detail | sort -k events_per_min`, and coordinate with them to reduce push rate.

## Symptom: "peer revoked" on reconnect

Peer was revoked by an admin. Expected behavior.

### Fix

If revocation was intentional: nothing to do.

If it was accidental: a Mothership admin can un-revoke:

```bash
aura mothership peer unrevoke peer_0a1b2c3d
```

And the peer will reconnect on its next retry.

## Symptom: "protocol version mismatch"

Peer binary and Mothership binary are from incompatible Aura versions.

```text
error: protocol version mismatch (peer: v1, mothership: v2)
```

### Fix

Update the older of the two. Mothership protocol is stable within major versions; a version bump of Mothership may require peers to upgrade to a compatible point release:

```bash
# on peer
aura self-update
```

Coordinate Mothership upgrades with a heads-up to the team: "Rolling Mothership to v0.13.0 at 4pm; you may see a brief disconnect and a prompt to `aura self-update`."

## Symptom: health endpoint returns 503

Mothership is up but `/healthz` returns `503`.

### Diagnosis

```bash
curl -k https://mship:7777/healthz
```

```text
{"status":"degraded","reason":"wal_behind"}
```

Common reasons:

- `wal_behind`: disk I/O can't keep up with write rate. Get faster disks.
- `tls_cert_expiring_soon`: cert expires within 7 days. Rotate.
- `federation_partner_unreachable`: a federated Mothership is down. Reach out.

### Fix

Each reason has a different fix. `aura doctor` on the host expands the diagnosis.

## Symptom: "aura doctor" shows a warning you don't understand

`aura doctor` outputs compact diagnostic codes. The full explanation for each code is in the output:

```bash
aura doctor --explain DR023
```

Covers remediation steps, when the check was added, and links to the relevant documentation.

## When to Open an Issue

After you have:

- Read this page and [aura doctor](/aura-doctor) output.
- Checked Mothership logs for the specific error.
- Verified your version is current.
- Confirmed the issue is not network- or clock-related.

File an issue on the Aura repository with:

- Your Aura version (`aura --version`).
- The exact error text.
- Redacted output from `aura mothership status` and `aura doctor`.
- A rough sketch of your topology (number of peers, Motherships, network).

Do not include raw tokens, private keys, or TLS material in issue reports.

## Next Steps

- [Understand TLS and JWT in depth](/tls-and-jwt)
- [Tune for scale](/scaling-mothership)
- [Run a persistent, monitored Mothership](/persistent-daemon)