Operations Runbooks

Step-by-step guides for common operational scenarios. Each runbook covers diagnosis, resolution, and prevention.

Agent Not Connecting

When a mezd agent fails to register or loses its tunnel:

  1. Check agent logs for connection errors:
    Check agent logs bash
    journalctl -u mezd -n 100 --no-pager
  2. Verify the join token has not expired:
    List active tokens bash
    mezctl tokens ls
  3. Check certificate expiry on the agent. The agent's X.509 certificate lives under $MEZITE_DATA_DIR/agent/x509.pem (default /var/lib/mezite/agent/x509.pem):
    Inspect agent certificate bash
    openssl x509 -in /var/lib/mezite/agent/x509.pem -noout -dates
  4. Check the reverse tunnel — verify the agent can reach port 3024 on the proxy:
    Test tunnel connectivity bash
    curl -v telnet://mezite.example.com:3024
  5. Check network / firewall rules — ensure ports 3024 and 3025 are open from the agent to the Mezite server.
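
Steps 4 and 5 can be scripted as a quick reachability probe. The sketch below is a minimal example, not part of Mezite: it assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are installed on the agent host, and the function names check_port and report_ports are hypothetical.

```shell
# check_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT opens within the timeout.
# Relies on bash's built-in /dev/tcp redirection (an assumption: bash is installed).
check_port() {
  local host="$1" port="$2" wait="${3:-3}"
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# report_ports HOST PORT...
# Prints one line per port and returns non-zero if any port is unreachable.
report_ports() {
  local host="$1" rc=0 port
  shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port UNREACHABLE"
      rc=1
    fi
  done
  return "$rc"
}
```

Run from the agent host, for example report_ports mezite.example.com 3024 3025. A failure here points at DNS, routing, or firewall rules rather than at tokens or certificates.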

Backup and Restore

All Mezite state lives in a single database: users, roles, tokens, CA private keys (encrypted at rest with ca_key_passphrase), audit events, and session recording metadata. Regular backups of that database, plus the recording storage backend, are sufficient to recover the cluster.

  1. PostgreSQL backup with pg_dump:
    Backup PostgreSQL bash
    pg_dump -h localhost -U mezite -d mezite -F c -f mezite-backup-$(date +%Y%m%d).dump
  2. SQLite backup — the database file lives at $data_dir/mezhub.db (default /var/lib/mezite/mezhub.db). Use SQLite's online backup command so writes in flight don't corrupt the copy:
    Backup SQLite bash
    sqlite3 /var/lib/mezite/mezhub.db ".backup '/backups/mezhub-$(date +%Y%m%d).db'"
  3. Restore (PostgreSQL):
    Restore PostgreSQL bash
    pg_restore -h localhost -U mezite -d mezite --clean --if-exists mezite-backup-20260324.dump
  4. Restore (SQLite): stop mezhub, copy the backup file into place, and start mezhub again.
    Restore SQLite bash
    sudo systemctl stop mezhub
    sudo cp /backups/mezhub-20260324.db /var/lib/mezite/mezhub.db
    sudo chown mezite:mezite /var/lib/mezite/mezhub.db
    sudo systemctl start mezhub
  5. Back up the recording storage as well: if you use S3 recording, snapshot the bucket; if you use local recording, back up $data_dir/recordings.
  6. Verify the backup by restoring to a test instance before relying on it for disaster recovery. You will also need the original MEZITE_CA_KEY_PASSPHRASE on the restored instance to decrypt CA private keys.
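
Before a full test restore, a cheap sanity check can catch truncated or corrupt backup files early. The sketch below is an illustration only: verify_backup is a hypothetical helper, it assumes the .dump and .db file extensions used in the examples above, and it does not replace restoring to a test instance.

```shell
# verify_backup FILE
# Returns 0 if the backup file passes a basic structural check,
# 1 if it is missing or damaged, 2 if the file type is not recognized.
verify_backup() {
  local file="$1"
  [ -r "$file" ] || { echo "missing or unreadable: $file" >&2; return 1; }
  case "$file" in
    *.dump)
      # pg_restore --list reads the archive's table of contents without restoring.
      pg_restore --list "$file" >/dev/null || return 1
      ;;
    *.db)
      # PRAGMA integrity_check prints the single word "ok" for a healthy database.
      [ "$(sqlite3 "$file" 'PRAGMA integrity_check;')" = "ok" ] || return 1
      ;;
    *)
      echo "unknown backup type: $file" >&2
      return 2
      ;;
  esac
}
```

For example, verify_backup /backups/mezhub-20260324.db could run right after the nightly backup job, so that a bad file is noticed the same day rather than during an incident.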

CA Certificate Expiry

CA certificates and private keys are stored in the database (encrypted at rest), not on disk. Monitor expiry with mezctl ca status and start a rotation when the remaining lifetime drops below 90 days. Rotation is a multi-phase state machine with four phases: init -> update_clients -> update_servers -> complete. mezctl ca status reports both the phase (rotation_phase) and the rotation state (rotation_state: standby / in_progress / rollback). The state returns to standby only when the final advance into complete runs.

  1. Check rotation state and expiry for every CA type (host, user, spiffe):
    Check CA status bash
    mezctl ca status
  2. Start rotation of the host CA — this mints a new key and moves rotation into the update_clients phase. Agents re-download CAs on the next reconnect; once they have the new bundle, advance.
    Rotate host CA bash
    mezctl ca rotate --type=host
  3. Advance through the rotation phases:
    Advance host CA rotation bash
    # update_clients -> update_servers (re-issue server certs with new CA)
    mezctl ca advance --type=host
    
    # update_servers -> complete (finalize, drop old CA — irreversible)
    mezctl ca advance --type=host
  4. Rotate the user CA the same way. Repeat for --type=spiffe if you use workload identity.
    Rotate user CA bash
    mezctl ca rotate --type=user
    mezctl ca advance --type=user
    mezctl ca advance --type=user
  5. Rollback is only valid while rotation_state = in_progress. If something looks wrong before you run the final advance, abort with mezctl ca rollback --type=<host|user|spiffe> to restore the previous CA. Once the final advance into complete runs, the previous CA's keys are deleted and the rotation cannot be rolled back.
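
The phase order and the rollback rule can be written down as a small lookup, which is handy when wrapping rotation in automation. This is a sketch of the rules as described in this section, not Mezite code: next_phase and can_rollback are hypothetical helpers, and it assumes a rotation starts from the init phase.

```shell
# next_phase PHASE
# Prints the phase that the next rotate/advance command leads to.
# Returns non-zero when there is nothing left to advance.
next_phase() {
  case "$1" in
    init)           echo update_clients ;;  # reached via: mezctl ca rotate
    update_clients) echo update_servers ;;  # reached via: first mezctl ca advance
    update_servers) echo complete ;;        # reached via: final mezctl ca advance (irreversible)
    *)              return 1 ;;             # complete or unknown phase
  esac
}

# can_rollback STATE
# Rollback is only valid while rotation_state is in_progress.
can_rollback() {
  [ "$1" = "in_progress" ]
}
```

A wrapper script could read rotation_phase and rotation_state from mezctl ca status, then refuse to call mezctl ca rollback unless can_rollback succeeds. How to parse that output is left out here because its format is not specified in this runbook.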

Database Performance

Applies to PostgreSQL deployments. SQLite is single-process and tuning is limited to disk performance.

  1. Identify slow queries:
    Find slow queries sql
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY duration DESC
    LIMIT 10;
  2. Run VACUUM ANALYZE to reclaim space from dead rows and refresh the planner statistics:
    Vacuum and analyze sql
    VACUUM ANALYZE;
  3. Check connection pool usage — ensure max_connections in PostgreSQL is set higher than the sum of all Mezite instances' pool sizes.
  4. Lock contention — check for blocked queries:
    Check for lock contention sql
    SELECT blocked.pid, blocked.query, blocking.pid AS blocking_pid, blocking.query AS blocking_query
    FROM pg_stat_activity blocked
    JOIN pg_locks bl ON bl.pid = blocked.pid
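    -- Match only locks that the other session actually holds (bk.granted), not other waiters
    JOIN pg_locks bk ON bk.locktype = bl.locktype AND bk.relation = bl.relation AND bk.pid != bl.pid AND bk.granted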
    JOIN pg_stat_activity blocking ON blocking.pid = bk.pid
    WHERE NOT bl.granted;
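
The connection-pool check in step 3 is simple arithmetic and can be scripted. The sketch below assumes you already know each instance's pool size and the server's limit (for example from SHOW max_connections; in psql); pool_headroom is a hypothetical helper and the numbers in the usage note are made up.

```shell
# pool_headroom MAX_CONNECTIONS POOL_SIZE...
# Prints max_connections minus the summed pool sizes.
# Returns non-zero when the pools together could reach or exceed the limit.
pool_headroom() {
  local max="$1" total=0 size
  shift
  for size in "$@"; do
    total=$((total + size))
  done
  echo $((max - total))
  [ "$total" -lt "$max" ]
}
```

For three instances with pools of 20 each against max_connections = 100, pool_headroom 100 20 20 20 prints 40 and succeeds. Keep some headroom beyond zero, since PostgreSQL reserves a few connections for superusers and other clients (backups, monitoring) also count against the limit.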

Security Hardening Checklist

  • TLS: Ensure all Mezite ports (3025, 3080, 3023, 3024) use TLS. Terminate TLS at the proxy where possible to preserve mutual TLS; when an upstream load balancer must terminate TLS, set auth.grpc_allow_http: true (or MEZITE_AUTH_H2C=true) and pin a trusted-IP header / PROXY-protocol source via proxy.trusted_ip_header or proxy.proxy_protocol_trusted_cidrs.
  • Authentication: Require MFA for all human users. Keep certificate lifetimes short by setting max_session_ttl on each role (default 12h) — there is no global session-TTL knob. Prefer SSO connectors (OIDC, SAML, GitHub) over local passwords.
  • Authorization: Apply least-privilege roles. Restrict SSH logins by node label. Require access requests for privileged roles.
  • Network: Restrict the auth port (3025) to internal networks. Expose only the proxy HTTPS port (3080) and the SSH port (3023) publicly. Use firewall rules to limit agent reverse-tunnel access (3024) to known agent subnets where possible.
  • Audit: Enable session recording (recording.backend). Forward audit events to an external SIEM with the webhook or file sink (MEZITE_AUDIT_SINK_WEBHOOK_URL / MEZITE_AUDIT_SINK_FILE_PATH). Audit events are not automatically pruned — operators must manage retention on the database and external sinks themselves.
  • CA private keys: Always set ca_key_passphrase (via MEZITE_CA_KEY_PASSPHRASE) in production. Without it, CA signing keys and other sensitive fields are stored in plaintext in the database.
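
The last item lends itself to an automated guard. As a minimal sketch, a start-up wrapper or pre-start hook could refuse to launch when the passphrase is missing; check_ca_passphrase is a hypothetical helper, and wiring it into your service manager is left as an assumption about your deployment.

```shell
# check_ca_passphrase
# Returns non-zero (with a message on stderr) when MEZITE_CA_KEY_PASSPHRASE
# is unset or empty in the current environment.
check_ca_passphrase() {
  if [ -z "${MEZITE_CA_KEY_PASSPHRASE:-}" ]; then
    echo "MEZITE_CA_KEY_PASSPHRASE is not set; CA keys would be stored in plaintext" >&2
    return 1
  fi
}
```

The check only confirms that the variable is present; it does not validate that the value matches the passphrase used to encrypt existing keys.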

Zero-Downtime Upgrade

  1. Pre-flight checks: back up the database (see Backup and Restore) and confirm the running cluster is healthy.
    Pre-flight bash
    # Verify the proxy is up
    curl -sf https://mezite.example.com:3080/healthz
    
    # Back up PostgreSQL (or SQLite — see Backup and Restore)
    pg_dump -h localhost -U mezite -d mezite -F c -f /backups/pre-upgrade-$(date +%Y%m%d).dump
  2. Database migrations run automatically when mezhub starts — there is no separate mezhub migrate subcommand. Migrations are forward-only and idempotent; the first mezhub instance to start with the new binary applies them before serving traffic.
  3. Rolling restart of mezhub instances: Replace one instance at a time. Verify health with /healthz before proceeding to the next. Multi-instance deployments rely on PostgreSQL with leader election (SQLite deployments are single-instance).
  4. Upgrade agents: Agents are backward-compatible with newer servers. Upgrade them after the server rollout is complete.
    Upgrade agent binary bash
    sudo systemctl stop mezd
    sudo cp mezd-new /usr/local/bin/mezd
    sudo systemctl start mezd
  5. Rollback: If issues arise, stop the new instances, restore the database backup, and restart with the previous binary version. Because migrations are forward-only, a binary that pre-dates a migration cannot run against the upgraded schema — pin a tested previous version before relying on it for rollback.
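
The "verify health before proceeding" step in the rolling restart can be automated with a small polling loop. The sketch below is one possible approach: wait_healthy is a hypothetical helper, and the retry count and delay are arbitrary defaults rather than recommended values.

```shell
# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until "curl -sf" succeeds or the attempts run out.
# Returns 0 once the endpoint responds without an HTTP error, 1 on timeout.
wait_healthy() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $url" >&2
  return 1
}
```

After restarting one mezhub instance, call wait_healthy https://mezite.example.com:3080/healthz and move on to the next instance only if it returns 0. If the health endpoint sits behind a load balancer, point the URL at the individual instance instead, so a healthy peer does not mask a failed restart.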