Operations Runbooks
Step-by-step guides for common operational scenarios. Each runbook covers diagnosis, resolution, and prevention.
Agent Not Connecting
When a mezd agent fails to register or loses its tunnel:
- Check agent logs for connection errors:
Check agent logs bash
journalctl -u mezd -n 100 --no-pager
- Verify the join token has not expired:
List active tokens bash
mezctl tokens ls
- Check certificate expiry on the agent. The agent's X.509 certificate lives under $MEZITE_DATA_DIR/agent/x509.pem (default /var/lib/mezite/agent/x509.pem):
Inspect agent certificate bash
openssl x509 -in /var/lib/mezite/agent/x509.pem -noout -dates
- Check the reverse tunnel: verify the agent can reach port 3024 on the proxy:
Test tunnel connectivity bash
curl -v telnet://mezite.example.com:3024
- Check network / firewall rules: ensure ports 3024 and 3025 are open from the agent to the Mezite server.
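The certificate check above can be scripted so that an expiring certificate is flagged before the agent drops off. The following is a minimal sketch, assuming openssl and GNU date are installed; the function names and the 30-day threshold are illustrative and not part of Mezite.

```bash
#!/usr/bin/env bash
# Sketch: warn when the agent certificate is close to expiry.
# Function names and the 30-day threshold are illustrative assumptions.

# Whole days between two Unix timestamps (expiry, now).
days_left() {
  local not_after=$1 now=$2
  echo $(( (not_after - now) / 86400 ))
}

# Read notAfter from a PEM certificate and convert it to days remaining.
# Requires openssl and GNU date (for `date -d`).
cert_days_left() {
  local cert=$1 end
  end=$(openssl x509 -in "$cert" -noout -enddate) || return 1
  end=${end#notAfter=}
  days_left "$(date -d "$end" +%s)" "$(date +%s)"
}

# Succeeds if the certificate has more than $2 days left (default 30).
cert_ok() {
  local days
  days=$(cert_days_left "$1") || return 2
  [ "$days" -gt "${2:-30}" ]
}
```

A cron job or monitoring probe could call cert_ok /var/lib/mezite/agent/x509.pem and alert on a non-zero exit status.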
Backup and Restore
All Mezite state lives in a single database: users, roles, tokens, CA private
keys (encrypted at rest with ca_key_passphrase), audit events, and
session-recording metadata. Regular backups of that database, plus the
recording storage backend, are sufficient to recover the cluster.
- PostgreSQL backup with pg_dump:
Backup PostgreSQL bash
pg_dump -h localhost -U mezite -d mezite -F c -f mezite-backup-$(date +%Y%m%d).dump
- SQLite backup: the database file lives at $data_dir/mezhub.db (default /var/lib/mezite/mezhub.db). Use SQLite's online backup command so writes in flight don't corrupt the copy:
Backup SQLite bash
sqlite3 /var/lib/mezite/mezhub.db ".backup '/backups/mezhub-$(date +%Y%m%d).db'"
- Restore (PostgreSQL):
Restore PostgreSQL bash
pg_restore -h localhost -U mezite -d mezite --clean --if-exists mezite-backup-20260324.dump
- Restore (SQLite): stop mezhub, copy the backup file into place, and start mezhub again.
Restore SQLite bash
sudo systemctl stop mezhub
sudo cp /backups/mezhub-20260324.db /var/lib/mezite/mezhub.db
sudo chown mezite:mezite /var/lib/mezite/mezhub.db
sudo systemctl start mezhub
- Don't forget the recording bucket / dir: if you use S3 recording, snapshot the bucket. If you use local recording, back up $data_dir/recordings.
- Verify the backup by restoring to a test instance before relying on it for disaster recovery. You will also need the original MEZITE_CA_KEY_PASSPHRASE on the restored instance to decrypt CA private keys.
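The backup commands above can be wrapped so that each run writes one dated file and performs a quick sanity check on it. This is a sketch under stated assumptions: BACKUP_DIR and the function names are hypothetical, and the integrity checks only confirm that the file is readable. They do not replace a test restore.

```bash
#!/usr/bin/env bash
# Sketch: one dated backup per run for either backend.
# BACKUP_DIR and the function names are illustrative assumptions.

BACKUP_DIR=${BACKUP_DIR:-/backups}

# Build a dated file name, e.g. /backups/mezhub-20260324.db
backup_path() {
  local prefix=$1 ext=$2 stamp=${3:-$(date +%Y%m%d)}
  echo "$BACKUP_DIR/$prefix-$stamp.$ext"
}

# SQLite: online .backup, then a quick integrity check on the copy.
backup_sqlite() {
  local src=${1:-/var/lib/mezite/mezhub.db} dest
  dest=$(backup_path mezhub db)
  sqlite3 "$src" ".backup '$dest'" || return 1
  [ "$(sqlite3 "$dest" 'PRAGMA integrity_check;')" = "ok" ]
}

# PostgreSQL: custom-format dump, then confirm the archive is readable.
backup_postgres() {
  local dest
  dest=$(backup_path mezite-backup dump)
  pg_dump -h localhost -U mezite -d mezite -F c -f "$dest" || return 1
  pg_restore --list "$dest" > /dev/null
}
```

Calling backup_sqlite or backup_postgres from a scheduled job returns a non-zero status when either the backup or the sanity check fails.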
CA Certificate Expiry
CA certificates and private keys are stored in the database (encrypted at
rest), not on disk. Monitor expiry with mezctl ca status and start
a rotation when the remaining lifetime drops below 90 days. Rotation is a multi-phase
state machine with four phases — init → update_clients → update_servers → complete. mezctl ca status reports both the phase (rotation_phase) and the rotation
state (rotation_state: standby / in_progress
/ rollback). The state flips back to standby only
when the final advance into complete runs.
- Check rotation state and expiry for every CA type (host, user, spiffe):
Check CA status bash
mezctl ca status
- Start rotation of the host CA: this mints a new key and moves rotation into the update_clients phase. Agents re-download CAs on the next reconnect; once they have the new bundle, advance.
Rotate host CA bash
mezctl ca rotate --type=host
- Advance through the rotation phases:
Advance host CA rotation bash
# update_clients -> update_servers (re-issue server certs with new CA)
mezctl ca advance --type=host
# update_servers -> complete (finalize, drop old CA; irreversible)
mezctl ca advance --type=host
- Rotate the user CA the same way. Repeat for --type=spiffe if you use workload identity.
Rotate user CA bash
mezctl ca rotate --type=user
mezctl ca advance --type=user
mezctl ca advance --type=user
- Rollback is only valid while rotation_state = in_progress. If something looks wrong before you run the final advance, abort with mezctl ca rollback --type=<host|user|spiffe> to restore the previous CA. Once the final advance into complete runs, the previous CA's keys are deleted and the rotation cannot be rolled back.
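The phase order and the rollback rule described above can be written down as a small state machine, which is handy as a guard in automation. This is an illustrative sketch only: the function names are hypothetical, and the output format of mezctl ca status is not assumed, so the phase, state, and remaining days are passed in as plain arguments.

```bash
#!/usr/bin/env bash
# Sketch of the rotation state machine described in this section.
# Function names are illustrative; mezctl's output format is not assumed.

# Print the phase that the next step (rotate or advance) moves to.
next_phase() {
  case $1 in
    init)           echo update_clients ;;
    update_clients) echo update_servers ;;
    update_servers) echo complete ;;
    *)              return 1 ;;  # complete or unknown: nothing to advance to
  esac
}

# Rollback is only valid while rotation_state is in_progress.
can_rollback() {
  [ "$1" = in_progress ]
}

# Rotation should start once fewer than 90 days of lifetime remain.
needs_rotation() {
  [ "$1" -lt 90 ]
}
```

A wrapper could refuse to run the final advance unless an operator confirms it, since next_phase update_servers yields complete and that step cannot be rolled back.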
Database Performance
Applies to PostgreSQL deployments. SQLite is single-process and tuning is limited to disk performance.
- Identify slow queries:
Find slow queries sql
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
- Run VACUUM:
Vacuum and analyze sql
VACUUM ANALYZE;
- Check connection pool usage: ensure max_connections in PostgreSQL is set higher than the sum of all Mezite instances' pool sizes.
- Lock contention: check for blocked queries.
Check for lock contention sql
SELECT blocked.pid, blocked.query, blocking.pid AS blocking_pid, blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks bk ON bk.locktype = bl.locktype AND bk.relation = bl.relation AND bk.pid != bl.pid
JOIN pg_stat_activity blocking ON blocking.pid = bk.pid
WHERE NOT bl.granted;
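The connection pool rule above is simple arithmetic: max_connections must exceed the sum of every instance's pool size. The sketch below shows that check, assuming the pool sizes are known to the operator; the function names and example numbers are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: compare PostgreSQL max_connections with the combined pool sizes.
# Function names and example numbers are illustrative assumptions.

# Sum any number of integer pool sizes.
pool_total() {
  local total=0 size
  for size in "$@"; do
    total=$(( total + size ))
  done
  echo "$total"
}

# Succeeds when max_connections ($1) is strictly greater than
# the sum of the pool sizes that follow it.
pool_fits() {
  local max=$1
  shift
  [ "$max" -gt "$(pool_total "$@")" ]
}
```

For three instances with a pool of 20 each, pool_fits "$(psql -h localhost -U mezite -d mezite -Atc 'SHOW max_connections;')" 20 20 20 succeeds only if the server allows more than 60 connections.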
Security Hardening Checklist
- TLS: Ensure all Mezite ports (3025, 3080, 3023, 3024) use TLS. Terminate TLS at the proxy where possible to preserve mutual TLS; when an upstream load balancer must terminate TLS, set auth.grpc_allow_http: true (or MEZITE_AUTH_H2C=true) and pin a trusted-IP header / PROXY-protocol source via proxy.trusted_ip_header or proxy.proxy_protocol_trusted_cidrs.
- Authentication: Require MFA for all human users. Keep certificate lifetimes short by setting max_session_ttl on each role (default 12h); there is no global session-TTL knob. Prefer SSO connectors (OIDC, SAML, GitHub) over local passwords.
- Authorization: Apply least-privilege roles. Restrict SSH logins by node label. Require access requests for privileged roles.
- Network: Restrict the auth port (3025) to internal networks. Expose only the proxy HTTPS port (3080) and the SSH port (3023) publicly. Use firewall rules to limit agent reverse-tunnel access (3024) to known agent subnets where possible.
- Audit: Enable session recording (recording.backend). Forward audit events to an external SIEM with the webhook or file sink (MEZITE_AUDIT_SINK_WEBHOOK_URL / MEZITE_AUDIT_SINK_FILE_PATH). Audit events are not automatically pruned; operators must manage retention on the database and external sinks themselves.
- CA private keys: Always set ca_key_passphrase (via MEZITE_CA_KEY_PASSPHRASE) in production. Without it, CA signing keys and other sensitive fields are stored in plaintext in the database.
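Two items from the checklist lend themselves to a scripted pre-deployment check: the intended exposure of each port and the presence of the CA key passphrase. The sketch below is one possible encoding, assuming the passphrase is supplied through the environment; the function names and the exposure labels are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: encode two checklist items as shell checks.
# Function names and exposure labels are illustrative assumptions.

# Map a Mezite port to the audience the checklist allows to reach it.
port_exposure() {
  case $1 in
    3080|3023) echo public ;;    # proxy HTTPS and SSH
    3024)      echo agents ;;    # reverse tunnel, known agent subnets
    3025)      echo internal ;;  # auth port
    *)         return 1 ;;       # not a port named in the checklist
  esac
}

# Succeeds only when the CA key passphrase is set and non-empty.
passphrase_set() {
  [ -n "${MEZITE_CA_KEY_PASSPHRASE:-}" ]
}
```

A deployment script could abort when passphrase_set fails, and compare port_exposure against the firewall rules it is about to apply.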
Zero-Downtime Upgrade
- Pre-flight checks: back up the database (see Backup and Restore) and confirm the running cluster is healthy.
Pre-flight bash
# Verify the proxy is up
curl -sf https://mezite.example.com:3080/healthz
# Back up PostgreSQL (or SQLite; see Backup and Restore)
pg_dump -h localhost -U mezite -d mezite -F c -f /backups/pre-upgrade-$(date +%Y%m%d).dump
- Database migrations run automatically when mezhub starts; there is no separate mezhub migrate subcommand. Migrations are forward-only and idempotent; the first mezhub instance to start with the new binary applies them before serving traffic.
- Rolling restart of mezhub instances: Replace one instance at a time. Verify health with /healthz before proceeding to the next. Multi-instance deployments rely on PostgreSQL with leader election (SQLite deployments are single-instance).
- Upgrade agents: Agents are backward-compatible with newer servers. Upgrade them after the server rollout is complete.
Upgrade agent binary bash
sudo systemctl stop mezd
sudo cp mezd-new /usr/local/bin/mezd
sudo systemctl start mezd
- Rollback: If issues arise, stop the new instances, restore the database backup, and restart with the previous binary version. Because migrations are forward-only, a binary that pre-dates a migration cannot run against the upgraded schema; pin a tested previous version before relying on it for rollback.
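The rolling restart step can be sketched as a loop that restarts one instance, waits for /healthz, and halts at the first instance that does not recover. Several things here are assumptions rather than documented behavior: the hostnames, SSH access to each instance, each instance answering /healthz on port 3080, and the HEALTH_PROBE override, which exists only so the loop can be dry-run.

```bash
#!/usr/bin/env bash
# Sketch: restart mezhub instances one at a time, gating on /healthz.
# Hostnames, SSH access, and the HEALTH_PROBE hook are illustrative assumptions.

# Poll a URL until the probe succeeds or the attempts run out.
# HEALTH_PROBE can be overridden (for example with `true`) for a dry run.
wait_healthy() {
  local url=$1 attempts=${2:-30} delay=${3:-2} i
  for (( i = 1; i <= attempts; i++ )); do
    if ${HEALTH_PROBE:-curl -sf -o /dev/null} "$url"; then
      return 0
    fi
    if (( i < attempts )); then sleep "$delay"; fi
  done
  return 1
}

# Restart each host in turn; stop at the first one that does not recover.
rolling_restart() {
  local host
  for host in "$@"; do
    ssh "$host" sudo systemctl restart mezhub || return 1
    if ! wait_healthy "https://$host:3080/healthz"; then
      echo "instance $host did not become healthy; halting rollout" >&2
      return 1
    fi
  done
}
```

With the new binary already in place on each host, rolling_restart hub1.example.com hub2.example.com would process the instances in order and leave the remaining ones untouched if one fails its health check.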