# Disaster Recovery

## Purpose
This runbook covers the practical recovery flow for the supported self-hosted DecisionGraph deployment.
It is intentionally single-node and Postgres-first.
## Recovery Priorities
In order:
- protect the append-only event log
- restore service reachability
- restore projection readability
- restore workflow and operator-console usability
- record what happened and what manual actions were required
## Scenario 1: App Node Restart Or Host Reboot
Symptoms:
- Phoenix is down
- `/api/healthz` is unreachable
- Postgres is still healthy
Operator flow:
- restart the BEAM runtime
- wait for `/api/healthz` to return `200`
- check `GET /api/v1/projections/health`
- read one known trace
- read the workflow inbox if workflows are in use
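The wait-and-verify steps above can be sketched as a small poll loop. This is an operator-side helper, not part of DecisionGraph itself: only the `/api/healthz` path and the 200-on-healthy contract come from this runbook, and how `probe` actually performs the HTTP GET (host, port, client) is left to the operator.

```python
import time

def wait_for_health(probe, timeout_s=120, interval_s=5):
    """Poll until the app node reports healthy or the timeout expires.

    `probe` is any zero-argument callable returning an HTTP status code,
    e.g. a thin wrapper around GET /api/healthz on the restarted node.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused while the BEAM runtime is still booting
        time.sleep(interval_s)
    return False
```

A probe can be as simple as `lambda: urllib.request.urlopen(base_url + "/api/healthz").status`, where `base_url` is whatever your install uses.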
Expected outcome:
- persisted traces remain available
- workflow items remain available
- projections are current or catching up from durable cursors
## Scenario 2: Postgres Temporarily Unavailable
Symptoms:
- health degrades or read/write surfaces fail
- logs show repo or query failures
Operator flow:
- restore Postgres availability first
- restart the app node if the runtime does not reconnect cleanly
- check `/api/healthz`
- check projection health for lag or open failures
- run an explicit replay only if projections remain stale after recovery
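The "replay only if projections remain stale" decision can be made mechanical by sampling the backlog twice: a backlog that is shrinking is catching up on its own, while one that is flat or growing likely needs an explicit replay. A sketch, assuming the health payload exposes a `pending_events` count as the verification steps elsewhere in this runbook suggest; `read_health` is whatever wrapper the operator has around `GET /api/v1/projections/health`.

```python
import time

def projections_remain_stale(read_health, wait_s=30):
    """Decide whether an explicit replay is warranted after Postgres recovery.

    Samples the projection backlog twice: if `pending_events` is nonzero and
    not shrinking, the projections are stale rather than merely catching up.
    The `pending_events` field name is an assumption taken from this
    runbook's verification step and may differ in your deployment.
    """
    first = read_health().get("pending_events", 0)
    if first == 0:
        return False  # nothing pending; no replay needed
    time.sleep(wait_s)
    second = read_health().get("pending_events", 0)
    return second >= first  # backlog not shrinking: likely stale
```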
## Scenario 3: Projection Corruption Or Stale Reads
Symptoms:
- `GET /api/v1/projections/health` shows stale projections or open failures
- trace reads return `409` or look incomplete
- operator console shows replay or digest concerns
Operator flow:
- inspect projection health and open failure rows
- decide whether `catch_up` is sufficient or `rebuild` is safer
- run the replay or rebuild with an explicit reason
- verify projections return to `pending_events = 0`
- re-read the affected trace or graph surface
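The catch_up-versus-rebuild call is ultimately operator judgment, but the usual heuristic can be written down: open failure rows mean the projection state itself is suspect, so a rebuild is safer, while plain backlog with no failures can usually be closed with catch_up. A sketch only; the `open_failures` and `pending_events` field names are assumptions modelled on the health surface described in this runbook.

```python
def choose_replay_mode(health):
    """Heuristic for the catch_up-vs-rebuild decision; not a substitute
    for inspecting the open failure rows yourself.

    `health` is the parsed projection-health payload; field names here
    are illustrative assumptions, not a fixed API contract.
    """
    if health.get("open_failures"):
        return "rebuild"      # projection state is suspect
    if health.get("pending_events", 0) > 0:
        return "catch_up"     # plain lag, no recorded failures
    return "healthy"          # nothing to do
```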
## Scenario 4: Suspected Event-Log Damage
Symptoms:
- invalid payload-hash drift
- impossible trace ordering
- restore drill or parity behavior suggests source-of-truth damage
Operator flow:
- stop writes
- preserve the broken database for investigation
- restore the latest known-good Postgres backup
- restart DecisionGraph
- verify health and trace readability
- rebuild projections if needed
Never try to repair the event log manually in place unless you are doing deliberate low-level forensics outside the supported operator flow.
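The final verification steps can be gated with a small helper so the node is not declared readable again on a partial restore. Illustrative only: `get_status` is an operator-supplied function mapping a path to an HTTP status code, and the trace-read paths are placeholders for whatever read surfaces your install exposes, not a documented API.

```python
def verify_recovery(get_status, known_trace_paths):
    """Post-restore gate: healthy only if /api/healthz returns 200 and
    every known-good trace reads back successfully.

    `get_status(path) -> int` is supplied by the operator; the trace
    paths passed in are deployment-specific placeholders.
    """
    if get_status("/api/healthz") != 200:
        return False
    return all(get_status(path) == 200 for path in known_trace_paths)
```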
## Scenario 5: Failed Upgrade
Symptoms:
- the new release boots but smoke checks fail
- migrations succeed but core reads or projection health regress
Operator flow:
- stop the upgraded runtime
- restore the pre-upgrade backup
- return to the previous known-good source release or git SHA
- restart the runtime
- re-run smoke tests
This is the official rollback model.
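Re-running smoke tests after a rollback benefits from collecting every failure rather than stopping at the first, so the operator sees the full picture before declaring the previous release healthy. A minimal sketch; the check names and bodies are deployment-specific and purely illustrative.

```python
def failed_smoke_checks(checks):
    """Run post-rollback smoke checks and report which ones failed.

    `checks` is a list of (name, zero-arg callable returning bool).
    An exception counts as a failure, so one broken check cannot abort
    the rest of the rollback verification.
    """
    failed = []
    for name, check in checks:
        try:
            if not check():
                failed.append(name)
        except Exception:
            failed.append(name)
    return failed
```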
## Manual Actions That Are Still Required
Phase 8 is honest about what is still operator-driven:
- deciding whether a replay run left behind after a crash should be ignored, retried, or replaced with a rebuild
- deciding whether a projection problem is semantic or transient
- restoring the correct backup artifact after failed upgrades
- maintaining and testing service-account configuration for non-dev installs
## Recovery Checklist
Use the detailed checklist in:
`docs/operations/RESTART_AND_RECOVERY_CHECKLIST.md`
That checklist is the operational gate for declaring the node healthy again.