Disaster Recovery

Purpose

This runbook covers the practical recovery flow for the supported self-hosted DecisionGraph deployment.

The flow intentionally assumes a single-node, Postgres-first deployment.

Recovery Priorities

In order:

  1. protect the append-only event log
  2. restore service reachability
  3. restore projection readability
  4. restore workflow and operator-console usability
  5. record what happened and what manual actions were required

Scenario 1: App Node Restart Or Host Reboot

Symptoms:

  • Phoenix is down
  • /api/healthz is unreachable
  • Postgres is still healthy

Operator flow:

  1. restart the BEAM runtime
  2. wait for /api/healthz to return 200
  3. check GET /api/v1/projections/health
  4. read one known trace
  5. read workflow inbox if workflows are in use
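The wait in step 2 can be scripted as a small poll loop. A minimal sketch, assuming the node listens on `localhost:4000` and runs as a systemd unit named `decisiongraph` (both placeholders for your install):

```shell
#!/bin/sh
# Poll /api/healthz until it returns 200, or give up after a fixed
# number of attempts. BASE_URL is an assumption; point it at your node.
BASE_URL="${BASE_URL:-http://localhost:4000}"

wait_for_healthz() {
  # $1 = max attempts, $2 = seconds between attempts
  attempts=0
  while [ "$attempts" -lt "$1" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL/api/healthz")
    [ "$code" = "200" ] && return 0
    attempts=$((attempts + 1))
    sleep "$2"
  done
  return 1
}

# Usage, after restarting the runtime (unit name is a placeholder):
#   sudo systemctl restart decisiongraph
#   wait_for_healthz 30 2 && curl -s "$BASE_URL/api/v1/projections/health"
```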

Expected outcome:

  • persisted traces remain available
  • workflow items remain available
  • projections are current or catching up from durable cursors

Scenario 2: Postgres Temporarily Unavailable

Symptoms:

  • health degrades or read/write surfaces fail
  • logs show repo or query failures

Operator flow:

  1. restore Postgres availability first
  2. restart the app node if the runtime does not reconnect cleanly
  3. check /api/healthz
  4. check projection health for lag or open failures
  5. run an explicit replay only if projections remain stale after recovery
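Steps 1–3 can be checked from the shell. A sketch, assuming `pg_isready` from the Postgres client tools is installed and the connection defaults below match your install:

```shell
#!/bin/sh
# Confirm Postgres is accepting connections before touching the app node.
pg_ready() {
  pg_isready -h "${PGHOST:-localhost}" -p "${PGPORT:-5432}" >/dev/null 2>&1
}

# Given the app's /api/healthz status code after Postgres is back,
# report whether an app restart (step 2) still looks necessary.
post_recovery_action() {
  case "$1" in
    200) echo "none" ;;          # runtime reconnected cleanly
    *)   echo "restart-app" ;;   # restart the BEAM runtime
  esac
}
```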

Scenario 3: Projection Corruption Or Stale Reads

Symptoms:

  • GET /api/v1/projections/health shows stale projections or open failures
  • trace reads return 409 or look incomplete
  • operator console shows replay or digest concerns

Operator flow:

  1. inspect projection health and open failure rows
  2. decide whether catch_up is sufficient or rebuild is safer
  3. run the replay or rebuild with an explicit reason
  4. verify projections return to pending_events = 0
  5. re-read the affected trace or graph surface
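The catch_up-versus-rebuild decision in step 2 can be made mechanical once you have the projection health numbers. A sketch with an illustrative rule; the rule itself and the replay endpoint shown in the comment are assumptions, not documented product behavior:

```shell
#!/bin/sh
# Pick a replay mode from projection health figures.
# $1 = open failure count, $2 = pending_events
replay_mode() {
  if [ "$1" -gt 0 ]; then
    echo "rebuild"     # open failures suggest bad projection state; rebuild from the log
  elif [ "$2" -gt 0 ]; then
    echo "catch_up"    # plain lag: replaying pending events is enough
  else
    echo "none"        # projections are current
  fi
}

# Hypothetical replay call with an explicit reason (step 3); check your
# install's admin API for the real endpoint and payload:
#   curl -X POST "$BASE_URL/api/v1/projections/replay" \
#     -d '{"mode":"catch_up","reason":"post-outage lag"}'
```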

Scenario 4: Suspected Event-Log Damage

Symptoms:

  • invalid payload-hash drift
  • impossible trace ordering
  • restore drill or parity behavior suggests source-of-truth damage

Operator flow:

  1. stop writes
  2. preserve the broken database for investigation
  3. restore the latest known-good Postgres backup
  4. restart DecisionGraph
  5. verify health and trace readability
  6. rebuild projections if needed

Never repair the event log manually in place unless you are deliberately performing low-level forensics outside the supported operator flow.
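Steps 2 and 3 are safer staged as a dry run that prints the commands for review before anything destructive executes. A sketch, assuming a database named `decisiongraph` and the standard Postgres client tools (all names and paths are placeholders):

```shell
#!/bin/sh
# Emit the preserve-and-restore commands without executing them.
# Review the output, then run each line by hand with writes stopped.
preserve_and_restore_cmds() {
  db="$1"; backup="$2"
  stamp=$(date +%Y%m%d%H%M%S)
  echo "psql -c \"ALTER DATABASE $db RENAME TO ${db}_broken_$stamp\""
  echo "createdb $db"
  echo "pg_restore -d $db $backup"
}

# Example:
#   preserve_and_restore_cmds decisiongraph /var/backups/decisiongraph/latest.dump
```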

Scenario 5: Failed Upgrade

Symptoms:

  • the new release boots but smoke checks fail
  • migrations succeed but core reads or projection health regress

Operator flow:

  1. stop the upgraded runtime
  2. restore the pre-upgrade backup
  3. return to the previous known-good source release or git SHA
  4. restart the runtime
  5. re-run smoke tests

This is the official rollback model.
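The smoke test in step 5 can be a single function that reads back the three core surfaces. A minimal sketch, reusing the same assumed base URL and a known trace id you supply; the trace endpoint path is an assumption:

```shell
#!/bin/sh
# Succeed only if health, projection health, and one known trace
# all read back with HTTP 200 after the rollback.
smoke() {
  base="$1"; trace_id="$2"
  h=$(curl -s -o /dev/null -w '%{http_code}' "$base/api/healthz")
  p=$(curl -s -o /dev/null -w '%{http_code}' "$base/api/v1/projections/health")
  t=$(curl -s -o /dev/null -w '%{http_code}' "$base/api/v1/traces/$trace_id")
  [ "$h" = 200 ] && [ "$p" = 200 ] && [ "$t" = 200 ]
}

# Example:
#   smoke http://localhost:4000 some-known-trace-id && echo "rollback verified"
```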

Manual Actions That Are Still Required

Phase 8 is explicit about what remains operator-driven:

  • deciding whether a replay run left behind by a crash should be ignored, retried, or replaced with a full rebuild
  • deciding whether a projection problem is semantic or transient
  • restoring the correct backup artifact after failed upgrades
  • maintaining and testing service-account configuration for non-dev installs

Recovery Checklist

Use the detailed checklist in:

  • docs/operations/RESTART_AND_RECOVERY_CHECKLIST.md

That checklist is the operational gate for declaring the node healthy again.