Single-Node Recovery
Purpose
This document defines the restart and recovery model for the first supported self-hosted DecisionGraph topology:
- one BEAM node
- one Postgres instance
- optional OTEL collector
The goal is to describe what survives restarts, what must be reloaded, and what an operator should expect after interruptions.
Recovery Principles
- Postgres is authoritative for durable state.
- The append-only event log remains the source of truth.
- Projection and workflow state are durable, and both can be rebuilt from the event log if necessary.
- In-memory worker state is disposable and must never be the only copy of progress.
Component Recovery Model
Store
What survives restart:
- `dg_event_log` rows and the `log_seq` progression
- idempotency records embedded in the event log
- projection cursor rows
- workflow tables
What reloads on restart:
- Ecto repo processes
- connection pools
- per-request context
Recovery expectation:
- if Postgres is intact, committed writes remain available after the BEAM node restarts
- if a client retries a write after uncertainty, the normal idempotency path should be used instead of manual duplicate cleanup
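The idempotency contract can be shown with a minimal, language-agnostic sketch. This is illustrative Python, not the DecisionGraph API; the `EventLog` class and key handling are hypothetical stand-ins for the real store.

```python
# Sketch of the idempotent-write contract: retrying with the same
# idempotency key must not append a second event to the log.
class EventLog:
    def __init__(self):
        self.events = []   # append-only log (dg_event_log analogue)
        self.by_key = {}   # idempotency records embedded in the log

    def append(self, idempotency_key, payload):
        # If this key was already committed, return the original event
        # instead of writing a duplicate.
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key]
        event = {"log_seq": len(self.events) + 1, "payload": payload}
        self.events.append(event)
        self.by_key[idempotency_key] = event
        return event

log = EventLog()
first = log.append("req-123", {"kind": "decision.recorded"})
# The caller is unsure whether the first write committed, so it retries
# with the same key rather than hand-editing the log:
retry = log.append("req-123", {"kind": "decision.recorded"})
assert retry == first and len(log.events) == 1
```

The point of the sketch is that duplicate cleanup never happens after the fact; the retry path itself guarantees at-most-once append per key.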
API And Web
What survives restart:
- no API-specific state needs in-memory recovery beyond normal runtime config
- request-scoped data is intentionally ephemeral
Recovery expectation:
- `/api/healthz` should return `200` once the endpoint is listening and the repo can reach Postgres
- persisted traces, workflows, and projection health should be readable again without re-ingesting data
Projector
What survives restart:
- durable cursor position in `dg_projection_cursors`
- digests in `dg_projection_digests`
- operator-visible run history in `dg_projection_runs`
- failure records in `dg_projection_failures`
What reloads on restart:
- dynamic projection workers
- retry counters
- in-memory replay bookkeeping
Recovery expectation:
- workers restart from the durable cursor rather than from any in-memory notion of progress
- stale projections should catch up from the next poll or explicit replay request
- a partially completed replay does not corrupt the event log
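Restart-from-durable-cursor can be sketched as follows. This is illustrative Python, not the projector implementation; the cursor dict stands in for a row in `dg_projection_cursors`, and the crash simulation shows why a partial replay is safe.

```python
# Sketch: only the durable cursor decides where processing resumes;
# the worker's in-memory position is disposable.
event_log = [{"log_seq": s} for s in range(1, 6)]   # durable event log
cursor = {"last_seq": 0}    # durable row (dg_projection_cursors analogue)
applied = []

def run_worker(crash_after=None):
    processed = 0
    for event in event_log:
        if event["log_seq"] <= cursor["last_seq"]:
            continue        # already projected before the restart
        applied.append(event["log_seq"])
        cursor["last_seq"] = event["log_seq"]   # advance durable cursor
        processed += 1
        if crash_after is not None and processed == crash_after:
            return          # simulate the node dying mid-replay

run_worker(crash_after=2)   # crash after projecting seq 1 and 2
run_worker()                # restart: resumes from the cursor, not from zero
assert applied == [1, 2, 3, 4, 5]   # nothing skipped, nothing double-applied
```

Because the cursor advances only after each event is projected, the restarted worker picks up exactly where the crashed one stopped, and the event log itself is never touched.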
Current caveat:
- a run that died mid-process may still appear as `running` in `dg_projection_runs` until an operator inspects it and decides whether to start a fresh replay or a rebuild
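One way to surface such orphaned runs is a staleness check over the run history. The sketch below is illustrative Python; the `updated_at` heartbeat field and the ten-minute threshold are assumptions for illustration, not the actual `dg_projection_runs` schema.

```python
from datetime import datetime, timedelta, timezone

# Sketch: flag runs still marked "running" whose last update is older
# than a threshold, so an operator can decide between a fresh replay
# and a rebuild.
def stale_running_runs(runs, now, max_age=timedelta(minutes=10)):
    return [r["id"] for r in runs
            if r["status"] == "running" and now - r["updated_at"] > max_age]

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"id": 1, "status": "running",   "updated_at": now - timedelta(hours=2)},
    {"id": 2, "status": "running",   "updated_at": now - timedelta(minutes=1)},
    {"id": 3, "status": "completed", "updated_at": now - timedelta(hours=3)},
]
assert stale_running_runs(runs, now) == [1]   # only the orphaned run
```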
Workflow Runtime
What survives restart:
- `dg_workflow_runtime`
- `dg_workflow_items`
- `dg_workflow_actions`
- `dg_workflow_notifications`
What reloads on restart:
- read-triggered workflow refresh execution
- console snapshot assembly
Recovery expectation:
- workflow items remain durable
- the first inbox, detail, export, or console read after restart advances the workflow runtime again if new source events exist
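The read-triggered refresh can be sketched as follows. This is illustrative Python, not the workflow runtime; the `runtime` dict stands in for durable workflow state, and the derivation step is simplified to appending the sequence number.

```python
# Sketch: refresh runs lazily on read, not on a background timer.
# The first read after restart advances the durable workflow state
# past any new source events, then serves the snapshot.
event_log = [1, 2, 3, 4]                      # log_seq values of source events
runtime = {"last_seq": 2, "items": [1, 2]}    # durable workflow state

def read_inbox():
    for seq in event_log:
        if seq > runtime["last_seq"]:
            runtime["items"].append(seq)      # derive a workflow item
            runtime["last_seq"] = seq         # advance durable position
    return list(runtime["items"])

# A restart happens here: in-memory state is gone, durable state remains.
assert read_inbox() == [1, 2, 3, 4]           # first read catches up
assert read_inbox() == [1, 2, 3, 4]           # later reads are idempotent
```

The design choice this illustrates: no background process needs to survive the restart, because any read path is sufficient to advance the runtime.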
Interrupted Operation Semantics
Interrupted Writes
If the event commit reached Postgres before the crash:
- the event remains durable
- the projector may simply need to catch up after restart
If the write did not commit and the caller is unsure:
- retry with the same idempotency key
- do not hand-edit the event log
Interrupted Replay Or Rebuild
If the node dies during replay:
- progress already written to projection tables and cursors remains durable
- the event log is unchanged
- operator-visible run history stays in `dg_projection_runs`
Recovery action:
- inspect projection health and replay status
- if the run is stale or orphaned, start a fresh `catch_up` or `rebuild`
- verify the projection reaches the event-log tail
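The final verification step reduces to comparing the durable cursor against the event-log tail. A minimal sketch, assuming the cursor and tail are exposed as sequence numbers (the function name is illustrative):

```python
# Sketch: after a catch_up or rebuild, the projection is current once
# its cursor has reached max(log_seq) in the event log.
def projection_current(cursor_seq, log_tail_seq):
    return cursor_seq >= log_tail_seq

tail = 120                                  # max(log_seq) in dg_event_log
assert not projection_current(97, tail)     # replay still in progress
assert projection_current(120, tail)        # projection reached the tail
```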
Interrupted Workflow Actions
Workflow actions are backed by durable workflow tables and, where applicable, source events.
Recovery action:
- reload the workflow detail view
- confirm whether the action is already present in `dg_workflow_actions`
- retry only if the intended action was not durably recorded
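The check-then-retry rule above can be sketched as a guard around the resubmit path. Illustrative Python; the `(item_id, action)` key and `submit` callable are hypothetical, not the actual schema or API.

```python
# Sketch: look the action up before resubmitting, so a retry after a
# crash cannot record the same workflow action twice.
recorded_actions = {("item-7", "acknowledge")}   # dg_workflow_actions analogue

def retry_if_missing(item_id, action, submit):
    if (item_id, action) in recorded_actions:
        return "already-recorded"    # the first attempt survived the crash
    submit(item_id, action)
    recorded_actions.add((item_id, action))
    return "submitted"

assert retry_if_missing("item-7", "acknowledge", lambda *a: None) == "already-recorded"
assert retry_if_missing("item-8", "acknowledge", lambda *a: None) == "submitted"
```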
Healthy After Restart
The supported topology is considered healthy after restart when all of these are true:
- `/api/healthz` returns `200`
- `DecisionGraph.Store.deployment_snapshot/0` reports `repo_started? == true`
- `GET /api/v1/projections/health` succeeds for the expected tenant
- projections are either current or clearly progressing toward current state
- no unexpected open failure rows are blocking normal operation
- at least one previously persisted trace can still be read
- at least one previously persisted workflow item can still be read when workflows exist
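Because the checklist is all-of, an operator script can fold it into a single boolean. A minimal sketch; every check callable here is a hypothetical stand-in for the real probe.

```python
# Sketch: the topology is healthy only when every check passes.
def healthy_after_restart(checks):
    return all(check() for check in checks)

checks = [
    lambda: True,   # /api/healthz returned 200
    lambda: True,   # deployment_snapshot/0 reported repo_started? == true
    lambda: True,   # projections current or progressing
    lambda: True,   # no unexpected open failure rows
    lambda: True,   # a previously persisted trace is readable
]
assert healthy_after_restart(checks)
assert not healthy_after_restart(checks + [lambda: False])
```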
Tested Evidence
Phase 8 evidence now includes all of these:
- live source-based node restart drill against `phase8-restart-trace`
- `mix test apps/dg_projector/test/decision_graph/projector/integration_test.exs`
- `DG_RUN_SERVICE_E2E=1 mix test apps/dg_web/test/decision_graph_web/controllers/api_service_e2e_test.exs`
Those proofs establish:
- durable trace persistence across BEAM restart
- workflow inbox recovery after restart
- projection worker restart from durable cursor state
- authenticated end-to-end API and workflow behavior on local Postgres