Restart And Recovery Checklist

Purpose

This checklist defines the minimum verified steps for deciding that a single-node self-hosted DecisionGraph deployment is healthy again after restart or recovery.

Checklist

  • the BEAM runtime starts without manual code changes
  • /api/healthz returns 200
  • DecisionGraph.Store.deployment_snapshot/0 or the health response shows repo_started? == true
  • GET /api/v1/projections/health succeeds for the tenant you care about
  • projection pending_events is 0, or it is decreasing across successive checks with no new failures
  • no unexpected open rows exist in dg_projection_failures
  • no unexpected replay runs remain stuck in queued or running
  • one known trace can be read successfully
  • one known workflow item can be read successfully when workflows exist
  • the operator console loads and reflects the same tenant state
  • recent logs carry request IDs and show no uncontrolled retry storm
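
The machine-checkable items above can be folded into one pass/fail evaluation. The following is a minimal sketch in Python, not part of DecisionGraph: it assumes the operator has already collected the probe results (HTTP status codes, the decoded health body, a few pending_events samples, and row counts) by whatever means the deployment offers. The function names, the dictionary key "repo_started?", and the rule that "catching up" means strictly decreasing samples are assumptions made for illustration.

```python
from typing import Any, List, Mapping, Sequence


def is_catching_up(pending_samples: Sequence[int]) -> bool:
    """Return True when the latest pending_events sample is 0, or when
    at least two samples were taken and each is lower than the one before."""
    if not pending_samples:
        return False
    if pending_samples[-1] == 0:
        return True
    pairs = zip(pending_samples, pending_samples[1:])
    return len(pending_samples) >= 2 and all(b < a for a, b in pairs)


def failed_checks(
    healthz_status: int,
    health_body: Mapping[str, Any],
    projections_status: int,
    pending_samples: Sequence[int],
    unexpected_open_failures: int,
    stuck_replay_runs: int,
) -> List[str]:
    """Return the names of checklist items that did not pass.

    All inputs are values gathered beforehand; the key "repo_started?"
    is an assumed shape for the health response, not a confirmed one.
    """
    failures: List[str] = []
    if healthz_status != 200:
        failures.append("healthz")
    if health_body.get("repo_started?") is not True:
        failures.append("repo_started")
    if not 200 <= projections_status < 300:
        failures.append("projections_health")
    if not is_catching_up(pending_samples):
        failures.append("pending_events")
    if unexpected_open_failures > 0:
        failures.append("projection_failures")
    if stuck_replay_runs > 0:
        failures.append("stuck_replays")
    return failures
```

An empty result means every automated item passed. The trace read, the workflow item read, the operator console, and the log review still need a person to confirm them.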

Tested Phase 8 Drill

The current Phase 8 restart drill validated all of the following on the supported topology:

  • a trace written before restart remained readable after restart
  • the related workflow item remained readable after restart
  • projection health returned with pending_events = 0
  • the event-log tail remained unchanged across the controlled restart
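
A drill like this reduces to comparing observations taken before and after the restart. The sketch below shows one way to express that comparison in Python. It is illustrative only: DrillSnapshot, its field names, and the choice of a string to represent the event-log tail are hypothetical, and how each value is captured is left to the operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrillSnapshot:
    """Observations recorded at one point in the drill (hypothetical shape)."""
    trace_readable: bool
    workflow_item_readable: bool
    pending_events: int
    event_log_tail: str  # assumed: any stable identifier of the last event


def drill_regressions(before: DrillSnapshot, after: DrillSnapshot) -> List[str]:
    """Return the drill conditions that no longer hold after the restart."""
    problems: List[str] = []
    if not after.trace_readable:
        problems.append("trace not readable after restart")
    if not after.workflow_item_readable:
        problems.append("workflow item not readable after restart")
    if after.pending_events != 0:
        problems.append("pending_events did not return to 0")
    if after.event_log_tail != before.event_log_tail:
        problems.append("event-log tail changed across restart")
    return problems
```

The tail comparison only makes sense for a controlled restart with no writes in flight, which matches the condition stated in the drill.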

Reference evidence:

  • docs/benchmarks/PHASE_8_RESILIENCE_BASELINE.md