Phase 8 Resilience Baseline
Purpose
This note records the first evidence-backed resilience drill set for the supported self-hosted topology.
It is meant to answer one question:
- Can a GitHub-downloaded, single-node DecisionGraph install be started, backed up, restored, restarted, and trusted again without hidden steps?
Captured Environment
- capture date: 2026-03-07
- commit SHA: e642ccee946d2280d4b3953f00b38d2e57fcf2d8
- host OS: win32/nt
- topology: one BEAM node plus one Docker-hosted Postgres 16.11 instance
Install Validation
Command path:
docker compose up postgres otel-collector -d
cd beam
mix setup
Result:
- local dependencies started successfully
- mix setup completed successfully on the supported source path
- the current umbrella alias no longer emits deprecated mix cmd --app warnings
- this validated the current bootstrap path documented in docs/operations/SELF_HOSTED_INSTALL.md
Backup Artifact Drill
Safe container-side backup flow used for the drill:
docker compose exec -T postgres sh -lc "pg_dump -U decisiongraph -d decisiongraph_beam_dev --format=custom -f /tmp/phase8_valid.dump"
docker compose cp postgres:/tmp/phase8_valid.dump .tmp/phase8/phase8_valid.dump
Observed backup artifact:
- path: .tmp/phase8/phase8_valid.dump
- size: 144057 bytes
Important finding:
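- Windows PowerShell host-side redirection is not a safe default for pg_dump --format=custom: the custom format is binary, and PowerShell treats redirected output as text and can re-encode it, which corrupts the archive
- the release runbook now recommends container-side dump creation plus docker cp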
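A quick header check can catch a damaged artifact before a restore is attempted. The sketch below relies on the fact that pg_dump custom-format archives begin with the ASCII bytes PGDMP. It assumes a POSIX-style shell with head and tr available; the function name is hypothetical and not part of the project.

```shell
# Hypothetical helper: succeed only if the file starts with the
# custom-format archive magic "PGDMP".
is_custom_dump() {
  # Read the first 5 bytes; drop NUL bytes so the shell can hold the value.
  magic=$(head -c 5 "$1" 2>/dev/null | tr -d '\0')
  [ "$magic" = "PGDMP" ]
}

# Usage (not run automatically):
#   is_custom_dump .tmp/phase8/phase8_valid.dump && echo "header looks valid"
```

This only inspects the header. A stronger check is to have pg_restore read the archive's table of contents without restoring anything, for example docker compose exec -T postgres pg_restore --list /tmp/phase8_valid.dump.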
Restore Drill
Restore path:
docker compose exec -T postgres psql -U decisiongraph -d postgres -c "CREATE DATABASE decisiongraph_phase8_restore OWNER decisiongraph"
docker compose exec -T postgres pg_restore -U decisiongraph -d decisiongraph_phase8_restore --clean --if-exists --no-owner --no-privileges /tmp/phase8_valid.dump
Verification:
- source dg_event_log row count: 1185
- restored dg_event_log row count: 1185
Result:
- restore succeeded against a throwaway database
- row counts matched for the authoritative event log
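The row-count comparison can be scripted so the drill is repeatable. This is a minimal sketch, not the procedure used for the recorded run: it reuses the compose service name, role, and database names from the commands in this note, and the function names are hypothetical.

```shell
# Hypothetical helper: print the row count of one table in one database.
# $1 = database name, $2 = table name.
# -A (unaligned) and -t (tuples only) make psql print just the number.
table_count() {
  docker compose exec -T postgres \
    psql -U decisiongraph -d "$1" -At -c "SELECT count(*) FROM $2"
}

# Succeed only if both counts are non-empty and identical.
counts_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (not run automatically):
#   src=$(table_count decisiongraph_beam_dev dg_event_log)
#   dst=$(table_count decisiongraph_phase8_restore dg_event_log)
#   counts_match "$src" "$dst" && echo "row counts match: $src"
```

Matching counts show that no rows were lost, but they do not prove the rows are identical. A content checksum over the event log would be a stricter test.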
Migration And Upgrade Validation
Current Phase 8 release validation is source-based. For this release posture, upgrade validation means:
- the source artifact bootstraps cleanly
- migrations run cleanly or no-op cleanly
- rollback remains backup-first: restore from backup rather than attempt in-place down-migrations first
The current install and restore drills validated that posture on the supported topology.
Controlled Restart Drill
Drill summary:
- started mix phx.server
- confirmed /api/healthz returned 200
- wrote trace phase8-restart-trace plus an ExceptionRequested event through the authenticated HTTP API
- confirmed trace and projection health were readable
- stopped the BEAM node
- started it again
- confirmed the same trace and its workflow item were still readable
Recovered state after restart:
- recovered trace events: 3
- recovered workflow id: phase8-restart-trace:exception:ex-phase8-1
- recovered workflow status: requested
- event log last seq: 2168
- projection pending events: 0
- stale projections: 0
Result:
- persisted trace and workflow data survived the controlled BEAM restart
- the node returned to a healthy projection posture after restart
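The "confirmed /api/healthz returned 200" step, before and after the restart, can be automated with a small polling loop. The sketch below assumes curl is available and uses a hypothetical function name. The port in the usage comment is the Phoenix development default and is an assumption, since this note does not state it.

```shell
# Hypothetical helper: poll a health URL until it returns HTTP 200.
# $1 = URL, $2 = max attempts (default 30), $3 = delay in seconds (default 1)
wait_for_healthz() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s silent, -o discard the body, -w print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (port 4000 is an assumption, not taken from this note):
#   wait_for_healthz http://localhost:4000/api/healthz 30 1
```

Polling with a bounded number of attempts avoids racing the node's startup while still failing the drill if the endpoint never becomes healthy.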
Automated Recovery Evidence
Targeted projector recovery suite:
cd beam
set MIX_ENV=test
mix test apps/dg_projector/test/decision_graph/projector/integration_test.exs
Result:
5 tests, 0 failures
This suite covers:
- durable cursor resume
- replay run persistence
- worker restart from durable cursor state
- operator-visible projection failure recording
Authenticated service E2E suite:
cd beam
set MIX_ENV=test
set DG_RUN_SERVICE_E2E=1
mix test apps/dg_web/test/decision_graph_web/controllers/api_service_e2e_test.exs
Result:
3 tests, 0 failures
This suite covers:
- authenticated event writes
- projection-backed trace reads
- workflow inbox and approval flows
- workflow studio review creation
Benchmark Evidence
Phase 8 local-hosting performance evidence now lives in:
docs/benchmarks/PHASE_8_CAPACITY_MODEL.md
That note captures the current API and projector throughput baselines for the supported single-node topology.