Operations Runbook and Reproducibility¶
Self-Hosted BEAM Operations¶
For the current BEAM self-hosted platform shape, start with:
- Self-Hosted Install
- Backup and Restore
- Upgrade and Rollback
- Disaster Recovery
- Restart and Recovery Checklist
- SLOs and Alerting
- Observability Dashboards
- Self-Hosted Release Checklist
- Phase 10 Release Validation
- First Release Limitations
- Early Adopter Feedback
- Post Release Review
- API Runtime
- Projection Runtime
- Workflow Runtime
The rest of this page remains the cross-backend recovery and reproducibility summary.
Backup and Restore¶
SQLite¶
- Backup:
  - Stop writers.
  - Copy the database file and the associated `-wal`/`-shm` files if WAL is enabled.
- Restore:
  - Replace the DB files with the backup set.
  - Run `python -m decisiongraph replay <db>` and verify digests.
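If briefly stopping writers is impractical, the stdlib `sqlite3` backup API can take a consistent snapshot of a live database, WAL included. A minimal sketch (the `backup_sqlite` helper name is ours, not part of decisiongraph):

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Snapshot a live SQLite database into dest_path."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies a consistent view even in WAL mode,
        # so the -wal/-shm sidecar files need no separate handling.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Restore then follows the normal flow: replace the DB files with the snapshot and replay to verify digests.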
PostgreSQL¶
- Backup:
  - Use `pg_dump` for logical backups or storage snapshots for physical backups.
- Restore:
  - Restore dump/snapshot.
  - Re-run projection replay verification and check digest outputs.
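The logical-backup step can be scripted. A hedged sketch that shells out to `pg_dump` with a custom-format archive (the `pg_logical_backup` wrapper and its defaults are our assumptions, not project tooling):

```python
import subprocess

def pg_logical_backup(dbname: str, out_path: str, run=subprocess.run):
    """Write a custom-format pg_dump archive that pg_restore can load later."""
    cmd = [
        "pg_dump",
        "--format=custom",   # compressed archive, restorable with pg_restore
        f"--file={out_path}",
        dbname,
    ]
    return run(cmd, check=True)
```

`run` is injectable so the command construction can be exercised without a live server; in production, call it with the default.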
Migration Rollback Strategy¶
- Never partially apply migrations outside transaction boundaries.
- If migration verification fails:
- Stop writes.
- Restore latest known-good backup.
- Re-run application with previous release artifact.
- Confirm event log and projection digests.
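The first rule above can be enforced in code. A sketch against SQLite, which supports transactional DDL (the `apply_migration` helper is illustrative, not the project's migration runner):

```python
import sqlite3

def apply_migration(db_path: str, statements: list[str]) -> None:
    """Apply every statement atomically: all of them land, or none do."""
    # isolation_level=None disables implicit transactions, so the explicit
    # BEGIN/COMMIT below are the only transaction boundaries.
    con = sqlite3.connect(db_path, isolation_level=None)
    try:
        con.execute("BEGIN")
        for sql in statements:
            con.execute(sql)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```

If any statement fails mid-migration, the rollback leaves the schema exactly as it was, so the restore-from-backup path is only needed when verification fails after commit.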
Corruption Recovery¶
If projection tables are inconsistent:
- Keep event log as source of truth.
- Inspect current lag before taking action:
  - `python -m decisiongraph projection-status <db>`
  - `python -m decisiongraph projection-status <db> --include-digests`
- Run projection rebuild: `python -m decisiongraph replay <db>`
- Verify all digest outputs are present and stable across repeated runs.
If event log corruption is detected:
- Quarantine the affected DB snapshot.
- Restore from backup.
- Validate idempotency/traces with integration checks before re-enabling writes.
Reproducibility Guide¶
Use deterministic digests to prove replay equivalence:
- Export or snapshot the event log.
- Rebuild projections in a clean environment: `python -m decisiongraph replay <db>`
- Record digest values: `context_graph`, `trace_summary`, `precedent_index`, `full_projection`.
- Repeat replay in another environment and compare digests.
- Treat digest mismatch as release-blocking until root cause is resolved.
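Comparing the recorded digests between two environments can be a small helper; a sketch using the digest names listed above (the `diff_digests` function is ours):

```python
def diff_digests(run_a: dict, run_b: dict) -> list[str]:
    """Names whose digest differs, or is missing, between two replay runs."""
    keys = set(run_a) | set(run_b)
    return sorted(name for name in keys if run_a.get(name) != run_b.get(name))
```

An empty result means replay equivalence held; any non-empty result is release-blocking per the rule above.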
Multi-Writer Projection Monitoring¶
When multiple writers may append events outside the current process, use projection health checks to detect lag before projection-backed reads:
- Inspect current state from the CLI: `python -m decisiongraph projection-status <db>`
- Or inspect in process: `DecisionGraph.get_projection_health(include_digests=True)`
- If `is_stale` is `true`, either:
  - call `sync_projections()` to catch up incrementally, or
  - call `replay_projections()` if you suspect projection corruption.
- Treat non-zero `pending_events` as an operational signal that projection-backed queries may be stale until catch-up completes.
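The check-then-catch-up pattern can be wrapped into a guard called before any projection-backed read. A sketch assuming the health object exposes the `is_stale` and `pending_events` fields named above (field shapes may differ across versions):

```python
def ensure_fresh_projections(graph) -> bool:
    """Catch projections up before a projection-backed read.

    Returns True if a catch-up was performed.
    """
    health = graph.get_projection_health(include_digests=False)
    if health.is_stale or health.pending_events > 0:
        # Incremental catch-up; switch to replay_projections() only
        # when projection corruption is suspected.
        graph.sync_projections()
        return True
    return False
```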
BEAM Phase 3 Local Store Workflow¶
For the Elixir event-store phase, the local operating loop is:
- Start Postgres from the repo root: `docker compose up postgres -d`
- Run Elixir store tests: `cd beam`, then `mix test apps/dg_store/test`
- Run the local store benchmark in an isolated test database: `set MIX_ENV=test`, then `mix dg.store.bench --traces 100 --events-per-trace 8 --batch-size 250 --payload-bytes 512`
Phase 3 operator notes:
- the benchmark command is intended for baseline comparison, not for load-testing claims
- `dg_projection_cursors` exists before BEAM projection runtime work so Phase 4 can inherit a stable cursor table
- projection materialization remains a later phase; Phase 3 health is about store correctness, not projector lag