Operations Runbook and Reproducibility

Self-Hosted BEAM Operations

For the current BEAM self-hosted platform shape, start with:

The rest of this page remains the cross-backend recovery and reproducibility summary.

Backup and Restore

SQLite

  • Backup:
      • Stop writers.
      • Copy the database file and the associated -wal/-shm files if WAL is enabled.
  • Restore:
      • Replace the DB files with the backup set.
      • Run python -m decisiongraph replay <db> and verify digests.
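Stopping writers and copying files is the documented path; as a complementary sketch, Python's sqlite3 module exposes SQLite's online backup API, which produces a consistent copy even while the source is in WAL mode. The schema and paths below are illustrative only, not the decisiongraph schema:

```python
import os
import sqlite3
import tempfile

def backup_sqlite(src_path, dest_path):
    """Snapshot a SQLite database via the online backup API (Python 3.7+).

    Unlike a raw file copy of a live database, this yields a consistent
    snapshot even when the source is in WAL mode.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()

# Demo against a throwaway database (paths and schema are illustrative).
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "decisiongraph.db")
dst_path = os.path.join(tmp, "decisiongraph.backup.db")
con = sqlite3.connect(src_path)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute("INSERT INTO events (payload) VALUES ('e1')")
con.commit()

backup_sqlite(src_path, dst_path)
rows = sqlite3.connect(dst_path).execute("SELECT payload FROM events").fetchall()
print(rows)  # [('e1',)]
con.close()
```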

PostgreSQL

  • Backup:
      • Use pg_dump for logical backups or storage snapshots for physical backups.
  • Restore:
      • Restore the dump/snapshot.
      • Re-run projection replay verification and check digest outputs.
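As a hedged sketch of the logical-backup path, the following helper assembles standard pg_dump/pg_restore invocations (-Fc custom-format archive, --clean to drop objects before recreating them). The database name and file names are illustrative; run the resulting command lists with subprocess.run(cmd, check=True):

```python
def pg_dump_cmd(dbname, out_file):
    # -Fc: custom-format archive, which pg_restore can restore selectively
    return ["pg_dump", "-Fc", "-f", out_file, dbname]

def pg_restore_cmd(dbname, dump_file):
    # --clean: drop existing objects before recreating them from the dump
    return ["pg_restore", "--clean", "-d", dbname, dump_file]

cmd = pg_dump_cmd("decisiongraph", "dg.dump")
print(cmd)  # ['pg_dump', '-Fc', '-f', 'dg.dump', 'decisiongraph']
```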

Migration Rollback Strategy

  • Never partially apply migrations outside transaction boundaries.
  • If migration verification fails:
      • Stop writes.
      • Restore the latest known-good backup.
      • Re-run the application with the previous release artifact.
      • Confirm event log and projection digests.
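The "never partially apply" rule can be sketched with SQLite's transactional DDL: the migration statements and a post-migration verification run inside one transaction, so a failed check leaves no trace. The schema and verification callback here are illustrative, not the decisiongraph migration machinery:

```python
import sqlite3

def apply_migration(con, statements, verify):
    """Apply migration statements atomically; roll back if verification fails."""
    cur = con.cursor()
    cur.execute("BEGIN")
    try:
        for stmt in statements:
            cur.execute(stmt)
        if not verify(cur):
            raise RuntimeError("migration verification failed")
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise

# isolation_level=None gives us manual transaction control.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")

# A migration whose verification deliberately fails: nothing is applied.
try:
    apply_migration(
        con,
        ["ALTER TABLE events ADD COLUMN digest TEXT"],
        verify=lambda cur: False,  # simulate a failed post-migration check
    )
except RuntimeError:
    pass

cols = [row[1] for row in con.execute("PRAGMA table_info(events)")]
print(cols)  # ['id']  -- the ALTER TABLE was rolled back
```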

Corruption Recovery

If projection tables are inconsistent:

  1. Keep the event log as the source of truth.
  2. Inspect current lag before taking action:
       python -m decisiongraph projection-status <db>
       python -m decisiongraph projection-status <db> --include-digests
  3. Run a projection rebuild:
       python -m decisiongraph replay <db>
  4. Verify all digest outputs are present and stable across repeated runs.
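"Stable across repeated runs" can be checked mechanically: hash the projection rows in a canonical order and compare consecutive runs. This is an illustrative hashing scheme and table, not the digests the decisiongraph CLI computes:

```python
import hashlib
import sqlite3

def projection_digest(con, table):
    """Hash a projection table in a canonical (sorted) row order so the
    digest is independent of row insertion order."""
    rows = con.execute(f"SELECT * FROM {table}").fetchall()
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE context_graph (node TEXT, weight INTEGER)")
con.executemany("INSERT INTO context_graph VALUES (?, ?)",
                [("a", 1), ("b", 2)])

# Two consecutive runs over the same data must agree.
d1 = projection_digest(con, "context_graph")
d2 = projection_digest(con, "context_graph")
print(d1 == d2)  # True
```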

If event log corruption is detected:

  1. Quarantine the affected DB snapshot.
  2. Restore from backup.
  3. Validate idempotency/traces with integration checks before re-enabling writes.
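The idempotency part of step 3 can be as small as replaying a known event twice and asserting a single stored copy. The unique event-id mechanism below is one illustrative way to get that property, not necessarily how decisiongraph implements it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def append_event(con, event_id, payload):
    """Idempotent append: a duplicate event_id is ignored, not re-inserted."""
    con.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )

append_event(con, "evt-1", "decision recorded")
append_event(con, "evt-1", "decision recorded")  # replayed duplicate

count = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 1
```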

Reproducibility Guide

Use deterministic digests to prove replay equivalence:

  1. Export or snapshot the event log.
  2. Rebuild projections in a clean environment:
       python -m decisiongraph replay <db>
  3. Record digest values for:
       • context_graph
       • trace_summary
       • precedent_index
       • full_projection
  4. Repeat the replay in another environment and compare digests.
  5. Treat a digest mismatch as release-blocking until the root cause is resolved.
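The cross-environment comparison can be scripted; the digest names follow the list above, while the digest values are placeholders. A non-empty mismatch list is the release-blocking signal:

```python
def compare_digests(env_a, env_b):
    """Return the projection names whose digests differ between environments."""
    names = sorted(set(env_a) | set(env_b))
    return [n for n in names if env_a.get(n) != env_b.get(n)]

env_a = {"context_graph": "abc", "trace_summary": "def",
         "precedent_index": "123", "full_projection": "456"}
env_b = dict(env_a, trace_summary="fff")  # one divergent digest

mismatches = compare_digests(env_a, env_b)
print(mismatches)  # ['trace_summary']
```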

Multi-Writer Projection Monitoring

When multiple writers may append events outside the current process, use projection health checks to detect lag before projection-backed reads:

  1. Inspect current state from the CLI:
       python -m decisiongraph projection-status <db>
  2. Or inspect in process:
       DecisionGraph.get_projection_health(include_digests=True)
  3. If is_stale is true, either:
       • call sync_projections() to catch up incrementally, or
       • call replay_projections() if you suspect projection corruption.
  4. Treat non-zero pending_events as an operational signal that projection-backed queries may be stale until catch-up completes.
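The decision logic above can be wrapped in one helper. The method names (get_projection_health, sync_projections, replay_projections) and fields (is_stale, pending_events) come from this page; the dict return shape, the suspect_corruption flag, and the stub standing in for DecisionGraph are assumptions for illustration:

```python
def ensure_fresh(graph, suspect_corruption=False):
    """Bring projections up to date before serving projection-backed reads."""
    health = graph.get_projection_health(include_digests=True)
    if health["is_stale"]:
        if suspect_corruption:
            graph.replay_projections()   # full rebuild from the event log
        else:
            graph.sync_projections()     # incremental catch-up
    return graph.get_projection_health(include_digests=True)

# Minimal stub standing in for DecisionGraph, for illustration only.
class StubGraph:
    def __init__(self, pending):
        self.pending = pending
    def get_projection_health(self, include_digests=False):
        return {"is_stale": self.pending > 0, "pending_events": self.pending}
    def sync_projections(self):
        self.pending = 0
    def replay_projections(self):
        self.pending = 0

g = StubGraph(pending=3)
health = ensure_fresh(g)
print(health)  # {'is_stale': False, 'pending_events': 0}
```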

BEAM Phase 3 Local Store Workflow

For the Elixir event-store phase, the local operating loop is:

  1. Start Postgres from the repo root:
       docker compose up postgres -d
  2. Run the Elixir store tests:
       cd beam
       mix test apps/dg_store/test
  3. Run the local store benchmark in an isolated test database:
       set MIX_ENV=test
       mix dg.store.bench --traces 100 --events-per-trace 8 --batch-size 250 --payload-bytes 512

Phase 3 operator notes:

  • The benchmark command is intended for baseline comparison, not for load-testing claims.
  • dg_projection_cursors exists before the BEAM projection runtime work so Phase 4 can inherit a stable cursor table.
  • Projection materialization remains a later phase; Phase 3 health is about store correctness, not projector lag.