Operations Runbook and Reproducibility¶
Self-Hosted BEAM Operations¶
For the current BEAM self-hosted platform shape, start with:
- Self-Hosted Install
- Backup and Restore
- Upgrade and Rollback
- Disaster Recovery
- Restart and Recovery Checklist
- SLOs and Alerting
- Observability Dashboards
- Self-Hosted Release Checklist
- Phase 10 Release Validation
- First Release Limitations
- Early Adopter Feedback
- Post Release Review
- API Runtime
- Projection Runtime
- Workflow Runtime
The rest of this page remains the cross-backend recovery and reproducibility summary.
Backup and Restore¶
SQLite¶
- Backup:
  - Stop writers.
  - Copy the database file and the associated `-wal`/`-shm` files if WAL is enabled.
- Restore:
  - Replace the DB files with the backup set.
  - Run `python -m decisiongraph replay <db>` and verify digests.
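If briefly stopping writers is impractical, the stdlib `sqlite3` backup API can take a consistent snapshot of a live database, WAL included. A minimal sketch (the `backup_sqlite` helper name is ours, not part of decisiongraph):

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Snapshot a live SQLite database into dest_path."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies a consistent view even in WAL mode,
        # so the -wal/-shm sidecar files need no separate handling.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Restore then follows the normal flow: replace the DB files with the snapshot and replay to verify digests.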
PostgreSQL¶
- Backup:
  - Use `pg_dump` for logical backups or storage snapshots for physical backups.
- Restore:
  - Restore dump/snapshot.
  - Re-run projection replay verification and check digest outputs.
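The logical-backup step can be scripted. A hedged sketch that shells out to `pg_dump` with a custom-format archive (the `pg_logical_backup` wrapper and its defaults are our assumptions, not project tooling):

```python
import subprocess

def pg_logical_backup(dbname: str, out_path: str, run=subprocess.run):
    """Write a custom-format pg_dump archive that pg_restore can load later."""
    cmd = [
        "pg_dump",
        "--format=custom",   # compressed archive, restorable with pg_restore
        f"--file={out_path}",
        dbname,
    ]
    return run(cmd, check=True)
```

`run` is injectable so the command construction can be exercised without a live server; in production, call it with the default.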
Migration Rollback Strategy¶
- Never partially apply migrations outside transaction boundaries.
- If migration verification fails:
- Stop writes.
- Restore latest known-good backup.
- Re-run application with previous release artifact.
- Confirm event log and projection digests.
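The first rule above can be enforced in code. A sketch against SQLite, which supports transactional DDL (the `apply_migration` helper is illustrative, not the project's migration runner):

```python
import sqlite3

def apply_migration(db_path: str, statements: list[str]) -> None:
    """Apply every statement atomically: all of them land, or none do."""
    # isolation_level=None disables implicit transactions, so the explicit
    # BEGIN/COMMIT below are the only transaction boundaries.
    con = sqlite3.connect(db_path, isolation_level=None)
    try:
        con.execute("BEGIN")
        for sql in statements:
            con.execute(sql)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```

If any statement fails mid-migration, the rollback leaves the schema exactly as it was, so the restore-from-backup path is only needed when verification fails after commit.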
Corruption Recovery¶
If projection tables are inconsistent:
- Keep event log as source of truth.
- Inspect current lag before taking action:
  - `python -m decisiongraph projection-status <db>`
  - `python -m decisiongraph projection-status <db> --include-digests`
- Run projection rebuild: `python -m decisiongraph replay <db>`
- Verify all digest outputs are present and stable across repeated runs.
If event log corruption is detected:
- Quarantine the affected DB snapshot.
- Restore from backup.
- Validate idempotency/traces with integration checks before re-enabling writes.
Reproducibility Guide¶
Use deterministic digests to prove replay equivalence:
- Export or snapshot the event log.
- Rebuild projections in a clean environment: `python -m decisiongraph replay <db>`
- Record digest values: `context_graph`, `trace_summary`, `precedent_index`, `full_projection`.
- Repeat replay in another environment and compare digests.
- Treat digest mismatch as release-blocking until root cause is resolved.
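Comparing the recorded digests between two environments can be a small helper; a sketch using the digest names listed above (the `diff_digests` function is ours):

```python
def diff_digests(run_a: dict, run_b: dict) -> list[str]:
    """Names whose digest differs, or is missing, between two replay runs."""
    keys = set(run_a) | set(run_b)
    return sorted(name for name in keys if run_a.get(name) != run_b.get(name))
```

An empty result means replay equivalence held; any non-empty result is release-blocking per the rule above.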
Multi-Writer Projection Monitoring¶
When multiple writers may append events outside the current process, use projection health checks to detect lag before projection-backed reads:
- Inspect current state from the CLI: `python -m decisiongraph projection-status <db>`
- Or inspect in process: `DecisionGraph.get_projection_health(include_digests=True)`
- If `is_stale` is `true`, either:
  - call `sync_projections()` to catch up incrementally, or
  - call `replay_projections()` if you suspect projection corruption.
- Treat non-zero `pending_events` as an operational signal that projection-backed queries may be stale until catch-up completes.
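The check-then-catch-up pattern can be wrapped into a guard called before any projection-backed read. A sketch assuming the health object exposes the `is_stale` and `pending_events` fields named above (field shapes may differ across versions):

```python
def ensure_fresh_projections(graph) -> bool:
    """Catch projections up before a projection-backed read.

    Returns True if a catch-up was performed.
    """
    health = graph.get_projection_health(include_digests=False)
    if health.is_stale or health.pending_events > 0:
        # Incremental catch-up; switch to replay_projections() only
        # when projection corruption is suspected.
        graph.sync_projections()
        return True
    return False
```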
BEAM Phase 3 Local Store Workflow¶
For the Elixir event-store phase, the local operating loop is:
- Start Postgres from the repo root: `docker compose up postgres -d`
- Run Elixir store tests: `cd beam`, then `mix test apps/dg_store/test`
- Run the local store benchmark in an isolated test database: `set MIX_ENV=test`, then `mix dg.store.bench --traces 100 --events-per-trace 8 --batch-size 250 --payload-bytes 512`
Phase 3 operator notes:
- the benchmark command is intended for baseline comparison, not for load-testing claims
- `dg_projection_cursors` exists before BEAM projection runtime work so Phase 4 can inherit a stable cursor table
- projection materialization remains a later phase; Phase 3 health is about store correctness, not projector lag