DecisionGraph BEAM Master Plan

Direction

Primary language choice: Elixir

Runtime choice: BEAM / OTP

Position on Gleam:
- Do not use Gleam as the primary implementation language for this project.
- Re-evaluate Gleam only after the Elixir platform is stable, and only for small pure libraries if it gives a clear advantage.

Why this path:
- The current codebase is already strong as a deterministic reference implementation in Python.
- The biggest upside from BEAM is not syntax; it is supervision, fault tolerance, concurrency, background workers, and long-running service architecture.
- The smartest move is to turn DecisionGraph into a serious Elixir platform without throwing away the Python semantics too early.

End Product

The end product is a distributed decision intelligence platform with:

- append-only decision event ingestion
- deterministic replay and projection rebuilding
- real-time projection workers and health monitoring
- precedent search and decision similarity workflows
- human approval and exception flows
- investigation-grade trace and graph exploration
- multi-tenant APIs and operational controls
- live operator UI with streaming updates
- Python SDK and service APIs for AI agents and automation systems
- a BEAM-native runtime that can scale into a serious production system

Core Principles

- Keep the current Python implementation as the semantic reference for the frozen core.
- Move infrastructure concerns to Elixir first: ingestion, jobs, projections, APIs, realtime, supervision.
- Rewrite pure domain semantics into Elixir only when parity evidence and product value justify it.
- Use Postgres as the primary production datastore for the BEAM platform.
- Treat determinism, replay, and auditability as non-negotiable product features.
- Build the project as if it could become the flagship product, not just a library port.

Target Architecture

Recommended repo direction:
- keep the current Python package as reference-core
- add an Elixir umbrella app for the platform runtime

Recommended BEAM app boundaries:
- apps/dg_domain for domain types, validation, event semantics, and contract translation
- apps/dg_store for Postgres event store, migrations, and persistence adapters
- apps/dg_projector for projection workers, cursors, replay, and digest generation
- apps/dg_api for external APIs, auth, ingestion, and query endpoints
- apps/dg_web for Phoenix LiveView operator UI
- apps/dg_observability for telemetry, metrics, tracing, and health surfaces
- apps/dg_sdk_bridge for compatibility helpers and Python-facing integration utilities

Program Structure

Critical path: Phase 0 -> Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 -> Phase 5

Parallelizable later path:
- Phase 6 and Phase 7 can begin after Phase 5 stabilizes
- Phase 8 can begin in parallel with late Phase 6 and Phase 7 work
- Phase 9 is optional and should begin only if earlier phases are successful
- Phase 10 happens after the product is operationally credible

Phase 0 - North Star and Scope

Goal: lock the architectural stance and define the end-state product clearly enough that implementation can proceed without thrashing

Tasks:
- [x] Confirm the official direction: Elixir first, Python as reference, Gleam deferred
- [x] Write a one-page product brief for the final platform
- [x] Define primary user personas: agent builders, platform teams, compliance, operations
- [x] Define the top product differentiators
- [x] Define the v1 platform boundary and the post-v1 wishlist
- [x] Decide repo strategy: monorepo with Python plus Elixir umbrella
- [x] Decide naming conventions for the BEAM apps and public API surfaces
- [x] Decide whether hosted deployment is a goal now or later
- [x] Decide whether the end product should prioritize self-hosted enterprise or cloud-first operation

Phase exit:
- [x] We have a written architecture stance and are not debating language direction anymore
- [x] We know what "impressive end product" means in concrete terms

Phase 1 - Freeze the Semantic Reference

Goal: turn the Python codebase into the official migration oracle

Tasks:
- [x] Freeze the event envelope contract at the reference layer
- [x] Freeze payload shape rules for all current event types
- [x] Freeze trace ordering, idempotency, and replay semantics
- [x] Expand golden fixtures for happy paths, failure cases, and edge conditions
- [x] Add more parity-oriented tests for projections, digests, and query determinism
- [x] Add explicit cross-backend tests for SQLite and Postgres behavior where applicable
- [x] Document the exact invariants that the Elixir implementation must preserve
- [x] Add "reference semantics" docs for append, project, replay, and query behavior
- [x] Add exportable fixture bundles that Elixir tests can consume directly
- [x] Tag a release or internal baseline commit as the semantic reference checkpoint

Phase exit:
- [x] The Python implementation is treated as the truth source for platform parity
- [x] We can prove whether a BEAM port matches or deviates
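To make "prove whether a BEAM port matches or deviates" concrete, here is a minimal sketch of an exportable fixture bundle. It assumes canonical JSON (sorted keys, fixed separators) hashed with SHA-256 as the digest scheme. The function names and bundle fields are hypothetical illustrations, not the reference codebase's actual contract.

```python
import hashlib
import json


def canonical_bytes(value):
    """Serialize to JSON with sorted keys and fixed separators,
    so that equal data always produces identical bytes."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def digest(value):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(value)).hexdigest()


def export_fixture_bundle(name, events, expected_projection):
    """Bundle the input events with the digest of the reference output.

    The bundle is plain JSON-compatible data, so a test suite in any
    language can load it (hypothetical layout)."""
    return {
        "name": name,
        "events": events,
        "expected_digest": digest(expected_projection),
    }


def check_candidate(bundle, candidate_projection):
    """True when a candidate implementation's output matches the reference digest."""
    return digest(candidate_projection) == bundle["expected_digest"]
```

A scheme like this only stays deterministic if both sides agree on how to encode values JSON does not pin down, such as floats and timestamps; those rules would need to be part of the frozen contract.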

Phase 2 - Bootstrap the Elixir Platform

Goal: stand up a production-grade Elixir foundation without rewriting the core logic yet

Tasks:
- [x] Create the Elixir umbrella project in the repo
- [x] Set up OTP application boundaries for domain, store, projector, API, and web
- [x] Configure Elixir formatter, Credo, Dialyzer, and test conventions
- [x] Add property-based testing for concurrency-sensitive platform behavior
- [x] Add local development orchestration for Postgres and any required service dependencies
- [x] Add CI for Elixir compile, lint, type analysis, and tests
- [x] Define config strategy for dev, test, staging, and prod
- [x] Add OpenTelemetry baseline instrumentation
- [x] Add structured logging and request correlation conventions
- [x] Add architectural docs for supervision trees and process ownership

Phase exit:
- [x] The repo can build and test both Python and Elixir surfaces cleanly
- [x] The Elixir side is ready to receive real platform features

Phase 3 - Build the BEAM Event Store

Goal: implement the authoritative BEAM-side write model and persistence layer

Tasks:
- [x] Design the Postgres schema for event log, cursors, projections, and metadata
- [x] Implement append-only event persistence in Elixir
- [x] Implement idempotency handling with the same semantics as the Python reference
- [x] Implement trace sequence monotonicity enforcement
- [x] Implement event listing, filtering, and batch iteration APIs
- [x] Implement migration management for the Elixir platform database
- [x] Add write-path telemetry and failure classification
- [x] Add store-level concurrency tests under contention
- [x] Add parity tests that replay Python fixture envelopes through the Elixir store
- [x] Benchmark append throughput and batch-read throughput

Phase exit:
- [x] The Elixir store can safely accept and read events with reference-level semantics
- [x] The store is production-capable for the next phases
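The write-path rules in this phase (append-only persistence, idempotency, trace sequence monotonicity, batch iteration) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the reference semantics: it assumes a retried append with the same key and identical content returns the original record, the same key with different content is a conflict, and sequence numbers must strictly increase per trace. The field names `idempotency_key`, `trace_id`, and `trace_seq` are hypothetical.

```python
class IdempotencyConflict(Exception):
    """Same idempotency key was reused with different event content."""


class SequenceViolation(Exception):
    """Event's trace_seq is not greater than the last accepted one for its trace."""


class InMemoryEventStore:
    def __init__(self):
        self._log = []        # append-only list of stored records
        self._by_key = {}     # idempotency_key -> stored record
        self._last_seq = {}   # trace_id -> highest accepted trace_seq

    def append(self, event):
        key = event["idempotency_key"]
        existing = self._by_key.get(key)
        if existing is not None:
            if existing["event"] == event:
                return existing  # duplicate delivery: no new row is written
            raise IdempotencyConflict(key)

        trace_id = event["trace_id"]
        last = self._last_seq.get(trace_id, 0)
        if event["trace_seq"] <= last:
            raise SequenceViolation(f"{trace_id}: {event['trace_seq']} <= {last}")

        # Positions are assigned once and never reused or rewritten.
        record = {"position": len(self._log) + 1, "event": dict(event)}
        self._log.append(record)
        self._by_key[key] = record
        self._last_seq[trace_id] = event["trace_seq"]
        return record

    def list_events(self, after_position=0, limit=100, trace_id=None):
        """Return up to `limit` records past `after_position`, optionally for one trace."""
        matches = [
            r for r in self._log
            if r["position"] > after_position
            and (trace_id is None or r["event"]["trace_id"] == trace_id)
        ]
        return matches[:limit]
```

In a Postgres-backed store the same checks would have to hold under concurrent writers, which an in-memory model like this does not exercise; that is what the contention tests in the task list are for.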

Phase 4 - Build the Projection Runtime

Goal: use OTP where it matters most: supervised projection workers, replay coordinators, and projection health management

Tasks:
- [x] Implement projection worker processes supervised by OTP
- [x] Implement cursor tracking and projection lag management
- [x] Implement incremental catch-up processing
- [x] Implement full replay and rebuild flows
- [x] Implement deterministic digest generation in Elixir
- [x] Implement BEAM-side trace summary projection
- [x] Implement BEAM-side context graph projection
- [x] Implement BEAM-side precedent index projection
- [x] Implement retry, backoff, and dead-letter handling for projection failures
- [x] Implement projection-status and replay admin surfaces in Elixir
- [x] Add parity tests comparing Elixir projection outputs to Python golden data
- [x] Add load tests for large replay and catch-up scenarios

Phase exit:
- [x] The Elixir runtime can ingest, project, replay, and report health with strong confidence
- [x] We can compare Elixir projection state against Python reference outputs
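The relationship between cursor tracking, incremental catch-up, full replay, and digests can be sketched in a few lines. The example below is a simplified, single-process illustration, not the OTP worker design: the event log is assumed to be a list of `(position, event)` tuples in position order, and the projection shape and names are hypothetical. The property it demonstrates is that an incrementally maintained projection and a full rebuild should produce the same digest.

```python
import hashlib
import json


def state_digest(state):
    """Digest of projection state using sorted-key JSON (assumed canonical form)."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TraceSummaryProjection:
    """Counts events per trace and tracks the last log position it has applied."""

    def __init__(self):
        self.cursor = 0
        self.state = {}

    def apply(self, position, event):
        summary = self.state.setdefault(
            event["trace_id"], {"event_count": 0, "last_type": None}
        )
        summary["event_count"] += 1
        summary["last_type"] = event["type"]
        self.cursor = position  # advance the cursor only after applying the event

    def catch_up(self, log, batch_size=2):
        """Apply all events past the cursor in batches; return how many were applied."""
        applied = 0
        while True:
            batch = [(p, e) for p, e in log if p > self.cursor][:batch_size]
            if not batch:
                return applied
            for position, event in batch:
                self.apply(position, event)
                applied += 1

    def lag(self, log):
        """Number of log positions the projection is behind the head."""
        head = log[-1][0] if log else 0
        return head - self.cursor


def rebuild(log):
    """Full replay: start from empty state and process the whole log."""
    fresh = TraceSummaryProjection()
    fresh.catch_up(log)
    return fresh
```

Comparing `state_digest(worker.state)` against `state_digest(rebuild(log).state)` is one way a replay console could report whether a live projection has drifted from what the log implies.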

Phase 5 - Build the Service API

Goal: turn DecisionGraph into a real platform service rather than just a library

Tasks:
- [x] Create the Phoenix API application
- [x] Expose event ingestion endpoints
- [x] Expose trace read endpoints
- [x] Expose context graph query endpoints
- [x] Expose precedent search endpoints
- [x] Expose projection health endpoints
- [x] Expose replay and admin control endpoints with strong safeguards
- [x] Add API authentication and service-account support
- [x] Add tenant-aware authorization rules
- [x] Add rate limiting and abuse protection
- [x] Generate OpenAPI docs for the public surface
- [x] Add end-to-end API tests using real Postgres and projector workers

Phase exit:
- [x] The project is now a usable network service, not just a local library
- [x] External systems can write to and query DecisionGraph through stable APIs

Phase 6 - Build the Operator UI

Goal: make the platform feel impressive and investigation-grade

Tasks:
- [x] Create the Phoenix LiveView operator console
- [x] Build a trace explorer with timeline and payload inspection
- [x] Build a context graph visualizer
- [x] Build a precedent browser and comparison view
- [x] Build a projection health dashboard
- [x] Build a replay console with digest comparison output
- [x] Build a policy and exception review interface
- [x] Build a live event stream view for operators
- [x] Build tenant and environment status pages
- [x] Add polished UI states for loading, failure, and stale projections

Phase exit:
- [x] Operators can investigate real traces and system state from one place
- [x] The project looks and behaves like a serious platform product

Phase 7 - Human Approval and Workflow Layer

Goal: move from passive trace storage to active decision operations

Tasks:
- [x] Implement approval queues and reviewer inboxes
- [x] Implement exception request workflows
- [x] Implement escalation rules and SLA timers
- [x] Implement comments and evidence attachments
- [x] Implement manual override flows with full audit capture
- [x] Implement notifications for approvals, failures, and escalation deadlines
- [x] Implement policy simulation and dry-run workflows
- [x] Implement workflow templates for common business processes
- [x] Implement replay plus review flows for incident analysis
- [x] Add audit-focused exports for approvals and overrides

Phase exit:
- [x] DecisionGraph is now an active control plane for decision workflows
- [x] Human-in-the-loop operation is first-class, not bolted on

Phase 8 - Self-Hosted Reliability, Operations, and Distribution

Goal: make the platform operationally credible as a GitHub-downloadable self-hosted system

Tasks:
- [x] Define the first supported self-hosted topology
- [x] Add install and bootstrap guidance for GitHub users
- [x] Define backup, restore, retention, and archival strategy
- [x] Add upgrade and rollback procedures
- [x] Add single-node restart and recovery drills
- [x] Add performance targets and sizing guidance for local hosting
- [x] Add observability guidance for self-hosted operators
- [x] Add packaging and release validation for self-hosted distribution
- [x] Add local-hosting benchmarks and resilience notes
- [x] Document deferred hosted or multi-tenant concerns explicitly

Phase exit:
- [x] The platform is easy to install, run, back up, and recover in a self-hosted setup
- [x] We understand the limits and recovery behavior of the supported self-hosted topology

Phase 9 - Optional BEAM-Native Semantic Core

Goal: decide whether to keep Python as the permanent semantic reference or migrate core semantics fully into Elixir

Tasks:
- [x] Decide whether a full semantic rewrite is worth it after the platform proves itself
- [x] Inventory and document the semantic ownership split between Python and BEAM
- [x] Confirm the existing BEAM pure semantic primitives already required for runtime parity
- [x] Run the full parity harness against Python fixtures and digests
- [x] Refuse to switch authoritative semantics until zero-diff or explicitly accepted diffs are proven
- [x] Keep Python as the permanent semantic reference and local embedded surface for the frozen Phase 1 scope
- [x] Publish bridge, rollback, and governance rules for any future revisit of semantic authority
- [x] Re-evaluate whether Gleam has a role for small pure logic libraries at this stage and keep it deferred

Phase exit:
- [x] We deliberately keep Python as the permanent reference while BEAM owns runtime delivery for the platform
- [x] The decision is based on evidence, not enthusiasm
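The "zero-diff or explicitly accepted diffs" gate can be expressed as a simple comparison over digest maps. The sketch below assumes each side produces a mapping from fixture name to digest string; the function names and report layout are hypothetical and are not the actual parity harness.

```python
def parity_report(reference, candidate, accepted=frozenset()):
    """Compare two {fixture_name: digest} maps.

    A fixture is 'matching' when both digests are equal, 'accepted' when it
    differs but was explicitly signed off, and 'blocking' otherwise. A fixture
    missing from either side counts as a difference."""
    report = {"matching": [], "accepted": [], "blocking": []}
    for name in sorted(set(reference) | set(candidate)):
        if reference.get(name) == candidate.get(name):
            report["matching"].append(name)
        elif name in accepted:
            report["accepted"].append(name)
        else:
            report["blocking"].append(name)
    return report


def may_switch_authority(report):
    """Semantic authority may move only when no unexplained diffs remain."""
    return not report["blocking"]
```

Keeping the accepted set as explicit input means every tolerated deviation has to be named, which fits the goal of basing the decision on evidence.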

Phase 10 - Productization and Launch

Goal: turn the platform into something launchable, memorable, and hard to ignore as a self-hosted GitHub-downloadable product

Tasks:
- [x] Finalize the supported self-hosted deploy story
- [x] Finalize installation and upgrade guides
- [x] Finalize production runbooks
- [x] Finalize versioning and migration policy for the platform APIs
- [x] Finalize operator and developer docs
- [x] Build an impressive demo environment with realistic traces and workflows
- [x] Build benchmark and release-validation showcase materials
- [x] Prepare an early-adopter or dogfood path with realistic internal stand-ins
- [x] Gather feedback and close the last major product gaps
- [x] Cut the first serious platform release

Phase exit:
- [x] The project is not just technically interesting; it is product-grade and tagged
- [x] We can confidently present it as a standout system

Immediate Next Steps

If we start now, this is the recommended order:

1. [x] Approve this master plan as the working direction
2. [x] Open Phase 0 as the active execution phase
3. [x] Create a second execution roadmap for Phase 0 and Phase 1 only
4. [x] Start by hardening the Python semantic reference before writing BEAM production code
5. [x] Add the Elixir umbrella only after the parity target is explicit

What We Should Not Do

- [ ] Do not start with a full Python-to-Elixir rewrite
- [ ] Do not introduce Gleam early just because it looks elegant
- [ ] Do not build a fancy UI before the write model and projection runtime are solid
- [ ] Do not abandon deterministic parity with the current implementation
- [ ] Do not confuse concurrency benefits with permission to loosen correctness

Success Criteria

We should consider this transformation successful when:

- [ ] the Elixir platform can ingest and project events reliably under load
- [ ] replay and digest equivalence remain trustworthy
- [ ] operators can inspect traces, graphs, precedents, and health in real time
- [ ] the system supports human approvals and decision workflows cleanly
- [ ] the product feels meaningfully more ambitious than a Python library
- [ ] the architecture is strong enough that adding new decision products becomes easier, not harder