DecisionGraph BEAM Master Plan

Direction

Primary language choice: Elixir

Runtime choice: BEAM / OTP

Position on Gleam:

  • Do not use Gleam as the primary implementation language for this project.
  • Re-evaluate Gleam only after the Elixir platform is stable, and only for small pure libraries where it gives a clear advantage.

Why this path:

  • The current codebase is already strong as a deterministic reference implementation in Python.
  • The biggest upside of BEAM is not syntax; it is supervision, fault tolerance, concurrency, background workers, and long-running service architecture.
  • The smartest move is to turn DecisionGraph into a serious Elixir platform without throwing away the Python semantics too early.

End Product

The end product is a distributed decision intelligence platform with:

  • append-only decision event ingestion
  • deterministic replay and projection rebuilding
  • real-time projection workers and health monitoring
  • precedent search and decision similarity workflows
  • human approval and exception flows
  • investigation-grade trace and graph exploration
  • multi-tenant APIs and operational controls
  • live operator UI with streaming updates
  • Python SDK and service APIs for AI agents and automation systems
  • a BEAM-native runtime that can scale into a serious production system

Core Principles

  • Keep the current Python implementation as the semantic reference for the frozen core.
  • Move infrastructure concerns to Elixir first: ingestion, jobs, projections, APIs, realtime, supervision.
  • Rewrite pure domain semantics into Elixir only when parity evidence and product value justify it.
  • Use Postgres as the primary production datastore for the BEAM platform.
  • Treat determinism, replay, and auditability as non-negotiable product features.
  • Build the project as if it could become the flagship product, not just a library port.

Target Architecture

Recommended repo direction:

  • keep the current Python package as reference-core
  • add an Elixir umbrella app for the platform runtime

Recommended BEAM app boundaries:

  • apps/dg_domain for domain types, validation, event semantics, and contract translation
  • apps/dg_store for the Postgres event store, migrations, and persistence adapters
  • apps/dg_projector for projection workers, cursors, replay, and digest generation
  • apps/dg_api for external APIs, auth, ingestion, and query endpoints
  • apps/dg_web for the Phoenix LiveView operator UI
  • apps/dg_observability for telemetry, metrics, tracing, and health surfaces
  • apps/dg_sdk_bridge for compatibility helpers and Python-facing integration utilities

Program Structure

Critical path: Phase 0 -> Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 -> Phase 5

Parallelizable later path:

  • Phase 6 and Phase 7 can begin after Phase 5 stabilizes
  • Phase 8 can begin in parallel with late Phase 6 and Phase 7 work
  • Phase 9 is optional and should begin only if earlier phases are successful
  • Phase 10 happens after the product is operationally credible

Phase 0 - North Star and Scope

Goal: lock the architectural stance and define the end-state product clearly enough that implementation can proceed without thrashing

Tasks:

  • [x] Confirm the official direction: Elixir first, Python as reference, Gleam deferred
  • [x] Write a one-page product brief for the final platform
  • [x] Define primary user personas: agent builders, platform teams, compliance, operations
  • [x] Define the top product differentiators
  • [x] Define the v1 platform boundary and the post-v1 wishlist
  • [x] Decide repo strategy: monorepo with Python plus Elixir umbrella
  • [x] Decide naming conventions for the BEAM apps and public API surfaces
  • [x] Decide whether hosted deployment is a goal now or later
  • [x] Decide whether the end product should prioritize self-hosted enterprise or cloud-first operation

Phase exit:

  • [x] We have a written architecture stance and are no longer debating language direction
  • [x] We know what "impressive end product" means in concrete terms

Phase 1 - Freeze the Semantic Reference

Goal: turn the Python codebase into the official migration oracle

Tasks:

  • [x] Freeze the event envelope contract at the reference layer
  • [x] Freeze payload shape rules for all current event types
  • [x] Freeze trace ordering, idempotency, and replay semantics
  • [x] Expand golden fixtures for happy paths, failure cases, and edge conditions
  • [x] Add more parity-oriented tests for projections, digests, and query determinism
  • [x] Add explicit cross-backend tests for SQLite and Postgres behavior where applicable
  • [x] Document the exact invariants that the Elixir implementation must preserve
  • [x] Add "reference semantics" docs for append, project, replay, and query behavior
  • [x] Add exportable fixture bundles that Elixir tests can consume directly
  • [x] Tag a release or internal baseline commit as the semantic reference checkpoint
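
The frozen envelope contract and idempotency rule can be illustrated with a minimal reference-side sketch. The field names (`trace_id`, `seq`, `event_type`, `payload`) and the hash-based idempotency key are illustrative assumptions, not the project's actual frozen contract.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope shape -- fields are illustrative only."""
    trace_id: str
    seq: int            # monotonically increasing per trace
    event_type: str
    payload: dict

    def idempotency_key(self) -> str:
        # One plausible rule: hash the canonical JSON form (sorted keys,
        # fixed separators) so the same logical event maps to one key.
        canonical = json.dumps(
            {"trace_id": self.trace_id, "seq": self.seq,
             "event_type": self.event_type, "payload": self.payload},
            sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

e1 = Envelope("t-1", 1, "decision.proposed", {"actor": "agent-a"})
e2 = Envelope("t-1", 1, "decision.proposed", {"actor": "agent-a"})
assert e1.idempotency_key() == e2.idempotency_key()
```

Whatever the real rule is, freezing it at the reference layer means the Elixir store must reproduce the same keys byte-for-byte from the exported fixture bundles.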

Phase exit:

  • [x] The Python implementation is treated as the truth source for platform parity
  • [x] We can prove whether a BEAM port matches or deviates

Phase 2 - Bootstrap the Elixir Platform

Goal: stand up a production-grade Elixir foundation without rewriting the core logic yet

Tasks:

  • [x] Create the Elixir umbrella project in the repo
  • [x] Set up OTP application boundaries for domain, store, projector, API, and web
  • [x] Configure Elixir formatter, Credo, Dialyzer, and test conventions
  • [x] Add property-based testing for concurrency-sensitive platform behavior
  • [x] Add local development orchestration for Postgres and any required service dependencies
  • [x] Add CI for Elixir compile, lint, type analysis, and tests
  • [x] Define config strategy for dev, test, staging, and prod
  • [x] Add OpenTelemetry baseline instrumentation
  • [x] Add structured logging and request correlation conventions
  • [x] Add architectural docs for supervision trees and process ownership
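
The Elixir-side property tests would likely use a library such as StreamData, but the shape of the key property is easy to sketch on the reference side with the standard library alone. The event tuple shape and projection fold below are invented for illustration.

```python
import random

def apply_events(state, events):
    """Fold (event_id, key, value) events into a projection dict,
    skipping already-applied event ids so re-application is a no-op."""
    state = dict(state)
    seen = set(state.get("_seen", frozenset()))
    for eid, key, value in events:
        if eid in seen:
            continue
        seen.add(eid)
        state[key] = value
    state["_seen"] = frozenset(seen)
    return state

# Property: replaying any batch over its own result changes nothing.
rng = random.Random(42)
for _ in range(100):
    events = [(i, f"k{rng.randrange(5)}", rng.randrange(100))
              for i in range(rng.randrange(1, 20))]
    once = apply_events({}, events)
    assert apply_events(once, events) == once
```

Randomized batches are a cheap stand-in for a real generator-driven property suite, but the invariant being exercised (idempotent re-application) is the same one the concurrency-sensitive Elixir tests must hold.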

Phase exit:

  • [x] The repo can build and test both Python and Elixir surfaces cleanly
  • [x] The Elixir side is ready to receive real platform features

Phase 3 - Build the BEAM Event Store

Goal: implement the authoritative BEAM-side write model and persistence layer

Tasks:

  • [x] Design the Postgres schema for event log, cursors, projections, and metadata
  • [x] Implement append-only event persistence in Elixir
  • [x] Implement idempotency handling with the same semantics as the Python reference
  • [x] Implement trace sequence monotonicity enforcement
  • [x] Implement event listing, filtering, and batch iteration APIs
  • [x] Implement migration management for the Elixir platform database
  • [x] Add write-path telemetry and failure classification
  • [x] Add store-level concurrency tests under contention
  • [x] Add parity tests that replay Python fixture envelopes through the Elixir store
  • [x] Benchmark append throughput and batch-read throughput
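
The write-path semantics the Elixir store must preserve can be sketched as a reference-style in-memory model. The method names, return values, and the simple seq-based duplicate rule are assumptions for illustration, not the actual store API.

```python
class SequenceError(Exception):
    """Raised when an append would leave a gap in a trace's sequence."""

class InMemoryEventStore:
    """Sketch of reference semantics: an append-only log with per-trace
    sequence monotonicity and idempotent handling of re-sent events."""
    def __init__(self):
        self.log = []          # append-only list of (trace_id, seq, payload)
        self.last_seq = {}     # trace_id -> highest seq accepted

    def append(self, trace_id, seq, payload):
        last = self.last_seq.get(trace_id, 0)
        if seq <= last:
            # Idempotent no-op; a real store would also verify the
            # duplicate carries an identical payload before accepting it.
            return "duplicate"
        if seq != last + 1:
            raise SequenceError(f"expected seq {last + 1}, got {seq}")
        self.log.append((trace_id, seq, payload))
        self.last_seq[trace_id] = seq
        return "appended"

store = InMemoryEventStore()
assert store.append("t-1", 1, {"a": 1}) == "appended"
assert store.append("t-1", 1, {"a": 1}) == "duplicate"
assert store.append("t-1", 2, {"b": 2}) == "appended"
```

The Postgres implementation would enforce the same invariants transactionally (for example with a unique constraint on `(trace_id, seq)`), and the parity tests replay Python fixture envelopes to confirm both stores classify appends identically.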

Phase exit:

  • [x] The Elixir store can safely accept and read events with reference-level semantics
  • [x] The store is production-capable for the next phases

Phase 4 - Build the Projection Runtime

Goal: use OTP where it matters most: supervised projection workers, replay coordinators, and projection health management

Tasks:

  • [x] Implement projection worker processes supervised by OTP
  • [x] Implement cursor tracking and projection lag management
  • [x] Implement incremental catch-up processing
  • [x] Implement full replay and rebuild flows
  • [x] Implement deterministic digest generation in Elixir
  • [x] Implement BEAM-side trace summary projection
  • [x] Implement BEAM-side context graph projection
  • [x] Implement BEAM-side precedent index projection
  • [x] Implement retry, backoff, and dead-letter handling for projection failures
  • [x] Implement projection-status and replay admin surfaces in Elixir
  • [x] Add parity tests comparing Elixir projection outputs to Python golden data
  • [x] Add load tests for large replay and catch-up scenarios
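
Deterministic digest generation usually reduces to canonical serialization plus a stable hash. This sketch assumes canonical JSON and SHA-256, which may differ from the project's actual digest rules; the point is that identical projection state must always yield the identical digest, regardless of key insertion order.

```python
import hashlib
import json

def projection_digest(projection: dict) -> str:
    """Hash the canonical JSON form of a projection (sorted keys, fixed
    separators) so logically equal states produce byte-equal digests."""
    canonical = json.dumps(projection, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same logical state, different construction order -> same digest.
a = {"traces": 3, "by_type": {"approved": 2, "rejected": 1}}
b = {"by_type": {"rejected": 1, "approved": 2}, "traces": 3}
assert projection_digest(a) == projection_digest(b)
```

With a rule like this on both sides, the parity tests can compare a single digest per projection instead of diffing full state, and the replay console can report digest equality after a rebuild.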

Phase exit:

  • [x] The Elixir runtime can ingest, project, replay, and report health with strong confidence
  • [x] We can compare Elixir projection state against Python reference outputs

Phase 5 - Build the Service API

Goal: turn DecisionGraph into a real platform service rather than just a library

Tasks:

  • [x] Create the Phoenix API application
  • [x] Expose event ingestion endpoints
  • [x] Expose trace read endpoints
  • [x] Expose context graph query endpoints
  • [x] Expose precedent search endpoints
  • [x] Expose projection health endpoints
  • [x] Expose replay and admin control endpoints with strong safeguards
  • [x] Add API authentication and service-account support
  • [x] Add tenant-aware authorization rules
  • [x] Add rate limiting and abuse protection
  • [x] Generate OpenAPI docs for the public surface
  • [x] Add end-to-end API tests using real Postgres and projector workers

Phase exit:

  • [x] The project is now a usable network service, not just a local library
  • [x] External systems can write to and query DecisionGraph through stable APIs

Phase 6 - Build the Operator UI

Goal: make the platform feel impressive and investigation-grade

Tasks:

  • [x] Create the Phoenix LiveView operator console
  • [x] Build a trace explorer with timeline and payload inspection
  • [x] Build a context graph visualizer
  • [x] Build a precedent browser and comparison view
  • [x] Build a projection health dashboard
  • [x] Build a replay console with digest comparison output
  • [x] Build a policy and exception review interface
  • [x] Build a live event stream view for operators
  • [x] Build tenant and environment status pages
  • [x] Add polished UI states for loading, failure, and stale projections

Phase exit:

  • [x] Operators can investigate real traces and system state from one place
  • [x] The project looks and behaves like a serious platform product

Phase 7 - Human Approval and Workflow Layer

Goal: move from passive trace storage to active decision operations

Tasks:

  • [x] Implement approval queues and reviewer inboxes
  • [x] Implement exception request workflows
  • [x] Implement escalation rules and SLA timers
  • [x] Implement comments and evidence attachments
  • [x] Implement manual override flows with full audit capture
  • [x] Implement notifications for approvals, failures, and escalation deadlines
  • [x] Implement policy simulation and dry-run workflows
  • [x] Implement workflow templates for common business processes
  • [x] Implement replay plus review flows for incident analysis
  • [x] Add audit-focused exports for approvals and overrides

Phase exit:

  • [x] DecisionGraph is now an active control plane for decision workflows
  • [x] Human-in-the-loop operation is first-class, not bolted on

Phase 8 - Self-Hosted Reliability, Operations, and Distribution

Goal: make the platform operationally credible as a GitHub-downloadable self-hosted system

Tasks:

  • [x] Define the first supported self-hosted topology
  • [x] Add install and bootstrap guidance for GitHub users
  • [x] Define backup, restore, retention, and archival strategy
  • [x] Add upgrade and rollback procedures
  • [x] Add single-node restart and recovery drills
  • [x] Add performance targets and sizing guidance for local hosting
  • [x] Add observability guidance for self-hosted operators
  • [x] Add packaging and release validation for self-hosted distribution
  • [x] Add local-hosting benchmarks and resilience notes
  • [x] Document deferred hosted or multi-tenant concerns explicitly

Phase exit:

  • [x] The platform is easy to install, run, back up, and recover in a self-hosted setup
  • [x] We understand the limits and recovery behavior of the supported self-hosted topology

Phase 9 - Optional BEAM-Native Semantic Core

Goal: decide whether to keep Python as the permanent semantic reference or migrate core semantics fully into Elixir

Tasks:

  • [x] Decide whether a full semantic rewrite is worth it after the platform proves itself
  • [x] Inventory and document the semantic ownership split between Python and BEAM
  • [x] Confirm the existing BEAM pure semantic primitives already required for runtime parity
  • [x] Run the full parity harness against Python fixtures and digests
  • [x] Refuse to switch authoritative semantics until zero-diff or explicitly accepted diffs are proven
  • [x] Keep Python as the permanent semantic reference and local embedded surface for the frozen Phase 1 scope
  • [x] Publish bridge, rollback, and governance rules for any future revisit of semantic authority
  • [x] Re-evaluate whether Gleam has a role for small pure logic libraries at this stage, and keep it deferred
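
The zero-diff-or-explicitly-accepted-diffs rule can be sketched as a small comparison harness over per-fixture digests. The fixture names, report fields, and function signature here are hypothetical.

```python
def parity_report(reference: dict, candidate: dict, accepted: set) -> dict:
    """Compare per-fixture digests between the Python reference and the
    Elixir candidate; pass only when every diff is explicitly accepted."""
    diffs = {name for name in reference
             if candidate.get(name) != reference[name]}
    unexplained = diffs - accepted
    return {
        "diffs": sorted(diffs),
        "unexplained": sorted(unexplained),
        "pass": not unexplained,
    }

ref = {"fixture_a": "abc", "fixture_b": "def"}
cand = {"fixture_a": "abc", "fixture_b": "xyz"}
report = parity_report(ref, cand, accepted={"fixture_b"})
assert report["pass"] and report["diffs"] == ["fixture_b"]
```

A harness of this shape makes the Phase 9 decision mechanical: authoritative semantics cannot move to BEAM while the `unexplained` list is non-empty.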

Phase exit:

  • [x] We deliberately keep Python as the permanent reference while BEAM owns runtime delivery for the platform
  • [x] The decision is based on evidence, not enthusiasm

Phase 10 - Productization and Launch

Goal: turn the platform into something launchable, memorable, and hard to ignore as a self-hosted, GitHub-downloadable product

Tasks:

  • [x] Finalize the supported self-hosted deploy story
  • [x] Finalize installation and upgrade guides
  • [x] Finalize production runbooks
  • [x] Finalize versioning and migration policy for the platform APIs
  • [x] Finalize operator and developer docs
  • [x] Build an impressive demo environment with realistic traces and workflows
  • [x] Build benchmark and release-validation showcase materials
  • [x] Prepare an early-adopter or dogfood path with realistic internal stand-ins
  • [x] Gather feedback and close the last major product gaps
  • [x] Cut the first serious platform release

Phase exit:

  • [x] The project is not just technically interesting; it is product-grade and tagged
  • [x] We can confidently present it as a standout system

Immediate Next Steps

If we start now, this is the recommended order:

  • [x] Approve this master plan as the working direction
  • [x] Open Phase 0 as the active execution phase
  • [x] Create a second execution roadmap for Phase 0 and Phase 1 only
  • [x] Start by hardening the Python semantic reference before writing BEAM production code
  • [x] Add the Elixir umbrella only after the parity target is explicit

What We Should Not Do

  • [ ] Do not start with a full Python-to-Elixir rewrite
  • [ ] Do not introduce Gleam early just because it looks elegant
  • [ ] Do not build a fancy UI before the write model and projection runtime are solid
  • [ ] Do not abandon deterministic parity with the current implementation
  • [ ] Do not confuse concurrency benefits with permission to loosen correctness

Success Criteria

We should consider this transformation successful when:

  • [ ] the Elixir platform can ingest and project events reliably under load
  • [ ] replay and digest equivalence remain trustworthy
  • [ ] operators can inspect traces, graphs, precedents, and health in real time
  • [ ] the system supports human approvals and decision workflows cleanly
  • [ ] the product feels meaningfully more ambitious than a Python library
  • [ ] the architecture is strong enough that adding new decision products becomes easier, not harder