Observability Dashboards¶
Purpose¶
This guide explains which DecisionGraph health surfaces a self-hosted operator should actually look at.
The supported topology does not need a sprawling monitoring stack. It needs a small number of trustworthy views.
Primary Health Views¶
Deployment Health¶
Use:
/api/healthz
This is the top-level deployment view. It combines:
- store deployment snapshot
- projector runtime snapshot
- projection health for the default tenant
- deployment environment metadata
Use it first when answering:
- is the node up
- is the repo connected to Postgres
- is the projector obviously unhealthy
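A lightweight way to consume this endpoint is to reduce its payload to a one-line verdict before alerting on it. The sketch below assumes the response is a JSON object whose sections each carry a status field; the section and field names here ("store", "projector", "status") are illustrative assumptions, not a documented schema, so adjust them to your deployment's actual response.

```python
def summarize_health(payload: dict) -> str:
    """Reduce a /api/healthz payload to a one-line verdict.

    Assumes each top-level section is a dict with a "status" field;
    these names are illustrative, not a documented schema.
    """
    problems = [
        name
        for name, section in payload.items()
        if isinstance(section, dict) and section.get("status") not in ("ok", "healthy")
    ]
    return "healthy" if not problems else "degraded: " + ", ".join(sorted(problems))

# Live usage (assumes the node listens on localhost:4000):
#   import json
#   from urllib.request import urlopen
#   print(summarize_health(json.load(urlopen("http://localhost:4000/api/healthz"))))
```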
Projection Health¶
Use:
GET /api/v1/projections/health
or the operator console projection health cards.
This is the most important operational view after node liveness. It answers:
- what is the event-log tail
- which projections are stale
- whether failures are open
- whether replay jobs are still active
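The staleness question above reduces to comparing each projection's position against the event-log tail. This is a sketch under an assumed payload shape (the field names `tail`, `projections`, `name`, `position`, and `open_failures` are hypothetical); the real endpoint's schema may differ.

```python
def stale_projections(health: dict, max_lag: int = 0) -> list[str]:
    """Return names of projections that lag the event-log tail or have
    open failures.

    Assumed (hypothetical) payload shape:
      {"tail": 120,
       "projections": [{"name": "...", "position": 118, "open_failures": 0}]}
    """
    tail = health["tail"]
    return [
        p["name"]
        for p in health["projections"]
        if tail - p["position"] > max_lag or p.get("open_failures", 0) > 0
    ]
```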
Operator Console¶
Use:
/
The console is the best single human-facing surface for:
- projection health
- recent traces
- workflow backlog
- replay operations
- environment status
Runtime Snapshots¶
Direct BEAM runtime helpers remain useful during deeper investigation:
DecisionGraph.Store.deployment_snapshot/0
DecisionGraph.Projector.runtime_snapshot/0
DecisionGraph.Projector.projection_health/1
Log Signals¶
Current log metadata already includes:
request_id, trace_id, tenant_id, projection, worker, account_id, api_action, job_id, workflow_id
For self-hosted operations, these fields are enough to correlate:
- request failures
- replay actions
- projector errors
- workflow incidents
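Correlation in practice means grouping log lines on one of these fields. The sketch below assumes structured JSON logs with one object per line, which is a common layout but not guaranteed for every log sink; swap in your own parser if the format differs.

```python
import json
from collections import defaultdict

def correlate_by(lines: list[str], key: str = "request_id") -> dict[str, list[dict]]:
    """Group structured log lines by a correlation field such as
    request_id, trace_id, or workflow_id.

    Assumes one JSON object per line; non-JSON lines (startup banners,
    plain-text noise) are skipped rather than raising.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue
        if isinstance(record, dict) and key in record:
            groups[record[key]].append(record)
    return dict(groups)
```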
Optional OTEL Collector¶
The supported topology keeps OTEL optional.
If you start the repo's otel-collector service, you get a cleaner path for forwarding telemetry, but the install remains fully supported without it.
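If you do run a collector, the core of its configuration is just an OTLP receiver wired into a pipeline. The fragment below is generic OpenTelemetry Collector config for local inspection, not the repo's shipped file; the actual service may use different receivers and exporters.

```yaml
# Illustrative OpenTelemetry Collector config, not the repo's shipped file:
# accept OTLP over gRPC/HTTP and print received telemetry for inspection.
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```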
Minimal Dashboard Layout¶
If you build a lightweight local dashboard or external monitor, start with:
- node up or down via /api/healthz
- projection stale count
- projection open failure count
- active replay job count
- workflow open count and escalated count
That is enough to answer whether the node is safe to trust.
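The five signals above can be folded into a single go/no-go check. The key names here are illustrative labels for those signals, not an API; treating an active replay as a reason for caution is a deliberately conservative choice in this sketch.

```python
def node_trustworthy(signals: dict) -> bool:
    """Evaluate the minimal dashboard signals: the node is considered
    safe to trust only when it is up and every backlog counter is zero.

    Keys are illustrative names for the five suggested signals, not an
    API: node_up, stale_projections, open_failures, active_replays,
    escalated_workflows.
    """
    return (
        signals["node_up"]
        and signals["stale_projections"] == 0
        and signals["open_failures"] == 0
        and signals["active_replays"] == 0
        and signals["escalated_workflows"] == 0
    )
```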
Investigation Order¶
When something looks wrong:
- check /api/healthz
- check projection health
- check recent logs with request and projection metadata
- inspect replay status if lag or failures are present
- inspect workflow backlog only after runtime health is understood
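The ordering above can be encoded so an operator script stops at the first failing layer instead of jumping straight to the workflow backlog. This is a sketch: the step names are illustrative, and each check is supplied by the caller as a callable returning True when that layer looks healthy.

```python
def triage_order(checks: dict) -> str:
    """Walk the investigation order and report the first failing layer,
    so deeper layers are only inspected once runtime health is understood.

    `checks` maps illustrative step names to zero-argument callables;
    missing steps are treated as passing.
    """
    order = ["healthz", "projection_health", "logs", "replay", "workflow_backlog"]
    for step in order:
        if not checks.get(step, lambda: True)():
            return step
    return "all clear"
```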