# SLOs and Alerting
## Purpose
This document defines the first practical health targets for the supported self-hosted topology.
These are self-hosted operator targets, not cloud-SaaS promises.
## What Matters Most
For the supported topology, the most important questions are:

- Is the node up?
- Is Postgres reachable?
- Are projections current?
- Can operators read traces and workflows?
- Can admins still request replay when needed?
## Target Signals

### Availability
Target:

- `/api/healthz` should be reachable and return `200` during normal operation

Operator expectation:

- any sustained `5xx` response or connection failure on `/api/healthz` is page-worthy, even in a single-node install
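The "sustained failure" expectation can be made concrete with a small paging decision. This is a minimal sketch, assuming probes are recorded as booleans (`True` = `200 OK`); the function name and the three-consecutive-failure threshold are illustrative, not part of the product.

```python
# Sketch: decide whether repeated /api/healthz failures are page-worthy.
# Probe results are booleans (True = 200 OK). The 3-consecutive-failure
# threshold is an assumed starting point, not a documented default.

def should_page(recent_probes, max_consecutive_failures=3):
    """Return True when the trailing run of failed probes reaches the threshold."""
    streak = 0
    for ok in reversed(recent_probes):
        if ok:
            break
        streak += 1
    return streak >= max_consecutive_failures

# A single transient failure is tolerated; a sustained outage pages.
print(should_page([True, True, False]))          # False
print(should_page([True, False, False, False]))  # True
```

Counting only the trailing run (rather than total failures) keeps a flapping probe from paging on old, already-recovered incidents.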
### Projection Freshness
Target:

- normal steady-state operation should keep `pending_events = 0` or near-zero for active tenants
- stale projections should self-correct quickly after brief write bursts
Operator watch threshold:

- any projection that remains stale for more than a few poll cycles deserves investigation
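The "stale for more than a few poll cycles" rule can be sketched as a check over a short poll history. The payload shape here (a `pending_events` count per projection name) is an assumption for illustration, not the documented projection-health schema.

```python
# Sketch: flag projections whose pending_events stayed non-zero for the
# last N polls in a row. The per-projection dict shape is assumed.

def stale_projections(poll_history, max_stale_cycles=3):
    """Return names of projections stale for the last max_stale_cycles polls."""
    if len(poll_history) < max_stale_cycles:
        return []  # not enough history to call anything sustained
    recent = poll_history[-max_stale_cycles:]
    names = recent[-1].keys()
    return [
        name for name in names
        if all(poll.get(name, 0) > 0 for poll in recent)
    ]

history = [
    {"trace_index": 0, "workflow_inbox": 2},
    {"trace_index": 0, "workflow_inbox": 2},
    {"trace_index": 0, "workflow_inbox": 1},
]
print(stale_projections(history))  # ['workflow_inbox']
```

A projection that dips to zero even once inside the window is treated as self-correcting, which matches the "brief write bursts" allowance above.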
### Warm API Latency
Using the current Phase 8 local-hosting capture as a baseline:
- `GET /api/v1/traces/:trace_id` should remain comfortably sub-100 ms p95 on the supported warm local profile
- `GET /api/v1/projections/health` should remain comfortably sub-150 ms p95 on the same profile
- `POST /api/v1/events` should remain comfortably sub-50 ms p95 on the same profile
- `POST /api/v1/admin/replays` acceptance should remain comfortably sub-50 ms p95 on the same profile
These are intentionally conservative compared to the current measured numbers.
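To spot-check these budgets from captured timings, a nearest-rank p95 over raw samples is enough. The endpoint names mirror the targets above; the sample latencies are made up for illustration.

```python
# Sketch: compare measured latency samples (milliseconds) against the
# conservative p95 targets above. Sample data is illustrative only.
import math

P95_TARGETS_MS = {
    "GET /api/v1/traces/:trace_id": 100,
    "GET /api/v1/projections/health": 150,
    "POST /api/v1/events": 50,
    "POST /api/v1/admin/replays": 50,
}

def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def over_budget(endpoint, samples_ms):
    """True when the observed p95 exceeds the endpoint's target."""
    return p95(samples_ms) > P95_TARGETS_MS[endpoint]

samples = [12, 15, 14, 18, 22, 95, 13, 16, 17, 19]
print(p95(samples))                                          # 95
print(over_budget("GET /api/v1/traces/:trace_id", samples))  # False
print(over_budget("POST /api/v1/events", samples))           # True
```

Nearest-rank is deliberately pessimistic at small sample counts, which suits a conservative budget check better than an interpolated percentile.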
### Replay Safety
Target:

- replay acceptance remains cheap
- rebuild and catch-up durations remain predictable enough to reason about maintenance windows

Operator watch threshold:

- replay or rebuild runs taking materially longer than the recorded benchmark profile at the same data scale should trigger investigation
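"Materially longer" can be pinned down by comparing an observed run against the recorded benchmark at the same data scale. The 1.5x tolerance here is an assumed starting point, not a documented threshold.

```python
# Sketch: flag a replay or rebuild that ran materially longer than the
# recorded benchmark at the same data scale. The 1.5x tolerance is an
# assumption to be tuned per installation.

def replay_needs_investigation(observed_s, benchmark_s, tolerance=1.5):
    """True when the observed duration exceeds tolerance * benchmark."""
    return observed_s > tolerance * benchmark_s

print(replay_needs_investigation(95.0, 60.0))  # True
print(replay_needs_investigation(70.0, 60.0))  # False
```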
## First Alerting Rules
The first self-hosted alert posture should be simple:
- alert if `/api/healthz` fails repeatedly
- alert if any projection health response shows open failures
- alert if projection lag stays non-zero for a sustained period
- alert if replay jobs remain `queued` or `running` unexpectedly long
- alert if workflow backlog or escalation counts suddenly spike
For a single-node deployment, these can start as operator-watch rules instead of automated paging if the installation is small.
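As an operator-watch starting point, the five rules can be evaluated together against one node snapshot. Every field name and threshold below is an assumption for illustration; none of them come from the product's API.

```python
# Sketch: the five first alert rules evaluated against one node snapshot.
# All snapshot field names and thresholds are assumed, not a real schema.

def first_alerts(s):
    """Return the names of rules that fire for snapshot s."""
    rules = {
        "healthz failing repeatedly": s["healthz_consecutive_failures"] >= 3,
        "projection open failures": s["projection_open_failures"] > 0,
        "sustained projection lag": s["lag_nonzero_cycles"] >= 3,
        "replay job stuck": s["oldest_active_replay_minutes"] > 30,
        "workflow backlog spike": s["inbox_count"] > 2 * s["inbox_baseline"],
    }
    return [name for name, fired in rules.items() if fired]

snapshot = {
    "healthz_consecutive_failures": 0,
    "projection_open_failures": 1,
    "lag_nonzero_cycles": 4,
    "oldest_active_replay_minutes": 5,
    "inbox_count": 40,
    "inbox_baseline": 25,
}
print(first_alerts(snapshot))
# ['projection open failures', 'sustained projection lag']
```

Keeping every rule a plain boolean over one snapshot makes the same logic trivial to port into a real alerting system later, if the installation outgrows manual watching.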
## Recommended Manual Watch Surfaces
- `/api/healthz`
- `GET /api/v1/projections/health`
- the operator console projection health cards
- the operator console replay panel
- workflow inbox counts and escalation indicators
- local logs with `request_id`, `trace_id`, `tenant_id`, `projection`, `job_id`, and `workflow_id`
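The log fields above make it straightforward to pull every line for one trace out of a local log. This sketch assumes JSON-lines logs carrying those fields; the sample records are illustrative.

```python
# Sketch: filter local structured logs by trace_id. Assumes JSON-lines
# records carrying the correlation fields listed above.
import json

def lines_for_trace(log_lines, trace_id):
    """Return parsed log records whose trace_id matches."""
    out = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise lines
        if record.get("trace_id") == trace_id:
            out.append(record)
    return out

logs = [
    '{"request_id": "r1", "trace_id": "t-42", "projection": "trace_index"}',
    '{"request_id": "r2", "trace_id": "t-7", "job_id": "j-1"}',
    'plain text noise',
]
print(lines_for_trace(logs, "t-42"))
```

The same filter works with any of the other correlation fields (`tenant_id`, `job_id`, `workflow_id`) by swapping the key being matched.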
## What Is Still Out of Scope
Phase 8 does not require:
- hosted multi-tenant SLO segmentation
- external pager integrations
- fleet-wide SRE dashboards
The target is one operator being able to tell whether the self-hosted node is healthy.