Operations Runbook (SRE/DevOps)

Operational guidance for running Mongoose Server in production. Tailor values to your environment.

Process model

Health checks:
Liveness: process is up; agent groups responsive.
Readiness: all configured services started; critical feeds connected; backlogs under thresholds.
Expose via HTTP or JMX depending on your platform conventions.
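The liveness and readiness rules above can be sketched as a small HTTP endpoint. This is a minimal illustration using the JDK's built-in `com.sun.net.httpserver.HttpServer`. The class name `HealthEndpoints`, the `/live` and `/ready` paths, and the method signatures are assumptions for the example and are not part of Mongoose Server.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Hypothetical health endpoint; none of these names come from Mongoose Server. */
final class HealthEndpoints {

    /** Readiness rule from the runbook: services started, feeds connected, backlog under threshold. */
    static boolean isReady(boolean servicesStarted, boolean criticalFeedsConnected,
                           long backlog, long backlogThreshold) {
        return servicesStarted && criticalFeedsConnected && backlog < backlogThreshold;
    }

    /** Maps a probe result to the HTTP status most orchestrators expect. */
    static int statusFor(boolean healthy) {
        return healthy ? 200 : 503;
    }

    /** Starts an HTTP server exposing /live and /ready, backed by the supplied checks. */
    static HttpServer start(int port, BooleanSupplier live, BooleanSupplier ready) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        register(server, "/live", live);
        register(server, "/ready", ready);
        server.start();
        return server;
    }

    private static void register(HttpServer server, String path, BooleanSupplier check) {
        server.createContext(path, exchange -> {
            boolean ok = check.getAsBoolean();
            byte[] body = (ok ? "OK" : "UNAVAILABLE").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(statusFor(ok), body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }
}
```

Keeping the readiness rule in a pure function makes it easy to unit test and to reuse behind a JMX attribute if that is your platform convention.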

Logs:
Log lifecycle events (boot, start, stop) at INFO and problems at WARN/ERROR.
Reduce logging on hot paths (handlers) to avoid perturbing latency.
Metrics:
Queue depths per event source and per handler group.
Publish/consume rates, batch sizes, and idle strategy state.
GC metrics, heap usage, and allocation rate (should be near‑zero on the hot path).
Tracing:
For cross‑service tracing, tag events at the publish/receive boundaries where they leave or enter the process.

Tuning knobs:
Idle strategies (BusySpin for lowest latency, Yielding/Sleeping for CPU efficiency).
Batching levels per pipeline stage.
Core pinning of critical agents on Linux with CPU isolation (cset/taskset).
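To make the idle-strategy trade-off concrete, here is a plain-Java sketch of a backoff idler that spins first (lowest latency), then yields, then parks (lowest CPU use). The class `BackoffIdler` and its parameters are hypothetical and exist only for illustration. In a real deployment, use the idle strategy implementations your Mongoose configuration provides.

```java
import java.util.concurrent.locks.LockSupport;

/** Hypothetical backoff idler; illustrates the trade-off, not Mongoose's own classes. */
final class BackoffIdler {
    enum State { SPINNING, YIELDING, PARKING }

    private final int maxSpins;
    private final int maxYields;
    private final long parkNanos;
    private int idleCount;

    BackoffIdler(int maxSpins, int maxYields, long parkNanos) {
        this.maxSpins = maxSpins;
        this.maxYields = maxYields;
        this.parkNanos = parkNanos;
    }

    /**
     * Call once per duty cycle with the number of events just processed.
     * Returns the state that was applied, which is useful as an idle-state metric.
     */
    State idle(int workCount) {
        if (workCount > 0) {
            idleCount = 0;               // work was done: drop back to the lowest-latency state
            return State.SPINNING;
        }
        idleCount++;
        if (idleCount <= maxSpins) {
            Thread.onSpinWait();         // busy spin: burns a core, wakes fastest
            return State.SPINNING;
        }
        if (idleCount <= maxSpins + maxYields) {
            Thread.yield();              // let other runnable threads use the core
            return State.YIELDING;
        }
        idleCount = maxSpins + maxYields + 1; // clamp so the counter cannot overflow
        LockSupport.parkNanos(parkNanos);     // sleep: cheapest on CPU, slowest to wake
        return State.PARKING;
    }
}
```

A very large `maxSpins` approximates a pure busy spin, while `maxSpins = 0` and `maxYields = 0` approximates a sleeping strategy. Reserve the former for agents pinned to isolated cores.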
Benchmarks:
Reproduce benchmark methodology on production‑like hardware; establish latency SLOs.
Validate with HdrHistogram distributions; track p50/p90/p99/p99.9 in dashboards.
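For a quick sanity check without the HdrHistogram dependency, percentiles can be computed exactly from a small set of recorded samples. The sketch below uses the nearest-rank method on a sorted copy of the samples. It is not HdrHistogram and does not scale to long-running recording, so treat it as a smoke-test helper. `LatencyPercentiles` and its method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: exact nearest-rank percentiles over a small sample set. */
final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] samples, double pct) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        long[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when every tracked percentile is at or below its limit (arrays matched by index). */
    static boolean withinSlo(long[] samples, double[] pcts, long[] limits) {
        if (pcts.length != limits.length) {
            throw new IllegalArgumentException("pcts and limits must have the same length");
        }
        for (int i = 0; i < pcts.length; i++) {
            if (percentile(samples, pcts[i]) > limits[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `withinSlo(samples, new double[] {50, 99, 99.9}, limits)` could gate a benchmark run in CI, with `limits` taken from your latency SLOs.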

Monitor inbound queue depths; alert when they exceed thresholds.
Apply backpressure via publishers or drop/shape non‑critical traffic where acceptable.
Consider lowering batch sizes to reduce latency tails during incidents.
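The shedding policy above can be sketched as a bounded queue that drops non-critical events once depth reaches a threshold and reports a full queue back to the publisher as backpressure. `SheddingQueue`, its `Result` enum, and the critical flag are assumptions for illustration. A production hot path would more likely sit on a ring buffer than on `ArrayBlockingQueue`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded inbound queue that sheds non-critical traffic under load. */
final class SheddingQueue<T> {
    enum Result { ACCEPTED, DROPPED, REJECTED }

    private final ArrayBlockingQueue<T> queue;
    private final int shedThreshold;
    private final AtomicLong dropped = new AtomicLong();

    SheddingQueue(int capacity, int shedThreshold) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.shedThreshold = shedThreshold;
    }

    /**
     * ACCEPTED: enqueued. DROPPED: non-critical event shed at or above the threshold.
     * REJECTED: queue is full, so the publisher should back off and retry.
     */
    Result offer(T event, boolean critical) {
        if (!critical && queue.size() >= shedThreshold) {
            dropped.incrementAndGet();
            return Result.DROPPED;
        }
        return queue.offer(event) ? Result.ACCEPTED : Result.REJECTED;
    }

    /** Alert condition: depth has reached the shedding threshold. */
    boolean overThreshold() {
        return queue.size() >= shedThreshold;
    }

    T poll() {
        return queue.poll();
    }

    int depth() {
        return queue.size();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Export `depth()` and `droppedCount()` as metrics so that alerts fire before critical traffic starts seeing `REJECTED`.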

Startup failures: fail fast if critical services or feeds do not initialize.
Runtime errors in handlers: log with context; consider circuit breakers for flaky integrations.
Poison events: capture and route to quarantine sink with metadata.
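One way to capture poison events is to wrap each handler so that an exception routes the offending event, with metadata, to a quarantine sink instead of stopping the pipeline. The wrapper below is a sketch: `QuarantiningHandler`, `Quarantined`, and the metadata fields chosen are hypothetical, and it models handlers as plain `java.util.function.Consumer` rather than any Mongoose handler interface.

```java
import java.util.function.Consumer;

/** Hypothetical handler wrapper that quarantines events whose processing throws. */
final class QuarantiningHandler<T> implements Consumer<T> {

    /** Metadata captured alongside each poison event. */
    static final class Quarantined<E> {
        final E event;
        final String handlerName;
        final String error;
        final long timestampMillis;

        Quarantined(E event, String handlerName, String error, long timestampMillis) {
            this.event = event;
            this.handlerName = handlerName;
            this.error = error;
            this.timestampMillis = timestampMillis;
        }
    }

    private final String handlerName;
    private final Consumer<T> delegate;
    private final Consumer<Quarantined<T>> quarantineSink;

    QuarantiningHandler(String handlerName, Consumer<T> delegate,
                        Consumer<Quarantined<T>> quarantineSink) {
        this.handlerName = handlerName;
        this.delegate = delegate;
        this.quarantineSink = quarantineSink;
    }

    @Override
    public void accept(T event) {
        try {
            delegate.accept(event);
        } catch (RuntimeException e) {
            // Route the poison event to quarantine and keep processing later events.
            quarantineSink.accept(new Quarantined<>(
                    event, handlerName, e.toString(), System.currentTimeMillis()));
        }
    }
}
```

The sink could be a file, a dead-letter topic, or an in-memory list in tests. Keep it off the hot path so that a burst of poison events does not itself cause latency spikes.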

JVM flags: consistent flags across environments; enable GC, heap, and JIT logging for diagnostics.
CPU isolation (low‑latency tiers): pin agents; isolate cores; disable frequency scaling/turbo if necessary.
Containerization: request guaranteed CPU; avoid noisy neighbors for latency‑sensitive agents.
Configuration:
Verify YAML or programmatic config in CI; keep configs version‑controlled.
Keep a compatibility matrix between Mongoose version and plugin versions.

Versioning: follow semantic versioning; read release notes for breaking changes.
Rolling deploy:
Standalone: run blue/green or canary instances; drain traffic before shutdown.
Embedded: support feature flags to toggle new handlers or plugins.
Smoke tests: include functional and latency sanity checks post‑deploy.

"Handlers not receiving events": confirm feed names and subscription timing; see Troubleshooting and FAQ.
"Latency spikes": check idle strategies, CPU isolation, GC logs, logging levels on hot paths.
"Backlogs growing": observe queue depths; scale out, adjust batching, or shed load.