Skip to content

Operations Runbook (SRE/DevOps)

Operational guidance for running Mongoose Server in production. Tailor values to your environment.

Process model

  • Single JVM hosting one or more Mongoose Server instances.
  • Agents (threads) per event source, handler group, and services as configured.
  • Embeddable or standalone modes both supported.

Health and readiness

  • Health checks:
  • Liveness: process is up; agent groups responsive.
  • Readiness: all configured services started; critical feeds connected; backlogs under thresholds.
  • Expose via HTTP or JMX depending on your platform conventions.

Observability

  • Logs:
  • Set INFO for lifecycle (boot, start, stop) and WARN/ERROR for issues.
  • Reduce logging on hot paths (handlers) to avoid perturbing latency.
  • Metrics:
  • Queue depths per event source and per handler group.
  • Publish/consume rates, batch sizes, and idle strategy state.
  • GC metrics, heap usage, allocation rate (should be near‑zero in hot path).
  • Tracing:
  • For cross‑service tracing, tag publish/receive boundaries if leaving the process.

Capacity and performance

  • Tuning knobs:
  • Idle strategies (BusySpin for lowest latency, Yielding/Sleeping for CPU efficiency).
  • Batching levels per pipeline stage.
  • Core pinning of critical agents on Linux with CPU isolation (cset/taskset).
  • Benchmarks:
  • Reproduce benchmark methodology on production‑like hardware; establish latency SLOs.
  • Validate with HdrHistogram distributions; track p50/p90/p99/p99.9 in dashboards.

Backpressure and overload

  • Monitor inbound queue depths; trigger alerts when over thresholds.
  • Apply backpressure via publishers or drop/shape non‑critical traffic where acceptable.
  • Consider lowering batch sizes to reduce latency tails during incidents.

Failure handling

  • Startup failures: fail fast if critical services or feeds do not initialize.
  • Runtime errors in handlers: log with context; consider circuit breakers for flaky integrations.
  • Poison events: capture and route to quarantine sink with metadata.

Deployment checklist

  • JVM flags: consistent flags across environments; enable GC, heap, and JIT logging for diagnostics.
  • CPU isolation (low‑latency tiers): pin agents; isolate cores; disable frequency scaling/turbo if necessary.
  • Containerization: request guaranteed CPU; avoid noisy neighbors for latency‑sensitive agents.
  • Configuration:
  • Verify YAML or programmatic config in CI; keep configs version‑controlled.
  • Keep a compatibility matrix between Mongoose version and plugin versions.

Upgrades and compatibility

  • Versioning: follow semantic versioning; read release notes for breaking changes.
  • Rolling deploy:
  • Standalone: run blue/green or canary instances; drain traffic before shutdown.
  • Embedded: support feature flags to toggle new handlers or plugins.
  • Smoke tests: include functional and latency sanity checks post‑deploy.

Security and compliance (overview)

  • Secrets: inject via env/secret manager; avoid hardcoding in config.
  • Transport: use TLS for external I/O; validate certificates.
  • Access: restrict admin command surface; audit usage.

Troubleshooting quick reference

  • "Handlers not receiving events": confirm feed names and subscription timing; see Troubleshooting and FAQ.
  • "Latency spikes": check idle strategies, CPU isolation, GC logs, logging levels on hot paths.
  • "Backlogs growing": observe queue depths; scale out, adjust batching, or shed load.

References

  • Troubleshooting and FAQ: guide/troubleshooting.md
  • Benchmarks and methodology: reports/server-benchmarks-and-performance.md
  • Threading model: architecture/threading-model.md