Apr 26, 2026 · 8 min read

OpenTelemetry, Jaeger, Prometheus, Grafana, SLOs — in one sprint

How we went from zero observability to a full stack in a single sprint — without taking the platform down.

The bait

We were running a production Next.js platform on a single Node host behind Cloudflare. Basic alerting via shell scripts existed, but no traces, no charts, no SLO catalogue. Any incident started with tail -f and ended with somebody guessing.

We decided to land the whole stack in one sprint: traces, metrics, dashboards, SLO catalogue, error-budget burn. Here is what happened.

Step 1 — OpenTelemetry SDK

We added a tracing.js that initializes the NodeSDK with auto-instrumentations-node. The lesson everybody eventually learns: a tracing.js on disk does not mean OTel is running. It must be loaded before any app code, via NODE_OPTIONS=--require=./tracing.js in the pm2 ecosystem file. Once we wired that up, 500+ spans/min started flowing into the collector.
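For reference, a minimal tracing.js along these lines — a sketch, not our exact file; the OTLP endpoint and service name are assumptions, adjust them to your collector:

```javascript
// tracing.js — must be loaded before any app code, e.g. via
// NODE_OPTIONS=--require=./tracing.js in the pm2 ecosystem file.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'platform',
  traceExporter: new OTLPTraceExporter({
    url: 'http://127.0.0.1:4318/v1/traces', // collector OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

The --require flag is the whole trick: the SDK patches http, pg, ioredis and friends at module-load time, so if the app's modules load first, you silently get no spans.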

Step 2 — Jaeger

We dropped in jaegertracing/all-in-one. The rookie mistake we almost made: reaching Jaeger from the collector via 127.0.0.1. The two containers sat on different Docker bridge networks, so localhost inside one could never reach the other. We created an hs-obs-net network, attached both containers, pointed the collector at the hostname hs-jaeger:4318, done.
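With assumed container names (hs-jaeger for Jaeger, hs-otelcol for the collector), the fix is three commands:

```shell
docker network create hs-obs-net
docker network connect hs-obs-net hs-jaeger
docker network connect hs-obs-net hs-otelcol
# Containers on the same user-defined network resolve each other by name,
# so the collector can now export to http://hs-jaeger:4318
```

Name resolution only works on user-defined networks, not on the default bridge — which is exactly why two stacks brought up by separate compose files cannot see each other out of the box.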

Step 3 — Prometheus + Grafana

Both were already running. The collector's Prometheus exporter ships 52 hs_* metrics: duration histograms, event-loop p50/p90/p99, memory usage, backup drills. We provisioned 5 Grafana dashboards via the HTTP API: Platform, Runtime, System, Backups, and Pipelines (a scaffold for future exporters).
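The relevant part of the collector config is small. A sketch, with the port as an assumption — the namespace is what produces the hs_ prefix:

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this port
    namespace: hs            # yields hs_* metric names
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```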

Step 4 — SLO catalogue + burn rates

We wrote infra/slo/definitions.yaml with 6 SLOs (portal availability, p95 latency, admin, API, ingest, replication lag), each carrying its sli_query in PromQL plus multi-window burn-rate alerts in the Google SRE style: 14.4× over 1h, 6× over 6h, 1× over 24h. Then we added lib/slo/burn.js, which reads the YAML, queries Prometheus, computes the remaining error budget, and exposes the result as JSON at /api/slo. The public status page consumes it.
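The arithmetic behind those multipliers is worth spelling out: a burn rate of 1× consumes the error budget at exactly the pace that exhausts it at the end of the 30-day window, while 14.4× over 1h means 2% of the monthly budget gone in a single hour. A sketch of the core computation — function names are illustrative, not our actual burn.js:

```javascript
// Burn rate: how fast the error budget is being consumed, relative to the
// rate that would exhaust it exactly at the end of the SLO window.
function burnRate(errorRatio, sloTarget) {
  const budget = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return errorRatio / budget;   // 1.0 == exactly on budget
}

// Multi-window check in the Google SRE style: page only when both the
// long and the short window burn fast, which filters out brief blips.
function shouldPage(shortRatio, longRatio, sloTarget, threshold) {
  return burnRate(shortRatio, sloTarget) >= threshold &&
         burnRate(longRatio, sloTarget) >= threshold;
}

// A 1.44% error ratio against a 99.9% SLO is a 14.4x burn — the 1h page threshold.
console.log(burnRate(0.0144, 0.999).toFixed(1)); // "14.4"
```

The same numbers explain why the thresholds fall as the windows grow: a slow leak that would never trip the 1h alert still pages once it sustains a 1× burn for 24h.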

The remaining gap

The http_route label is not filled by default for Next.js Pages Router — the OTel http instrumentation needs a framework hook. We solved it partially with a requestHook that bucketizes the URL into low-cardinality categories (/portal/*, /api/*, etc.). Propagating the label into metrics is a separate follow-up (collector metricstransform).
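A sketch of that bucketizing hook — the bucket list and helper name are illustrative, though requestHook itself is the option the OTel http instrumentation exposes:

```javascript
// Low-cardinality URL bucketing for the http_route label. Anything outside
// the known prefixes collapses into one bucket so metrics stay bounded.
function bucketizeRoute(url) {
  const path = url.split('?')[0];
  const prefixes = ['/portal', '/api', '/admin', '/_next'];
  for (const p of prefixes) {
    if (path === p || path.startsWith(p + '/')) return p + '/*';
  }
  return path === '/' ? '/' : '/other';
}

// Wired into the http instrumentation (config shape, for illustration):
// new HttpInstrumentation({
//   requestHook: (span, req) => {
//     span.setAttribute('http.route', bucketizeRoute(req.url ?? '/'));
//   },
// });

console.log(bucketizeRoute('/portal/dashboard?tab=1')); // "/portal/*"
```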

What we got

  • Searchable traces in Jaeger — we can finally follow a request through HTTP, DB, and Redis layers.
  • Overall p95: ~2.7s. A clear optimization signal for the next sprint.
  • Alertable error budget — we will not lose a month before noticing SLI degradation.

Want this stack on your platform?
We ship it in 5–7 days for most Node / Next.js platforms.
Get in touch