
Tracewave: Real-Time Anomaly Detection on a Live Firehose
Overview
Most "data" portfolios show a static notebook. Tracewave shows a live distributed system: ingestion → a stream bus → windowed stream processing → online ML → time-series storage → a real-time frontend, plus self-observability. It maintains rolling windows over the Wikimedia EventStreams firehose, runs three anomaly detectors at once, and slides an anomaly card onto the dashboard with the evidence and a confidence score the instant something unusual spikes.
Tech Stack
Challenges
- The firehose is bursty; without bounded buffers, queued events either grow without limit or crash the process.
- Three detectors give three different opinions, and they have to be turned into one trustworthy confidence score.
- A spike that just says "spike" is useless; each anomaly needs a "why."
- Keeping the demo from ever looking empty, even with no backend wired up.
Solution
The same Processor core runs as a single in-memory process in dev or split across Redis + Timescale containers in prod; it is transport-agnostic by design. Events fold into 1-second tumbling windows driven by a supplied clock, which keeps the windowing deterministic and unit-testable. Three online detectors then score each window: a rolling z-score, an EWMA control chart, and river's multivariate Half-Space Trees. The ensemble rewards agreement: when all three detectors fire, the anomaly is corroborated, while a lone detector is suppressed unless its signal is very strong. For the "why," the system keeps a decaying per-dimension baseline and diffs the spiking window against it, so a card reads "+312 edits, ~98% from en.wikipedia.org, namespace 0, bot actors," not just "anomaly."
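The windowing and agreement logic described above can be sketched roughly as follows. This is a minimal illustration, not Tracewave's actual code: the class and function names (`TumblingWindow`, `RollingZScore`, `EwmaChart`, `ensemble_confidence`), the thresholds `fire_at=3.0` and `strong_at=6.0`, and the standard-deviation floor of 1.0 are all assumptions made for the example. The EWMA detector here is a simplified deviation score rather than a full control chart, and the Half-Space Trees detector is left out so the sketch needs only the standard library.

```python
import math
from collections import deque


class TumblingWindow:
    """Counts events in fixed-size buckets using caller-supplied timestamps."""

    def __init__(self, size_s=1.0):
        self.size_s = size_s
        self.bucket = None  # index of the currently open window
        self.count = 0

    def add(self, ts):
        """Record one event at time ts (seconds).

        Returns the windows closed by this event as (start_time, count) pairs.
        Late events (older than the open window) are counted in the open window.
        """
        bucket = int(ts // self.size_s)
        closed = []
        if self.bucket is None:
            self.bucket = bucket
        while self.bucket < bucket:
            closed.append((self.bucket * self.size_s, self.count))
            self.bucket += 1
            self.count = 0
        self.count += 1
        return closed


class RollingZScore:
    """|z| of a value against the last `size` values."""

    def __init__(self, size=60):
        self.values = deque(maxlen=size)

    def score(self, x):
        z = 0.0
        if len(self.values) >= 2:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            # Assumed floor of 1.0 so a flat history does not divide by zero.
            z = abs(x - mean) / max(math.sqrt(var), 1.0)
        self.values.append(x)
        return z


class EwmaChart:
    """Deviation of a value from an exponentially weighted mean and variance."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def score(self, x):
        if self.mean is None:
            self.mean = float(x)
            return 0.0
        diff = x - self.mean
        dev = abs(diff) / max(math.sqrt(self.var), 1.0)  # same assumed floor
        # Score first, then update, so a spike is judged against the old state.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return dev


def ensemble_confidence(scores, fire_at=3.0, strong_at=6.0):
    """Combine per-detector scores into a confidence in [0, 1].

    A detector "fires" at fire_at. A lone firing detector is suppressed
    unless its score reaches strong_at; otherwise confidence is the
    fraction of detectors that fired.
    """
    fired = [s for s in scores if s >= fire_at]
    if not fired:
        return 0.0
    if len(fired) == 1 and fired[0] < strong_at:
        return 0.0
    return len(fired) / len(scores)
```

Because `TumblingWindow.add` takes the timestamp as an argument instead of reading the wall clock, a test can replay a fixed sequence of timestamps and get the same closed windows every time. Each closed window's count would then be passed to every detector's `score`, and the resulting scores to `ensemble_confidence`.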
Outcome
Open the dashboard and it's alive — events flowing, sparklines updating, anomaly cards sliding in with their evidence. Backpressure sheds oldest-first and counts the drops, the ingestor reconnects with backoff and resumes from the last event id, every service exposes Prometheus metrics, and 22 tests cover the windowing and detector math. A deployed link even falls back to an in-browser demo stream so a portfolio link is never a spinner that never resolves.
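The oldest-first shedding mentioned above might look something like the sketch below. The class name `DropOldestBuffer` and its `dropped` attribute are hypothetical, chosen for illustration; in a real service the drop count would presumably be exported as a metric, which is not shown here.

```python
from collections import deque


class DropOldestBuffer:
    """Bounded FIFO that sheds the oldest item when full and counts the drops."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.dropped = 0  # total number of items shed so far
        self._items = deque()

    def push(self, item):
        """Add an item, evicting the oldest one first if the buffer is full."""
        if len(self._items) >= self.capacity:
            self._items.popleft()
            self.dropped += 1
        self._items.append(item)

    def pop(self):
        """Remove and return the oldest item, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)
```

Dropping the oldest items favors freshness, which suits a live dashboard: during a burst the consumer keeps seeing the most recent events, and the `dropped` counter records how much was shed.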
What I'd do differently
It's the full data-platform stack end to end, which is also its weakness — a lot of moving parts to keep healthy. I'd add a proper schema registry for the firehose sources and lean harder on the source registry, so swapping Wikipedia for another stream is genuinely one entry, not a refactor.