
Synthesis: An Autonomous Multi-Agent Research Engine
Overview
Synthesis is what happens when you point a "don't-trust-output-you-can't-trace" instinct straight at language models. The product is the visualization of the agent graph itself: a planner decomposes your question into sub-questions, a pool of researchers chases each one down in parallel (search → fetch → chunk → embed → retrieve), a synthesizer writes a cited report, and a critic verifies every claim against its sources. All of it streams to a live control room over Server-Sent Events (SSE).
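To make the shape of the graph concrete, here is a minimal sketch of the four stages wired together. Every name in it (`Agents`, `Finding`, `runPipeline`, and so on) is hypothetical and chosen for illustration; the only things taken from the description above are the four roles and the fact that researchers run in parallel.

```typescript
// Hypothetical types: the write-up names the roles, not these shapes.
type Finding = { subQuestion: string; passages: string[] };
type Report = { text: string };
type Verdict = { ok: boolean; notes: string[] };

// Each role is an injected function, so real and stub agents are interchangeable.
interface Agents {
  plan(question: string): Promise<string[]>;
  research(subQuestion: string): Promise<Finding>;
  synthesize(question: string, findings: Finding[]): Promise<Report>;
  critique(report: Report, findings: Finding[]): Promise<Verdict>;
}

async function runPipeline(agents: Agents, question: string) {
  // 1. Planner splits the question into sub-questions.
  const subQuestions = await agents.plan(question);

  // 2. Researchers run concurrently. Promise.all returns results in input
  //    order, regardless of which researcher finishes first.
  const findings = await Promise.all(
    subQuestions.map((q) => agents.research(q)),
  );

  // 3. Synthesizer writes the report from all findings.
  const report = await agents.synthesize(question, findings);

  // 4. Critic checks the report against the same findings.
  const verdict = await agents.critique(report, findings);

  return { subQuestions, findings, report, verdict };
}
```

Passing the agents in as an interface keeps the orchestration identical whether the leaves are real providers or offline stubs.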
Tech Stack
Challenges
- Multiple researchers run at once, but SSE is a single ordered stream — fanning parallel agents into one timeline without races was the biggest time sink.
- A model will happily write "[3]" after any sentence; citations had to be verified against the actual cited text, not taken on faith.
- Knowing when all the concurrent producers were genuinely done: detecting completion under load is deceptively hard.
- Making the offline mode a real path, not a toy — the same orchestration, just with different leaves.
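The fan-in and completion problems can be sketched together. This is a minimal many-producer / single-consumer bus, not the project's actual implementation: the `RunEvent` fields, the `EventBus` class, and the `seal()` step are all assumptions made for illustration. The idea is that the stream ends only when the bus has been sealed (no new producers will register) and every registered producer has closed.

```typescript
// Hypothetical event shape; the real RunEvent fields are not given in this write-up.
type RunEvent =
  | { type: "tool_call"; agent: string; tool: string }
  | { type: "source"; agent: string; url: string };

class EventBus<T> {
  private queue: T[] = [];
  private waiter: (() => void) | null = null;
  private openProducers = 0;
  private sealed = false;

  /** Register a producer and get its emit/close handle. */
  producer(): { emit: (event: T) => void; close: () => void } {
    if (this.sealed) throw new Error("bus is sealed; no new producers");
    this.openProducers++;
    let closed = false;
    return {
      emit: (event) => {
        if (closed) throw new Error("producer already closed");
        this.queue.push(event);
        this.wake();
      },
      close: () => {
        if (closed) return; // closing twice must not double-decrement
        closed = true;
        this.openProducers--;
        this.wake();
      },
    };
  }

  /** Declare that every producer has been registered. */
  seal(): void {
    this.sealed = true;
    this.wake();
  }

  private wake(): void {
    const w = this.waiter;
    this.waiter = null;
    if (w) w();
  }

  /** Single consumer: yields events in arrival order, ends when all producers are done. */
  async *drain(): AsyncGenerator<T> {
    while (true) {
      while (this.queue.length > 0) yield this.queue.shift() as T;
      if (this.sealed && this.openProducers === 0) return;
      // Nothing queued and producers still open: park until the next emit/close/seal.
      await new Promise<void>((resolve) => (this.waiter = resolve));
    }
  }
}

/** Format one event as an SSE frame: a data line terminated by a blank line. */
function toSseFrame(event: RunEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Because JavaScript runs the queue check and the waiter assignment without interleaving, a wake-up cannot be lost between them. An SSE route would iterate `bus.drain()` and write `toSseFrame(event)` for each item.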
Solution
Agents never touch React. They emit typed RunEvents onto a many-producer / single-consumer event bus, and the SSE route drains it, so the UI is a pure projection of what the graph actually did. Citation integrity is a first-class check: the critic re-reads each claim against the passages it cites, labels confidence (supported / single-source / disputed), and can trigger a bounded revise loop. Every capability (LLM, search, embeddings, vector store) is an interface with a real and a mock implementation, selected at the boundary by environment variable, so the agents only ever see the interface. There's even a Model Context Protocol (MCP) server exposing the research tools.
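A minimal sketch of the real-versus-mock boundary, using embeddings as the example capability. The names here are assumptions for illustration only: `Embedder`, `MockEmbedder`, `RemoteEmbedder`, and the variables `OFFLINE` and `EMBEDDINGS_API_KEY` are not taken from the project, and the remote class is a placeholder rather than a call to any real provider API.

```typescript
// The only thing agents ever see.
interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

// Deterministic offline stand-in: buckets character codes into a small vector.
class MockEmbedder implements Embedder {
  constructor(readonly dims: number = 8) {}

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector = new Array<number>(this.dims).fill(0);
      for (let i = 0; i < text.length; i++) {
        vector[text.charCodeAt(i) % this.dims] += 1;
      }
      return vector;
    });
  }
}

// Placeholder for the real implementation; the provider call is deliberately omitted.
class RemoteEmbedder implements Embedder {
  constructor(readonly apiKey: string) {}

  async embed(texts: string[]): Promise<number[][]> {
    throw new Error(`RemoteEmbedder is not implemented in this sketch (${texts.length} texts)`);
  }
}

type Env = Record<string, string | undefined>;

// Selection happens once, at the boundary. Variable names are hypothetical.
function createEmbedder(env: Env): Embedder {
  const key = env["EMBEDDINGS_API_KEY"];
  if (env["OFFLINE"] === "1" || !key) return new MockEmbedder();
  return new RemoteEmbedder(key);
}
```

In a Node process this would be called as `createEmbedder(process.env)`. Taking the environment as a parameter keeps the selection logic testable without touching global state.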
Outcome
Point it at a question and the lanes fill in real time: tool calls landing, sources registering, a report typing itself with inline citations, a per-claim confidence ledger, and an interactive evidence graph. At the end you get a shareable /run/<id> URL and a token and USD cost meter. Dangling citations get caught and surfaced, so the confidence labels come from verification, not from the model's own self-assessment.
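Here is a hedged sketch of how a per-claim ledger could be derived. Several things in it are assumptions rather than the project's actual rules: one claim per line, `[n]` markers for citations, an injected `Judge` predicate standing in for the critic's re-read (which in practice might be an LLM call), and the mapping from support counts to labels.

```typescript
type Confidence = "supported" | "single-source" | "disputed";

interface ClaimCheck {
  claim: string;
  cited: number[];    // citation numbers found on the line
  dangling: number[]; // cited numbers with no registered source
  confidence: Confidence;
}

// Stand-in for the critic: does this passage actually back this claim?
type Judge = (claim: string, passage: string) => boolean;

function citedIndices(line: string): number[] {
  const pattern = /\[(\d+)\]/g;
  const found: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    const n = Number(match[1]);
    if (found.indexOf(n) === -1) found.push(n); // ignore repeated markers
  }
  return found;
}

function checkClaims(
  report: string,
  passages: Map<number, string>,
  judge: Judge,
): ClaimCheck[] {
  return report
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const cited = citedIndices(line);
      const claim = line.replace(/\s*\[\d+\]/g, "").trim();
      const dangling = cited.filter((n) => !passages.has(n));
      const supporting = cited.filter((n) => {
        const passage = passages.get(n);
        return passage !== undefined && judge(claim, passage);
      });
      // Assumed labeling rule: any dangling marker or zero support is "disputed".
      const confidence: Confidence =
        dangling.length > 0 || supporting.length === 0
          ? "disputed"
          : supporting.length === 1
            ? "single-source"
            : "supported";
      return { claim, cited, dangling, confidence };
    });
}
```

The point of the shape is that the label is computed from the cited text, never read from the model's output.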
What I'd do differently
SSE is one-way, so "stop" is a client-side AbortController rather than a real server signal; WebSockets would buy genuine mid-run control. The in-memory vector store is perfect for demos, but pgvector is the path to anything that survives a restart. The bounded revise loop is a cost-versus-quality dial, not a guarantee.
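For a sense of why the in-memory store is demo-grade, here is a minimal sketch of one: a plain array scanned with cosine similarity on every query. The `VectorStore` interface and its method names are hypothetical. They are kept async so that a database-backed implementation (pgvector, for instance) could in principle sit behind the same interface, but nothing here reflects the project's real code.

```typescript
interface Hit {
  id: string;
  text: string;
  score: number;
}

// Hypothetical interface; async so a persistent backend could implement it too.
interface VectorStore {
  add(id: string, vector: number[], text: string): Promise<void>;
  query(vector: number[], k: number): Promise<Hit[]>;
}

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Define similarity with a zero vector as 0 instead of dividing by zero.
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class InMemoryVectorStore implements VectorStore {
  // Everything lives in this array, so it is lost when the process exits.
  private rows: { id: string; vector: number[]; text: string }[] = [];

  async add(id: string, vector: number[], text: string): Promise<void> {
    this.rows.push({ id, vector, text });
  }

  async query(vector: number[], k: number): Promise<Hit[]> {
    // Linear scan: score every row, sort best-first, keep the top k.
    return this.rows
      .map((row) => ({ id: row.id, text: row.text, score: cosine(vector, row.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

The linear scan is fine for the handful of chunks a single run produces; both the lack of persistence and the O(n) query are what a real database would address.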