Medibill Copilot
An AI-powered legal navigator helping low-income patients fight medical debt — turning confusing bills, collection letters, and court documents into actionable remedies with citations to the exact statutes that protect them.
The problem
Medical debt is the most common civil legal problem facing low-income Americans. The patients who need help the most — those facing debt collectors, civil lawsuits, wage garnishment — are also the least equipped to navigate a system designed to be opaque. Charity care programs exist. Patient rights under the FDCPA and state law are real and enforceable. Yet most people never learn about them in time.
Medibill Copilot was built to close that gap. The mission: take a patient's documents — a hospital bill, an Explanation of Benefits, a collection notice, a court summons, a writ of execution — and turn them into a clear picture of what is wrong, what rights apply, and what to do next. For free.
What it does
A patient uploads their documents. The system determines which of five situation tracks applies — billing dispute, debt collection, active lawsuit, small claims court, or post-judgment enforcement — and routes through the appropriate analysis pipeline. On the other end, the patient receives:
- A line-item reconciliation of their bill against their EOB, flagging amount mismatches, duplicate charges, upcoding, and CPT code issues
- An analysis of their legal rights under FDCPA, FCRA, WA consumer debt law (RCW 19.16), and the No Surprises Act
- Charity care eligibility under Washington's RCW 70.170 (free care below 200% of the federal poverty level, reduced-cost care up to 300%)
- A draft letter — dispute, cease-contact, charity care request, or affirmative defense summary — with citations to the specific statutes it invokes
- For lawsuits: identification of counterclaims (FDCPA 15 USC 1692k, WA CPA triple damages) and affirmative defenses
- For garnishments: exemption analysis, hardship calculation, and a structured attorney brief package ready for a legal aid referral
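The charity care tiers above can be sketched as a small classifier. This is illustrative only: the 200%/300% thresholds come from the description above, and the FPL dollar amount is passed in by the caller rather than hard-coded, since it varies by year and household size.

```python
def charity_care_tier(annual_income: float, fpl_for_household: float) -> str:
    """Classify charity care eligibility by income as a fraction of the
    federal poverty level (FPL). Thresholds follow the 200%/300% split
    described above; function and field names are illustrative."""
    ratio = annual_income / fpl_for_household
    if ratio < 2.0:
        return "free"        # below 200% FPL: full charity care
    if ratio <= 3.0:
        return "reduced"     # 200-300% FPL: discounted care
    return "ineligible"      # above 300% FPL, under this sketch
```

Keeping the FPL figure out of the code also keeps the eligibility logic trivially unit-testable, which matters for a number that changes every year.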
Architecture
Five tracks, three compiled graphs
The pipeline is built on LangGraph. Rather than one monolithic graph, three compiled graphs handle different phases:
- Triage graph — intake + parse. Runs on document upload. Classifies the case track and extracts billing codes, provider identifiers, and key dates before the user submits income data.
- Analysis graph — the full track-specific pipeline. Each track runs a different sequence; billing runs 6 agents, post-judgment runs 7.
- Draft graph — draft + verify only, for reruns when documents have not changed.
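The split between the three compiled graphs can be summarized as a dispatch on pipeline phase. A minimal sketch, with graph names as plain strings standing in for the real compiled LangGraph objects:

```python
def select_graph(phase: str, docs_changed: bool) -> str:
    """Pick which compiled graph to run for a given request.
    Phase values and return names are illustrative."""
    if phase == "upload":
        return "triage_graph"      # classify track, extract codes and dates
    if phase == "rerun" and not docs_changed:
        return "draft_graph"       # draft + verify only; skip re-analysis
    return "analysis_graph"        # full track-specific pipeline
```

The point of the split is the middle branch: a rerun with unchanged documents should not pay for the full analysis pipeline again.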
Sixteen agents run in total across the five tracks. Each has a single bounded responsibility and its own system prompt. The rights agent only analyzes consumer debt law. The reconcile agent only compares billing line items. The draft agent only generates letters from KB-verified templates. The verify agent checks everything before delivery.
Deterministic routing for post-judgment cases
The LLM is constrained to four tracks during intake. The fifth track — post-judgment — is assigned by deterministic keyword matching on document types and filenames in Python, not by the model. This was introduced as a regression fix after the LLM confused attorney threat letters mentioning "we may seek default judgment" with actual writs of execution. Hard-coded logic with unit-test coverage is more reliable than a prompt instruction when the cost of misclassification is high.
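The override can be sketched as a pure function over document metadata. Marker strings and field names here are illustrative; the key design point is that matching runs on classified document types and filenames, never on letter body text, which is what keeps threat letters from triggering it.

```python
POST_JUDGMENT_MARKERS = ("writ_of_execution", "writ_of_garnishment",
                         "garnishment_notice", "judgment")

def is_post_judgment(docs: list[dict]) -> bool:
    """Deterministic routing override: return True when any uploaded
    document's type or filename carries a post-judgment marker.
    Body text is deliberately ignored, so an attorney letter that
    merely mentions "default judgment" cannot flip the track."""
    for doc in docs:
        haystack = f"{doc.get('doc_type', '')} {doc.get('filename', '')}".lower()
        if any(marker in haystack for marker in POST_JUDGMENT_MARKERS):
            return True
    return False
```

A function this small is easy to cover exhaustively, including a named regression case for the threat-letter confusion.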
Hallucination prevention for legal citations
After every LLM call, the system builds a set of valid KB chunk IDs from the actual retrieval and recursively walks the entire output to remove any citation whose reference ID is not in that set. Invented citations are stripped before they reach the user. A wrong statute reference in a legal letter is not just useless — it is actively harmful to someone trying to use it.
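The recursive walk can be sketched as follows. The `chunk_id` key is an assumption standing in for however citations are actually tagged; the mechanism — allow-list the IDs that were really retrieved, drop everything else — is what matters.

```python
def strip_invalid_citations(node, valid_ids: set[str]):
    """Recursively walk a JSON-like agent output and remove any citation
    whose chunk id was not part of the actual KB retrieval. A dict
    carrying 'chunk_id' is treated as a citation (illustrative key)."""
    if isinstance(node, dict):
        if "chunk_id" in node:                       # looks like a citation
            return node if node["chunk_id"] in valid_ids else None
        return {k: strip_invalid_citations(v, valid_ids) for k, v in node.items()}
    if isinstance(node, list):
        cleaned = (strip_invalid_citations(item, valid_ids) for item in node)
        return [item for item in cleaned if item is not None]
    return node
```

Because the allow-list is built from the retrieval that actually ran, the model cannot launder an invented statute through a plausible-looking chunk ID.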
Dual-model RAG
Legal documents are embedded with voyage-law-2, a legal-domain fine-tuned model. General documents use voyage-3. Mixed-category queries — for example, an agent that needs both statute text and access-to-justice research data — run two parallel vector searches and merge results with Reciprocal Rank Fusion before fusing with BM25 keyword results. This avoids the semantic mismatch of querying law-specialized vectors with embeddings from a general-purpose model.
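Reciprocal Rank Fusion itself is a few lines: each result list contributes `1 / (k + rank)` per chunk, and chunks surfacing in both searches rise to the top. A minimal sketch over chunk-ID rankings:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over best-first ranked lists of chunk ids.
    Each chunk scores sum(1 / (k + rank)) across the lists it appears
    in; k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked second in both the legal and general searches beats one ranked first in only one of them, which is exactly the behavior a mixed-category query wants.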
Two evidence channels
Every agent prompt enforces a separation between evidence (facts extracted from the patient's documents, with doc ID, filename, page, and verbatim quote) and kb_citations (legal authority from the knowledge base). Agents are explicitly instructed never to present survey statistics as legal authority. This distinction flows through to how findings are displayed and how the verify agent cross-checks outputs.
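The two channels can be made explicit in the output schema itself, so the separation is a type-level fact rather than only a prompt instruction. A sketch with assumed field names:

```python
from typing import TypedDict

class Evidence(TypedDict):
    """Channel 1: a fact extracted from the patient's own documents."""
    doc_id: str
    filename: str
    page: int
    quote: str        # verbatim text supporting the finding

class KBCitation(TypedDict):
    """Channel 2: legal authority retrieved from the knowledge base."""
    chunk_id: str
    source: str       # e.g. a statute section, never survey data

class Finding(TypedDict):
    claim: str
    evidence: list[Evidence]
    kb_citations: list[KBCitation]
```

With the channels separated structurally, the verify agent can check each side against its own ground truth: quotes against the parsed documents, citations against the retrieval.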
Tech stack
| Layer | Technology |
|---|---|
| Agent orchestration | LangGraph 0.4 |
| LLM | Anthropic Claude (claude-sonnet) |
| Embeddings + reranking | Voyage AI — voyage-law-2 + voyage-3 |
| Backend | FastAPI + PostgreSQL with pgvector |
| Document processing | pypdf, pdf2image, Tesseract OCR |
| Storage | MinIO (local) / Cloudflare R2 (staging) |
| Frontend | React + TypeScript + Tailwind CSS |
| Deployment | Fly.io (backend) + Vercel (frontend) |
Testing
334 unit tests, no API keys or running server required. Coverage includes:
- The deterministic post-judgment routing override — every combination that should and should not trigger it, including a named regression case for attorney threat letters
- Field contracts for each agent's output: findings accumulate rather than replace, and every optional field degrades gracefully when missing
- CaseState TypedDict coverage across all 16 pipeline fields
- End-to-end regression manifests for all five tracks using real training documents and synthetic consumer scenarios
- Eight gray-area edge cases, including duplicate accounts held by multiple collectors, statute-of-limitations boundary dates, the bankruptcy automatic stay during active collection, post-judgment bankruptcy discharge, and mistaken identity from similar names
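The accumulate-not-replace contract is the kind of thing these tests pin down. A sketch of what such a test looks like, with `merge_findings` standing in for the state reducer (in LangGraph terms, an `Annotated[list, operator.add]`-style channel):

```python
import operator

def merge_findings(existing: list[dict], update: list[dict]) -> list[dict]:
    """Reducer for the findings field: updates append, never overwrite,
    so no agent can silently discard earlier agents' work. Illustrative
    stand-in for the real state channel."""
    return operator.add(existing, update)

def test_findings_accumulate():
    state = merge_findings([], [{"agent": "reconcile", "claim": "duplicate charge"}])
    state = merge_findings(state, [{"agent": "rights", "claim": "FDCPA violation"}])
    assert len(state) == 2                   # both agents' findings survive
    assert state[0]["agent"] == "reconcile"  # arrival order preserved

test_findings_accumulate()
```

Contract tests like this need no API keys or server, which is how a 334-test suite stays cheap to run on every change.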
What building it taught me
The central design tension was power versus trust. The system can do a great deal — identify FDCPA violations, flag surprise billing breaches, analyze garnishment math, draft letters invoking specific statutes. But the people using it are in stressful, high-stakes situations. They cannot independently verify a legal citation. They are relying on the system.
That shifted every architecture decision from "can the system produce good output" to "can the user trust what the system produces." Deterministic routing, citation stripping, evidence channel separation, the verify agent — each of those is an answer to the second question, not the first.
The Booth framing that applied most directly was about decision-making under uncertainty and what it means to design for a user who cannot audit your output. Trust is not something you add to the interface at the end. It is a property of the architecture. You build it in or you do not have it.
This is the kind of AI work I want to keep doing: systems that help people in genuinely difficult situations, built in a way that earns belief rather than just producing confident text.