Projects · March 15, 2026 · 2 min read

Building Medibill Copilot at Chicago Booth

What it looked like to build an evidence-first AI workflow in a domain where confidence without proof is a liability.

One of the most useful things about building Medibill Copilot was that it forced a simple question to stay in front of us: what would make this system trustworthy to the person who actually has to use it?

That question sounds obvious, but it changes the build. Once you stop optimizing for a generic chatbot experience and start optimizing for a consequential workflow, the center of gravity moves. Retrieval quality matters more. Provenance matters more. Workflow state matters more. Quiet failure modes matter much more.

At Booth, I spend a lot of time around frameworks for decision-making under uncertainty. Medibill Copilot gave me a practical arena to apply that mindset. In medical billing, ambiguity is not an abstraction. Documents conflict. Context is partial. The cost of an error lands downstream in operations.

So the goal was never "use AI to answer billing questions." The goal was to build a system that helps a reviewer move faster without hiding where the answer came from.

That is why the architecture leaned into evidence-first extraction and a more explicit workflow. A RAG layer gave the system a way to stay anchored to the right documents. LangGraph gave the workflow durable structure instead of letting everything collapse into one giant prompt. Regression testing gave us a way to keep progress honest over time.
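To make "evidence-first" concrete, here is a minimal sketch in plain Python. It is an illustration, not the Medibill Copilot code: the names `Evidence`, `Finding`, and `verify_finding`, the sample document, and the rule that a value must appear inside a verbatim quote from the cited document are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Evidence:
    doc_id: str  # which retrieved document the claim points to
    quote: str   # verbatim span a reviewer can check by eye


@dataclass(frozen=True)
class Finding:
    name: str
    value: Optional[str]
    evidence: Optional[Evidence]
    status: str  # "supported" or "needs_review"


def verify_finding(
    name: str,
    value: str,
    evidence: Optional[Evidence],
    documents: Dict[str, str],
) -> Finding:
    """Keep a value only if its quote is found verbatim in the cited document."""
    if evidence is None:
        return Finding(name, None, None, "needs_review")
    text = documents.get(evidence.doc_id, "")
    quote_found = bool(evidence.quote) and evidence.quote in text
    if quote_found and value in evidence.quote:
        return Finding(name, value, evidence, "supported")
    # Drop the unverified value but keep the citation so a reviewer can see what failed.
    return Finding(name, None, evidence, "needs_review")


# Hypothetical document store and two candidate extractions.
documents = {"claim-001": "Procedure code 99213 billed on 2026-01-10."}
ok = verify_finding(
    "procedure_code", "99213",
    Evidence("claim-001", "Procedure code 99213"), documents,
)
bad = verify_finding(
    "procedure_code", "99214",
    Evidence("claim-001", "Procedure code 99214"), documents,
)
```

The point of the design is that an unsupported value never reaches the reviewer looking like a supported one. Pairs like `ok` and `bad` also make natural regression cases, since the expected status of each is fixed.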

The deeper lesson for me was that trust is a product decision expressed through architecture. You cannot bolt it on later with a reassuring UI. If the system cannot show evidence, reveal uncertainty, and degrade predictably, people will correctly hesitate to rely on it.
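One way to picture "degrade predictably" is as an explicit routing rule rather than a behavior left to the model. The sketch below is a simplified assumption, not the production workflow: the `Route` names and the two inputs (`num_sources`, `sources_agree`) are hypothetical stand-ins for whatever signals a real system would have.

```python
from enum import Enum


class Route(Enum):
    DRAFT_WITH_CITATIONS = "draft_with_citations"
    FLAG_FOR_REVIEW = "flag_for_review"
    ABSTAIN = "abstain"


def route_answer(num_sources: int, sources_agree: bool) -> Route:
    """Pick an outcome from the evidence available, never from model confidence alone."""
    if num_sources <= 0:
        # Nothing to cite: say so instead of guessing.
        return Route.ABSTAIN
    if not sources_agree:
        # Documents conflict: surface the conflict to the reviewer.
        return Route.FLAG_FOR_REVIEW
    return Route.DRAFT_WITH_CITATIONS
```

Because every input maps to one of three named outcomes, a reviewer can learn what each outcome means, and a test suite can pin the mapping down.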

That is the kind of AI work I want to keep doing: practical systems that help people operate better because the system is designed to earn belief, not just attention.