
Jul 24, 2025
Written by: Muthu Alagappan, Tony Sun, Jessica Fan
Earlier this month, Microsoft published “The Path to Medical Superintelligence,” a compelling look at what it might take to build an AI system capable of matching or exceeding expert physician judgment. The approach is novel: rather than assessing whether LLMs can accurately answer medical questions as a black box, Microsoft fleshed out a more realistic simulation of a diagnostic pipeline. Their design comprised five distinct AI agents acting as a virtual care team, proposing differential diagnoses, debating diagnostic actions, and navigating budget constraints to simulate real-world clinical pressures.
It’s one of the most ambitious designs to date, and a valuable proof of concept for where LLMs are headed. But as physicians building clinical AI at Counsel, we approach these developments with a more grounded question: What does it take for a system like this to actually help patients in the real world?
Accuracy is necessary, but it’s not the only thing that matters. In our experience deploying medical AI into live patient workflows, three constraints consistently determine whether a technically accurate system translates into better patient care:
Latency
Microsoft’s multi-agent framework relies on its constituent agents to individually reason and vote on the next action. It’s a smart structure, but decision-by-committee is only as fast as the slowest agent’s response. Simple, routine queries, like evaluating a dry cough, can take several minutes to resolve under a multi-agent framework. That kind of delay is a non-starter in actual asynchronous care, where timely responses drive both patient satisfaction and clinical throughput.
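To see why, consider a minimal sketch of a parallel voting round. This is not Microsoft’s orchestration code; the agent roles and latencies below are invented for illustration. Even when every agent runs concurrently, the round cannot complete before its slowest member returns:

```python
import asyncio
import random
import time

# Hypothetical agent roles for a diagnostic panel (illustrative only).
AGENT_ROLES = ["hypothesis", "test-chooser", "challenger", "cost-steward", "checklist"]

async def agent_vote(role: str) -> str:
    # Stand-in for an LLM call; real per-agent latencies vary widely.
    delay = random.uniform(1.0, 8.0)
    await asyncio.sleep(delay)
    return f"{role}: voted after {delay:.1f}s"

async def committee_round() -> list[str]:
    # Run all agents in parallel; gather() still waits for the slowest one.
    return await asyncio.gather(*(agent_vote(role) for role in AGENT_ROLES))

start = time.perf_counter()
votes = asyncio.run(committee_round())
print("\n".join(votes))
print(f"round wall-clock time: {time.perf_counter() - start:.1f}s")
```

And a sequential debate, where agents respond to one another in turn across multiple rounds, is worse still: latencies add instead of taking the max.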
At Counsel, we’ve built our AI systems to minimize response latency without sacrificing medical quality. Responsiveness is foundational to usability, especially in high-volume settings, so for us, latency is not only a performance metric, but also a clinical requirement.
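One generic pattern for keeping routine queries fast, sketched here under stated assumptions rather than as a description of Counsel’s internals (every function and keyword list below is a hypothetical stand-in), is to route common complaints down a low-latency single-model path and reserve slower deliberation for cases that need it:

```python
ROUTINE_KEYWORDS = {"cough", "congestion", "refill", "rash", "sore throat"}

def is_routine(query: str) -> bool:
    # Toy triage heuristic; a production router would be a trained classifier.
    return any(kw in query.lower() for kw in ROUTINE_KEYWORDS)

def fast_single_model(query: str) -> str:
    # Stand-in for a single low-latency model call.
    return f"[fast path] response to: {query}"

def deliberate_reasoning(query: str) -> str:
    # Stand-in for a slower, multi-step reasoning pass.
    return f"[deliberate path] response to: {query}"

def answer_query(query: str) -> str:
    # Route the common case to the fast path so routine questions
    # never pay the committee's worst-case price.
    return fast_single_model(query) if is_routine(query) else deliberate_reasoning(query)

print(answer_query("I have a dry cough that won't go away"))
```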
Workflow Integration
Clinical AI can’t live outside of clinicians’ existing workflows. Systems that require switching to a separate platform might work in academia, but they won’t translate to busier patient care settings, and they won’t be adopted at scale.
That’s why Counsel’s Clinician Cockpit is built as an embedded layer within our homegrown EHR. Our providers don’t need to toggle between platforms to manage patient care. Instead, all relevant patient data, such as labs, medications, imaging, symptoms, and chronic conditions, are available in a single view. After-visit notes flow directly back into the Health Information Exchange, keeping the broader care team aligned.
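As a rough illustration of what “a single view” means as a data structure (a hypothetical schema for the sake of the example, not Counsel’s actual data model), the embedded layer can be thought of as one record aggregating every clinically relevant stream:

```python
from dataclasses import dataclass, field

# Hypothetical single-view record; field names are illustrative only.
@dataclass
class PatientView:
    patient_id: str
    labs: list[str] = field(default_factory=list)
    medications: list[str] = field(default_factory=list)
    imaging: list[str] = field(default_factory=list)
    symptoms: list[str] = field(default_factory=list)
    chronic_conditions: list[str] = field(default_factory=list)

view = PatientView(
    patient_id="demo-123",
    labs=["A1c 6.1%"],
    medications=["lisinopril 10 mg daily"],
    symptoms=["dry cough x 3 weeks"],
    chronic_conditions=["hypertension"],
)
```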
Benchmark vs Reality
Microsoft’s evaluation relies heavily on NEJM challenge cases, which feature complex presentations of rare diseases intended to test the upper bounds of diagnostic reasoning. While impressive, these cases are not representative of the questions that drive daily volume in virtual care. In the real world, most patients aren’t presenting with a rare constellation of symptoms that would require lymphoma testing and bloodwork to get right. Instead, they’re reaching out about common, routine complaints: a cough that won’t go away, URI symptoms, abdominal pain, or medication side effects.
These are the types of cases we see most at Counsel, so our models are trained and evaluated on these real-world distributions. A system optimized for the NEJM challenge set may struggle to triage these everyday workflows efficiently.
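The gap is easy to formalize: weight accuracy by how often each case type actually occurs, not by how often it appears in a benchmark. The case mix and per-category accuracies below are entirely invented, but they show how little a rare-case category moves the number patients actually experience:

```python
# Hypothetical case-mix weights for virtual-care volume (invented numbers).
CASE_MIX = {
    "URI symptoms": 0.30,
    "persistent cough": 0.20,
    "abdominal pain": 0.15,
    "medication side effects": 0.15,
    "rare/complex (NEJM-style)": 0.01,
    "other routine": 0.19,
}

# Per-category accuracies, also invented for illustration only.
accuracy = {
    "URI symptoms": 0.97,
    "persistent cough": 0.95,
    "abdominal pain": 0.90,
    "medication side effects": 0.94,
    "rare/complex (NEJM-style)": 0.80,
    "other routine": 0.93,
}

# Expected real-world accuracy: sum over categories of
# (category frequency) x (accuracy on that category).
expected = sum(CASE_MIX[c] * accuracy[c] for c in CASE_MIX)
print(f"case-mix weighted accuracy: {expected:.3f}")
```

With NEJM-style cases at roughly 1% of volume in this toy mix, even a large accuracy gain on them barely shifts the weighted total, while a few points on URI symptoms or cough move it far more.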
This study is innovative and a milestone in clinical LLM development. But when it comes to deploying medical AI in the real world, accuracy is only half the equation. Practicality is the other half: latency, workflow integration, and benchmark alignment determine whether a system will be used, trusted, and ultimately able to drive better patient outcomes.