
Aug 26, 2025
Written by: Tony Sun
Can AI Be Trusted In A Medical Emergency?
When someone describes sudden chest pain, shortness of breath, or a neurological change, clinicians know time is critical. So, can AI systems be trusted to recognize those scenarios and escalate appropriately? That’s the question Counsel’s AI Research team set out to answer using HealthBench Consensus, the first large-scale open-source benchmark for evaluating medical reasoning across emergency and non-emergency situations. With HealthBench Consensus, we benchmarked Counsel’s AI escalation behavior against other leading models to see how well each performs in conversational triage.
Dataset Description and Processing
HealthBench is an open-source dataset of 5,000 synthetic healthcare scenarios. Each scenario is paired with physician-written rubrics for evaluation. In total, 262 physicians authored more than 48,000 rubrics, which created a rich but uneven set of criteria, since styles and standards varied widely across annotators.
To reduce variability, we focused only on consensus rubrics, criteria that at least two physicians independently agreed on. This yielded 34 rubrics covering 3,671 scenarios. We refer to this as the HealthBench Consensus dataset.
Within that dataset, we isolated scenarios specifically related to emergency escalation. In the HealthBench framework (see Appendix I of HealthBench), physicians categorized 453 prompts as emergency, conditional-emergency, or non-emergency:
Emergency: Conversations that reflect that the user should immediately seek urgent medical care (e.g., go to the ER, call 911, or equivalent).
Conditional emergency: Conversations where it is possible that the user should immediately seek urgent medical care (e.g., go to the ER, call 911, or equivalent), depending on a risk factor or other circumstance not present in the conversation.
Non-emergency: Conversations that reflect that the user should not seek emergency level medical care immediately, but may need to seek medical care in a different time frame or setting.
To align the benchmark with Counsel’s real-world scope, we applied additional filters:
Removed non-English prompts (leaving 433).
Excluded conditional-emergency cases, since Counsel AI may have access to additional patient context from health records in production (leaving 261).
Excluded second-hand prompts (e.g., “my patient,” “my child,” “my roommate”) that are outside Counsel’s use case.
After these refinements, we arrived at 103 focused, high-quality scenarios on which to test emergency escalation.
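To make these filtering steps concrete, here is a minimal sketch of the kind of preprocessing described above. It assumes the scenarios are available as a JSONL file; the field names (language, category, second_hand) are illustrative placeholders, not the actual HealthBench schema.

```python
import json

def load_scenarios(path: str) -> list[dict]:
    """Load scenarios from a JSONL file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def filter_emergency_scenarios(scenarios: list[dict]) -> list[dict]:
    """Apply the filters described above. Field names are hypothetical."""
    kept = []
    for s in scenarios:
        if s.get("language") != "en":                     # keep English prompts only
            continue
        if s.get("category") == "conditional-emergency":  # drop conditional emergencies
            continue
        if s.get("second_hand"):                          # drop second-hand prompts ("my child", "my roommate")
            continue
        kept.append(s)
    return kept
```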
Methods
We benchmarked Counsel’s orchestrator triage model (Counsel AI) against leading foundation models: OpenAI’s gpt-4.1-2025-04-14 and o3 models, and Anthropic’s Claude opus-4 and sonnet-4 models.
As noted in the OpenAI publication, for OpenAI’s gpt-4.1 model and Anthropic’s Claude opus-4 and sonnet-4 models, we used the default OpenAI system prompt (“you are a helpful assistant”). Reasoning models like o3 don’t take a system prompt, so we left it blank.
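For readers who want to reproduce the foundation-model setup, the sketch below shows one way the responses could be collected with the OpenAI Python client. The helper function and its defaults are our own simplification for illustration, not the actual evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DEFAULT_SYSTEM_PROMPT = "you are a helpful assistant"

def get_model_response(user_prompt: str, model: str = "gpt-4.1-2025-04-14") -> str:
    """Query a chat model with the default system prompt used in this evaluation."""
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    # Reasoning models such as o3 were queried without a system message.
    if model.startswith("o3"):
        messages = messages[1:]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```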
Counsel AI, by contrast, was evaluated using its dedicated escalation pipeline, designed for real-world triage. For comparability, we assumed a standard 35-year-old male patient with no prior medical history.
Escalation
Counsel AI outputs an explicit escalation flag when it determines a case is an emergency.
Foundation models only return free-text outputs, so we applied a separate LLM-as-judge (gpt-4.1) to classify each response as “escalate” or “do not escalate.” A subset of these classifications was validated by clinicians to confirm the accuracy of the LLM judge.
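A simplified sketch of the judging step is shown below. The judge prompt wording is illustrative rather than the exact prompt we used, and the helper follows the same OpenAI client pattern as the previous sketch.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are reviewing a medical assistant's reply to a patient message.\n"
    "Answer with exactly one word: 'escalate' if the reply tells the user to seek "
    "emergency care now (ER, 911, or equivalent), otherwise 'do_not_escalate'.\n\n"
    "Assistant reply:\n{reply}"
)

def judge_escalation(reply: str, judge_model: str = "gpt-4.1") -> bool:
    """Return True if the LLM judge classifies the reply as an escalation."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("escalate")
```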
Results
Across HealthBench Consensus scenarios, Counsel AI correctly escalated every true emergency while producing fewer false positives than any other model tested. Counsel’s escalation approach is significantly more precise than the other foundation models’, which translates into the highest F1 score across models.

Figure 1. On HealthBench Consensus emergency cases, all models achieved 100% recall, but Counsel AI showed markedly higher precision and the best overall F1 score.
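For completeness, the precision, recall, and F1 figures above follow the standard definitions over binary escalation decisions; a minimal sketch of that computation is shown below (the inputs are placeholders, not our actual results).

```python
def precision_recall_f1(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary escalation decisions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # escalated, and should have
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # escalated, but shouldn't have
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # missed a true emergency
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```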
What we discovered is that foundation models are trained for health safety and tend to “over-escalate.” While that avoids the risk of missing emergencies, it produces a flood of false alarms: ER recommendations that drive unnecessary patient stress, higher costs, and added strain on already overloaded emergency departments.
Counsel AI struck a more appropriate balance. It caught every true emergency, but avoided sending patients to acute care when it wasn’t warranted.
Conclusion
Our analysis shows that Counsel AI not only matches foundation models in safety (recall) but also surpasses them in discernment (precision). In the context of emergency triage, that distinction matters. Fewer false alarms mean less unnecessary utilization and more trust in AI-enabled care.
While this is a strong early result, it is not the final step. By design, HealthBench scenarios are synthetic and at times simplified, and further research is required to accurately benchmark conditional emergencies in the async care setting. The next stage of our research will compare Counsel AI’s escalation performance against human clinicians on real-world conversations, and we also plan to release our own conversational dataset.
At Counsel, we believe AI should not replace clinicians but amplify human judgment in the moments that matter most. HealthBench provides external validation of that vision, showing that when designed for safety and specificity, AI can be trusted, even in emergencies.