AI outdiagnosed ER doctors in a Harvard study — but no one knows who's liable
A peer-reviewed study published in Science found that OpenAI's o1 language model outperformed emergency room physicians on diagnostic accuracy — scoring 67% at triage compared to 55% and 50% for two attending internists. The research was led by Harvard Medical School and Beth Israel Deaconess Medical Center, with Stanford collaborators, and used unprocessed real-world patient records. For anyone who has waited hours in an ER for a working diagnosis, the gap is hard to ignore.
The study
Researchers tested o1 and GPT-4o against 76 real ER cases from Beth Israel. Two internists worked through the same cases simultaneously. Afterward, two separate physicians — blinded to the source — rated the quality of each diagnosis. The AI and the human doctors worked from identical electronic health records: no pre-processing, no shortcuts.
The o1 model matched or beat both physicians at every stage of the diagnostic process. The largest gap appeared at triage, the moment when the least information is available and decisions are most time-pressured. At that point, o1 landed an accurate or near-accurate diagnosis in 67.1% of cases, per the Science paper. The model also showed an edge on rare-disease identification and complex management reasoning, including antibiotic selection and end-of-life care decisions.
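For readers wondering how a headline number like 67.1% falls out of a blinded evaluation, here is a minimal sketch of the tally: each case gets a blinded rater score for each diagnostic stage, and the accuracy figure is simply the share of cases at or above an "accurate or near-accurate" cutoff. The rating scale, field names, and threshold below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: tallying blinded diagnostic ratings into per-stage
# accuracy. The 0-5 rating scale, field names, and threshold are assumptions
# for illustration, not details from the Science paper.
from dataclasses import dataclass

@dataclass
class CaseRating:
    case_id: int
    stage: str        # e.g., "triage", "initial_eval", "admission"
    rater_score: int  # blinded rater's score, assumed 0 (wrong) to 5 (exact match)

def accuracy_by_stage(ratings: list[CaseRating], threshold: int = 4) -> dict[str, float]:
    """Share of cases per stage rated at or above the 'accurate / near-accurate' cutoff."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for r in ratings:
        totals[r.stage] = totals.get(r.stage, 0) + 1
        if r.rater_score >= threshold:
            hits[r.stage] = hits.get(r.stage, 0) + 1
    return {stage: hits.get(stage, 0) / n for stage, n in totals.items()}

# Toy example: 3 triage ratings, 2 counted as accurate -> about 66.7%
sample = [
    CaseRating(1, "triage", 5),
    CaseRating(2, "triage", 4),
    CaseRating(3, "triage", 2),
]
print(accuracy_by_stage(sample))  # {'triage': 0.666...}
```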
The authors are careful to say this does not mean AI is ready to practice medicine independently. They call for prospective clinical trials before any real-world deployment.
The gap no one has filled
The results land in a market that is already moving fast and largely without guardrails. Roughly 20% of US clinicians were already consulting large language models for second opinions, according to a 2025 Elsevier study cited by Harvard Magazine. That number has almost certainly grown since.
The FDA cleared a record 295 AI and machine-learning medical devices in 2025, but roughly 75% of them were radiology imaging tools (computer-aided detection), per the 2025 AI/ML clearances review. Diagnostic reasoning from a language model, the kind o1 demonstrated, has no established regulatory pathway yet.
Liability is the other open question. The study itself does not address who bears responsibility when an AI recommendation diverges from a physician's judgment and something goes wrong. Malpractice frameworks were written for human clinicians. No federal rule currently assigns fault in that scenario.
What comes next
The study is a proof of concept, not a deployment plan. The o1 model worked from text-only records; it has not been tested on imaging, vital signs, or the other multimodal inputs that define most real ER workflows. Controlled prospective trials, the kind that would actually move regulators, have not been announced.
Still, the numbers are real. A model trained on general internet text matched or exceeded specialist physicians on 76 real emergency cases. The clinical question is no longer whether AI can diagnose. The harder question is what happens when it does.