
AI Assists Physicians with Stronger Clinical Reasoning

Physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) reported that ChatGPT-4, an artificial intelligence program engineered to comprehend and generate human-like text, surpassed internal medicine residents and attending physicians at two academic medical centers in processing medical data and exhibiting clinical reasoning.

Image Credit: everything possible/

The study directly compared the reasoning of a large language model (LLM) against human performance, using standards developed to assess physicians. The research was published in JAMA Internal Medicine.

It became clear very early on that LLMs can make diagnoses, but anybody who practices medicine knows there’s a lot more to medicine than that. There are multiple steps behind a diagnosis, so we wanted to evaluate whether LLMs are as good as physicians at doing that kind of clinical reasoning. It’s a surprising finding that these things are capable of showing equivalent or better reasoning than people throughout the evolution of a clinical case.

Adam Rodman MD, Internal Medicine Physician and Investigator, Department of Medicine, Beth Israel Deaconess Medical Center

The revised-IDEA (r-IDEA) score, a previously validated instrument designed to evaluate physicians’ clinical reasoning, was employed by Rodman and colleagues. The researchers recruited 18 residents and 21 attending physicians, each of whom worked through one of 20 carefully chosen clinical cases that included four steps in the diagnostic reasoning process.

The authors instructed physicians to document and provide evidence for each stage of their differential diagnosis. After receiving a prompt with the same instructions, the chatbot GPT-4 processed all 20 clinical cases. After that, several different reasoning metrics were applied to their responses, including the r-IDEA score for clinical reasoning.

The first stage is the triage data when the patient tells you what’s bothering them and you obtain vital signs. The second stage is the system review when you obtain additional information from the patient. The third stage is the physical exam, and the fourth is diagnostic testing and imaging.

Stephanie Cabral MD, Study Lead Author, Beth Israel Deaconess Medical Center

Stephanie Cabral is a third-year internal medicine resident.

The chatbot earned the highest r-IDEA scores, according to research by Rodman, Cabral, and colleagues. The LLM received a median score of 10 out of 10, while attending physicians received a 9 and residents an 8. On diagnostic accuracy (how high the correct diagnosis appeared on each list of possible diagnoses) and correct clinical reasoning, the humans and the bot were closer to a tie.

However, the researchers also discovered that the bot was “just plain wrong,” providing answers that included instances of erroneous reasoning significantly more often than the residents did. The result supports the view that AI will probably be most helpful as a supplement to human reasoning rather than a replacement for it.

Cabral said, “Further studies are needed to determine how LLMs can best be integrated into clinical practice, but even now, they could be useful as a checkpoint, helping us make sure we don’t miss something. My ultimate hope is that AI will improve the patient-physician interaction by reducing some of the inefficiencies we currently have and allow us to focus more on the conversation we’re having with our patients.”

Early studies suggested AI could make diagnoses if all the information was handed to it. What our study shows is that AI demonstrates real reasoning, maybe better reasoning than people, through multiple steps of the process. We have a unique chance to improve the quality and experience of healthcare for patients.

Adam Rodman MD, Internal Medicine Physician and Investigator, Department of Medicine, Beth Israel Deaconess Medical Center

Co-authors included Daniel Restrepo, MD, of Massachusetts General Hospital; Raja-Elie Abdulnour, MD, of Brigham and Women’s Hospital; and Zahir Kanjee, MD, Philip Wilson, MD, and Byron Crowe, MD, of BIDMC.

This study received support from Harvard Catalyst | The Harvard Clinical and Translational Science Center (National Center for Advancing Translational Sciences, National Institutes of Health) (award UM1TR004408), as well as financial contributions from Harvard University and its affiliated academic healthcare centers.
