According to a new study published in JAMA Network Open, doctors who used ChatGPT did not perform better when making a diagnosis compared to doctors who only used conventional resources. However, ChatGPT alone performed significantly better than both groups of doctors.
For the study, 50 physicians, including 26 attendings and 24 residents, were given six case histories selected from a larger set of 105 real cases. These cases have been used by researchers since the 1990s but have never been published, meaning the physicians could not have seen them beforehand and OpenAI's artificial intelligence (AI) chatbot ChatGPT could not have been trained on them.
During the study, the doctors were asked to come up with diagnoses for as many of the six cases as they could within an hour. For each case, they were asked to propose three possible diagnoses with supporting evidence, as well as any findings that argued against each diagnosis or that would have been expected but were not present.
Half of the physicians were randomly assigned to use ChatGPT alongside traditional resources while the other half were only allowed to use traditional resources, like UpToDate, an online system with clinical information. None of the doctors were given any explicit training on using ChatGPT.
Overall, the researchers found that doctors who used ChatGPT and those who didn't performed similarly. Doctors who used ChatGPT had a median score of 76% for making a diagnosis and explaining a reason for it. In comparison, the doctors who only used traditional resources had a median score of 74%.
Surprisingly, ChatGPT on its own outperformed both groups of doctors, receiving a median score of 90% for making a diagnosis and providing a reason for it.
"The chat interface is the killer app," said Jonathan Chen, a physician and computer scientist at Stanford University and one of the study's authors. "We can pop a whole case into the computer. Before a couple years ago, computers did not understand language."
According to the researchers, the study's findings were not what they expected.
"We were all shocked," said Ethan Goh, a postdoctoral fellow at the Stanford Clinical Excellence Research Center and the study's first co-author. "There's a fundamental theorem that AI plus [humans] or computer plus humans should always do better than humans alone."
However, the researchers emphasized that even though ChatGPT performed better than human doctors, this doesn't mean AI should be used to diagnose illnesses without a doctor's oversight. Notably, the significance of the research is limited by the fact that the cases were simulated, even though they were based on real patient data.
"All the information was prepared in a way that doesn't mimic real life," Goh said.
One reason why doctors using ChatGPT did not perform better was that they did not have formal training on how to best use the technology. Many of the doctors in the study treated ChatGPT as a search engine rather than asking it to make a diagnosis on its own and going from there.
"It was only a fraction of the doctors who realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question," Chen said. "Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing."
According to Goh, an explicit curriculum for physicians on how to use AI tools, along with instruction on the technology's potential downsides, could help doctors use chatbots more effectively when making diagnoses.
Another reason the doctors using ChatGPT did not perform better was bias toward their own diagnoses. Even when the chatbot presented new or conflicting information that could point to a different diagnosis, the doctors may have been hesitant to change their minds.
"They didn't listen to A.I. when A.I. told them things they didn't agree with," said Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Center who helped design the study.
Going forward, Goh said factors such as a lack of appropriate training and bias should be studied to determine whether they affect the accuracy of doctors' AI-assisted diagnoses. There are also more questions for physicians to answer after a diagnosis is made, and AI could potentially help with those in the future.
"What are the correct treatment steps to take?" Goh said. "What are the tests and such to order that would help you guide the patient towards what to do next?"
"All the information was prepared in a way that doesn't mimic real life," Goh said.
One reason why doctors using ChatGPT did not perform better was that they did not have formal training on how to best use the technology. Many of the doctors in the study treated ChatGPT as a search engine rather than asking it to make a diagnosis on its own and going from there.
"It was only a fraction of the doctors who realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question," Chen said. "Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing."
According to Goh, explicit curriculum for physicians on how to use AI tools, as well as instructions on the technology's potential downsides, could help doctors use chatbots more effectively when making diagnoses.
Another reason why the doctors using ChatGPT did not perform better is bias toward their diagnoses. Even if the chatbot presented them with new or conflicting information that could lead to a different diagnosis, the doctors may have been hesitant to change their minds.
For more insights in AI and healthcare, check out these Advisory Board resources:
"They didn't listen to A.I. when A.I. told them things they didn't agree with," said Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Center who helped design the study.
Going forward, Goh said those factors, such as a lack of appropriate training or bias, should be studied to determine whether they make a difference in doctors' diagnoses with AI technology. There are also more questions for physicians to answer after a diagnosis, and AI could potentially help in the future.
"What are the correct treatment steps to take?" Goh said. "What are the tests and such to order that would help you guide the patient towards what to do next?"
(Kolata, New York Times, 11/17; Somasundaram, Washington Post, 11/19; Goh et al., JAMA Network Open, 10/28)
By John League and Sarah Roller
Headlines that say "AI defeats doctors" ignore what is really going on in this study and the obstacles to realizing AI's clinical value for physicians.
Yes, physicians struggled to use AI to improve their unassisted diagnostic performance, and the use of AI didn't show any time savings. But those aren't permanent conditions. The looming challenges of physician shortages, data and information deluge, and patient access only amplify the need for better understanding of how physicians can make best use of this technology.
This study is somewhat surprising because it conflicts with previous research on the performance of physicians assisted by AI in creating differential diagnoses. A 2023 study showed that clinicians assisted by a large language model (LLM) adapted for diagnostic reasoning had 51.7% diagnostic accuracy, significantly outperforming clinicians who used standard search engines (44.4%) and unassisted clinicians (36.7%). Clinicians who used the LLM also generated "more comprehensive differential lists than those without its assistance."
Taken together, these studies tell us that, rather than declaring AI superior to physicians' diagnostic capabilities, organizations must better enable physicians to utilize and embrace this evolving technology. We have identified two opportunities for provider organizations, digital health companies, and medical training organizations to better support physicians in maximizing the value of LLMs.
1. Physician training must include training on the use of AI.
One of the biggest keys to the performance of an LLM on any task is the strength of the entry prompt. In this case, the prompt given to the LLM when measuring its standalone diagnostic performance was tailored to increase the chances that the LLM would perform well.
But that's not how the doctors in the study used the LLM. Most physicians in the study treated the LLM like a search function, much as they would when searching Google or UpToDate. Rather than copying and pasting the entire vignette (a best practice that was built into the standalone prompt), physicians prompted the LLM to respond to questions about individual characteristics extracted from each vignette. This limited the value of the LLM to physicians, and likely contributed to the absence of time savings for the group using the LLM.
This underscores the need for physician education on how to best leverage LLM technology — both in prompt design and efficiency. We would never expect someone who didn’t know how to use a tool to create quality work with that tool, much less a tool used in a dynamic and complex clinical setting. The study authors suggest that organizations should "invest in predefined prompting for diagnostic decision support integrated into clinical workflows and documentation, enabling synergy between the tools and clinicians" — and we agree.
Importantly, we see a role for organizations across the healthcare ecosystem in this training. It should begin in medical education but must continue into physician practice, as the technology will continue to evolve. This creates an opportunity for digital health and health IT vendors to partner with provider organizations to better train physicians on their tools.
2. Physicians must adopt a broader mindset.
Even with prompt engineering training, significant adaptive challenges requiring a broader mindset shift remain to get physicians to accept — or even to engage with — recommendations from AI.
Medical education often emphasizes diagnosis as a primary role of physicians — and one that distinguishes the physician from others on the care team. Over time, this creates a cycle where physicians' experience and intuition rise above other factors in making diagnoses. We see this play out in this study. When offered an alternative diagnosis that conflicted with their own diagnosis, physicians "didn't listen to AI when AI told them things they didn't agree with," according to The New York Times.
At the end of the day, embracing assistive technology like AI will require a broader identity shift from physicians around their role and their limitations. Many of the AI applications currently in use and development focus on easing or even eliminating administrative tasks. Those have value, but the ceiling on their ability to help organizations confront workforce and access challenges at scale is relatively low.
There is a much higher ceiling on the potential for AI to augment and improve clinical decision making, especially for physicians. Healthcare approaches to AI should focus on finding the upper limit of that ceiling, both technically and adaptively. And provider organizations must start building the infrastructure now to support their physicians in embracing these shifts.