Daily Briefing

Doctors vs. AI: Who is better at making diagnoses?


According to a new study published in JAMA Network Open, doctors who used ChatGPT did not perform better when making a diagnosis compared to doctors who only used conventional resources. However, ChatGPT alone performed significantly better than both groups of doctors.

Study details and key findings

For the study, 50 physicians, including 26 attendings and 24 residents, were given six case histories selected from a larger set of 105 real cases. These cases have been used by researchers since the 1990s but have never been published, meaning the physicians would not have seen them beforehand and OpenAI's artificial intelligence (AI) chatbot ChatGPT could not have been trained on them.

During the study, the doctors were asked to come up with diagnoses for as many of the six cases as they could within an hour. The doctors were asked for three possible diagnoses, along with supporting evidence for each. They were also asked to provide any findings that do not support each diagnosis or findings that were expected but not present in each case.

Half of the physicians were randomly assigned to use ChatGPT alongside traditional resources while the other half were only allowed to use traditional resources, like UpToDate, an online system with clinical information. None of the doctors were given any explicit training on using ChatGPT.

Overall, the researchers found that doctors who used ChatGPT and those who didn't performed similarly. Doctors who used ChatGPT had a median score of 76% for making a diagnosis and explaining a reason for it. In comparison, the doctors who only used traditional resources had a median score of 74%.

Surprisingly, ChatGPT on its own outperformed both groups of doctors, receiving a median score of 90% for making a diagnosis and providing a reason for it.

"The chat interface is the killer app," said Jonathan Chen, a physician and computer scientist at Stanford University and one of the study's authors. "We can pop a whole case into the computer. Before a couple years ago, computers did not understand language."

Commentary

According to the researchers, the study's findings were not what they expected.

"We were all shocked," said Ethan Goh, a postdoctoral fellow at the Stanford Clinical Excellence Research Center and the study's co-first author. "There's a fundamental theorem that AI plus [humans] or computer plus humans should always do better than humans alone."

However, the researchers emphasized that even though ChatGPT performed better than human doctors, this doesn't mean AI should be used to diagnose illnesses without a doctor's oversight. Notably, the significance of the research is limited by the fact that the cases were simulated, even though they were based on real patient data.

"All the information was prepared in a way that doesn't mimic real life," Goh said.

One reason why doctors using ChatGPT did not perform better was that they did not have formal training on how to best use the technology. Many of the doctors in the study treated ChatGPT as a search engine rather than asking it to make a diagnosis on its own and going from there.

"It was only a fraction of the doctors who realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question," Chen said. "Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing."

According to Goh, explicit curriculum for physicians on how to use AI tools, as well as instructions on the technology's potential downsides, could help doctors use chatbots more effectively when making diagnoses.

Another reason why the doctors using ChatGPT did not perform better is bias toward their own diagnoses. Even if the chatbot presented them with new or conflicting information that could lead to a different diagnosis, the doctors may have been hesitant to change their minds.

"They didn't listen to A.I. when A.I. told them things they didn't agree with," said Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Center who helped design the study.

Going forward, Goh said those factors, such as a lack of appropriate training or bias, should be studied to determine whether they make a difference in doctors' diagnoses with AI technology. There are also more questions for physicians to answer after a diagnosis, and AI could potentially help in the future.

"What are the correct treatment steps to take?" Goh said. "What are the tests and such to order that would help you guide the patient towards what to do next?" 

Advisory Board's AI resources

For more insights on AI and healthcare, check out these Advisory Board resources:

  • This expert insight outlines different strategies for AI adoption, which can help you decide the best approach for your organization. Similarly, this expert insight explains how health system executives currently approach AI in healthcare.
  • This infographic explains how to overcome AI challenges and unlock the technology's full potential, while this field guide outlines how to take a problem-first approach to AI.
  • You can also read our research on the use of AI in different areas of healthcare, including cardiovascular care and imaging.
  • We also have a featured page on AI, which includes research on how to mitigate challenges with the technology, how leaders should approach investments in AI tools, and more.

(Kolata, New York Times, 11/17; Somasundaram, Washington Post, 11/19; Goh et al., JAMA Network Open, 10/28)


Advisory Board's take

2 ways to prepare doctors for an AI-assisted future

By John League and Sarah Roller

Headlines that say "AI defeats doctors" ignore what is really going on in this study and the obstacles to realizing AI's clinical value for physicians.

Yes, physicians struggled to use AI to improve their unassisted diagnostic performance, and the use of AI didn't show any time savings. But those aren't permanent conditions. The looming challenges of physician shortages, data and information deluge, and patient access only amplify the need for better understanding of how physicians can make best use of this technology.

This study is somewhat surprising because it conflicts with previous research on how physicians perform when they use AI to create differential diagnoses. A 2023 study showed that when using a large language model (LLM) adapted for diagnostic reasoning, clinicians assisted by the LLM had 51.7% diagnostic accuracy, significantly outperforming clinicians who used standard search engines (44.4%) and unassisted clinicians (36.7%). Clinicians who used the LLM also generated "more comprehensive differential lists than those without its assistance."

Taken together, these studies tell us that, rather than declaring AI superior to physicians' diagnostic capabilities, organizations must better enable physicians to utilize and embrace this evolving technology. We have identified two opportunities for provider organizations, digital health companies, and medical training organizations to better support physicians in maximizing the value of LLMs.

1. Physician training must include training on the use of AI.

One of the biggest keys to the performance of an LLM on any task is the strength of the entry prompt. In this case, the prompt given to the LLM when measuring its standalone diagnostic performance was tailored to increase the chances that the LLM would perform well.

But that's not how the doctors in the study used the LLM. Most physicians in the study treated the LLM like a search function, much as they would if searching on Google or UpToDate. Rather than copying and pasting the entire vignette (a best practice that was included in the standalone prompt), physicians prompted the LLM to respond to questions about individual characteristics extracted from each vignette. This limited the value of the LLM to physicians, and likely contributed to the lack of time savings between the two physician groups.

This underscores the need for physician education on how to best leverage LLM technology — both in prompt design and efficiency. We would never expect someone who didn’t know how to use a tool to create quality work with that tool, much less a tool used in a dynamic and complex clinical setting. The study authors suggest that organizations should "invest in predefined prompting for diagnostic decision support integrated into clinical workflows and documentation, enabling synergy between the tools and clinicians" — and we agree.
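To make the idea of predefined prompting more concrete, here is a minimal sketch in Python. It is an illustration only, not the prompt the study authors used: the function name build_diagnostic_prompt, the instruction wording, and the sample vignette are all hypothetical. The sketch wraps the full case text in a fixed set of instructions that mirror the task described above (three possible diagnoses, with findings for and against each), rather than leaving the physician to type one-off questions about individual findings.

```python
def build_diagnostic_prompt(vignette: str, n_diagnoses: int = 3) -> str:
    """Wrap a complete case vignette in fixed instructions.

    Hypothetical template for illustration; not the study's actual prompt.
    """
    case_text = vignette.strip()
    if not case_text:
        raise ValueError("vignette must not be empty")

    lines = [
        "You are assisting a physician with a differential diagnosis.",
        f"Read the entire case below, then list the {n_diagnoses} most likely diagnoses.",
        "For each diagnosis, give:",
        "- findings that support it",
        "- findings that argue against it",
        "- findings that were expected but are absent",
        "",
        "CASE:",
        case_text,  # the whole vignette, not selected fragments
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder text standing in for a real case history.
    sample_vignette = "Placeholder case history text goes here."
    print(build_diagnostic_prompt(sample_vignette))
```

The resulting string would then be pasted or sent to whichever chatbot an organization uses. The point of the sketch is the design choice, not the specific wording: the template always carries the whole case and always asks the same structured question, so the quality of the prompt does not depend on each physician's familiarity with the tool.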

Importantly, we see a role for organizations across the healthcare ecosystem in this training. It should begin in medical education but must continue into physician practice, as the technology will continue to evolve. This creates an opportunity for digital health and health IT vendors to partner with provider organizations to better train physicians on their tools.

2. Physicians must adopt a broader mindset.

Even with prompt engineering training, significant adaptive challenges remain. Getting physicians to accept, or even to engage with, recommendations from AI will require a broader mindset shift.

Medical education often emphasizes diagnosis as a primary role of physicians, and one that distinguishes the physician from others on the care team. Over time, this creates a cycle where physicians' experience and intuition rise above other factors in making diagnoses. We see this play out in this study. When offered an alternative diagnosis that conflicted with their own diagnosis, physicians "didn't listen to AI when AI told them things they didn't agree with," according to The New York Times.

At the end of the day, embracing assistive technology like AI will require a broader identity shift from physicians around their role and their limitations. Many of the AI applications currently in use and development focus on easing or even eliminating administrative tasks. Those have value, but the ceiling on their ability to help organizations confront workforce and access challenges at scale is relatively low.

There is a much higher ceiling on the potential for AI to augment and improve clinical decision making, especially for physicians. Healthcare approaches to AI should focus on finding the upper limit of that ceiling, both technically and adaptively. And provider organizations must start building the infrastructure now to support their physicians in embracing these shifts.

