You may have seen headlines about a buzzy new study that found ChatGPT is dreadful at answering medical questions.
Here's how CNN reported the news: "Researchers at Long Island University posed 39 medication-related queries to the free version of the artificial intelligence chatbot. … ChatGPT provided accurate responses to only about 10 of the questions, or about a quarter of the total."
Which sounds bad! But if you've read other studies about ChatGPT's capabilities, it might also sound confusing.
After all, past research has shown ChatGPT matches surgeons' performance in explaining the risks and benefits of surgeries and can even diagnose complicated clinical case studies. So why did ChatGPT perform so impressively in those studies and so poorly in this one?
I'd say three core factors are at play — but in short, different researchers have explored very different AI use cases.
For their new study, Long Island University researchers used the free version of ChatGPT, which runs the GPT-3.5 language model.
That's an updated version of GPT-3, which was released in 2020. In other words, it isn't today's most powerful AI by a long shot.
By contrast, when prior researchers tested ChatGPT's diagnostic abilities, they used GPT-4, a state-of-the-art model available via ChatGPT's $20 monthly subscription. Unsurprisingly, they achieved far more impressive results.
This doesn't mean the Long Island University researchers did anything wrong. As CNBC reports, they "focused on the free version of the chatbot to replicate what more of the general population uses and can access." (In fact, OpenAI has paused new sign-ups for ChatGPT Plus, the paid tier.)
Still, the researchers' findings speak only to the typical patient's experience of AI, not to the limits of AI's capabilities.
Another difference involved how the researchers posed their questions to ChatGPT. As I've written previously, you can significantly improve an AI's responses via more effective prompting — a practice called "prompt engineering."
When past researchers achieved impressive results with ChatGPT, they often used lengthy, sophisticated prompts. For instance, here's part of one study's seven-paragraph prompt:
"A clinicopathological case conference has several unspoken rules. The first is that there is most often a single definitive diagnosis (though rarely there may be more than one), and it is a diagnosis that is known today to exist in humans. The diagnosis is almost always confirmed by some sort of clinical pathology test or anatomic pathology test, though in rare cases when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or very rarely just confirmed by expert opinion."
By comparison, the Long Island University researchers reportedly used "real questions posed to Long Island University's College of Pharmacy drug information service over a 16-month period in 2022 and 2023."
This is a completely defensible choice: Most patients aren't prompt engineers, so real-world users' results would likely resemble those found in this study. Still, better prompts might have achieved better outcomes.
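To make the contrast concrete, here's a minimal sketch of what "bare" versus "engineered" prompting looks like when you call the models programmatically. It uses OpenAI's Python library; the model names, system prompt, and sample question are my own illustrative assumptions, not the prompts used in either study.

```python
# A minimal sketch of "bare" vs. "engineered" prompting via OpenAI's Python SDK.
# The model names, system prompt, and question are illustrative assumptions,
# not the prompts used in either study discussed above.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

question = "Is it safe to take drug A and drug B together?"  # hypothetical question

# 1) The "typical patient" approach: just ask the question.
bare = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the model behind the free ChatGPT at the time
    messages=[{"role": "user", "content": question}],
)

# 2) An "engineered" approach: frame the task, constrain the answer, demand caveats.
system_prompt = (
    "You are a clinical pharmacist answering drug-information questions. "
    "Describe the interaction mechanism if one exists, state your level of "
    "certainty, and say 'I don't know' rather than guessing."
)
engineered = client.chat.completions.create(
    model="gpt-4",  # the paid, more capable model used in the diagnostic studies
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)

print(bare.choices[0].message.content)
print(engineered.choices[0].message.content)
```

The point isn't that this exact prompt would have rescued the study's results; it's that the two calls above are, in effect, two different experiments.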
Finally, the researchers "asked ChatGPT to provide references so that the information provided could be verified," according to a press release.
The results? "References were provided in just eight responses, and each included non-existent references."
Frankly, that's unsurprising. Hallucinations are a known hazard of large language models like ChatGPT, and they're especially commonplace in medical or legal citations (which is why LexisNexis recently made a splash with an AI promising "hallucination-free" citations).
And on this point, I'll gently critique the researchers. Few real-world patients would demand citations from their AI — so asking for references feels like a decision made to highlight AI's shortcomings, rather than to mirror patients' experiences.
So where does this morass of confusing research leave us? For now, we must hold two opposing thoughts at the same time.
If you're a patient, AI isn't ready for primetime. ChatGPT can sometimes offer limited guidance on medical questions, but you shouldn't trust its answers without a ton of extra verification.
If you're a clinician, however, AI is already capable of astounding feats. You shouldn't rely on ChatGPT to diagnose your patients, for many reasons, but nor should you imagine that AI "can't" do (much of) that job.
In the years to come, tools like ChatGPT will only get better. They'll be fine-tuned for medical use, built into EHRs and other tools, and tested in clinical settings. The typical user will start seeing results as good as, or better than, the most sophisticated users achieve today.
Healthcare's AI era isn't here yet … but it isn't far away.
Google's new, industry-leading AI model is here … sort of. Last week's big AI news was Google's release of Gemini, a new AI model that, in its most powerful form, beats GPT-4 on many benchmarks. That said, a few caveats: (1) Google hasn't actually released that most sophisticated version, called Gemini Ultra, to the public; (2) Google's impressive-looking launch video has come under fire for being so heavily edited as to be, in critics' eyes, misleading; and (3) although Gemini Ultra technically beats GPT-4, its margin of victory is tiny, which may weakly suggest that the current generation of AI models is hitting a capability plateau.
What is OpenAI's rumored Q* breakthrough? If you followed the news of CEO Sam Altman's departure from, and return to, OpenAI, you may have heard rumblings that his firing stemmed from worries about a machine-learning breakthrough called "Q*". Subsequent reporting has suggested that isn't true, but nonetheless, Q* appears to be a real area of research. Writing on his blog Understanding AI, journalist Timothy B. Lee dives into what Q* (probably) is and why it matters.
'An opinionated guide to which AI to use.' Ethan Mollick of The Wharton School has updated his cheat sheet on which generative AI to use for day-to-day work tasks. The shortest version is that, if you use just one AI tool, GPT-4 (available via the paid ChatGPT Plus or the free Bing search engine) is your best bet. But for certain niche use cases, such as generating images or understanding long texts, other tools may be better.
If ChatGPT isn't working for you, maybe offer it a tip? A Twitter user found that ChatGPT provided much more detailed responses when promised a $20 tip for a good answer (and even longer responses for a $200 tip!). Needless to say, ChatGPT doesn't actually accept tips … and personally, I'd feel morally icky about promising a semi-intelligent agent a tip that I can't deliver. Still, the fact that this tactic works at all helps to illuminate, yet again, the deep weirdness of current AI models.
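If you're curious what that looks like mechanically, it's nothing fancier than appending the promise to the prompt. Here's a rough sketch using OpenAI's Python library; the wording is my paraphrase, not the original experiment's exact prompt.

```python
# A rough sketch of the "tipping" prompt tactic described above.
# The wording is a paraphrase, not the original Twitter experiment's exact prompt.
from openai import OpenAI

client = OpenAI()

question = "Explain how beta blockers work."  # any question will do

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": question + " I'll tip $200 for a perfect answer!",
    }],
)
print(response.choices[0].message.content)
```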
AI is a powerful tool that can be used to enhance patient care, reduce costs, and improve outcomes. But it's important to remember that AI is not a magic bullet. Get three key takeaways from Advisory Board's recent webinar on how healthcare organizations should approach AI adoption and prepare for the challenges that come with it.