Artificial intelligence (AI) has advanced tremendously in recent months, with some research finding that it can create clinical notes on par with those written by medical residents. However, researchers say that healthcare leaders should remain cautious about using AI for medical care since it can still produce problematic and biased results.
In a new preprint study, researchers fed several case studies from the New England Journal of Medicine (NEJM) Healer tool into the generative AI model GPT-4 and asked it to provide a list of potential diagnoses and treatment recommendations for each scenario.
The case studies covered a range of patient symptoms, including chest pain, difficulty breathing, and sore throat. For each case, the researchers changed the patient's gender and race to see how GPT-4 would adjust its output.
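The article does not reproduce the study's actual prompts, case text, or model settings, but the general shape of this kind of demographic-perturbation audit can be sketched. The template wording, demographic labels, and case details below are hypothetical, illustrative only, and assume the OpenAI Python client (openai>=1.0):

```python
# Illustrative sketch only: the study's real prompts, cases, and parameters
# are not given in the article. All names and wording here are hypothetical.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CASE_TEMPLATE = (
    "A {age}-year-old {race} {gender} presents with {symptoms}. "
    "List the most likely diagnoses in order of probability, "
    "then recommend the next diagnostic steps."
)

races = ["white", "Black", "Hispanic", "Asian"]
genders = ["man", "woman"]
case = {"age": 34, "symptoms": "sore throat, fever, and fatigue"}

results = {}
for race, gender in product(races, genders):
    # Hold the clinical facts constant and vary only race and gender.
    prompt = CASE_TEMPLATE.format(race=race, gender=gender, **case)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling noise so outputs are easier to compare
    )
    # Keep the ranked differential for this demographic variant so the
    # lists can be compared across race/gender combinations afterward.
    results[(race, gender)] = response.choices[0].message.content
```

Comparing the stored outputs across demographic variants is what surfaces differences in how the model ranks diagnoses or recommends workups.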
Overall, GPT-4's answers did not differ significantly between groups, but the model did rank possible diagnoses differently depending on the patient's gender or race.
For example, when GPT-4 was told that a female patient had shortness of breath, it ranked panic and anxiety disorder higher on its list of potential diagnoses, which the researchers say reflects known biases in the clinical literature used to train the model.
In addition, when GPT-4 was presented with a patient who had a sore throat, it made the correct diagnosis (mononucleosis) 100% of the time when the patient was white, but only 86% of the time for Black men, 73% for Hispanic men, and 74% for Asian men.
Treatment suggestions also varied by race and gender. Across the 10 emergency department (ED) cases presented to the model, it was significantly less likely to recommend a CT scan for Black patients, and less likely to rate cardiovascular stress tests and angiography as highly important for women than for men.
In addition, while some variation in a list of potential diagnoses is expected, and even desirable, the researchers found that the AI often overestimated the real-world prevalence of certain diseases, a skew that could be amplified if the model's output is used for medical training or in clinical practice.
For example, when the researchers asked GPT-4 to generate clinical vignettes of a patient with sarcoidosis, the model described a Black woman 98% of the time.
"Sarcoidosis is more prevalent both in African Americans and in women," said Emily Alsentzer, a postdoctoral fellow at Brigham and Women's Hospital and Harvard Medical School and one of the study's authors, "but it's certainly not 98% of all patients."
Adam Rodman, co-director of the iMED Initiative at Beth Israel Deaconess Medical Center, said that because GPT-4 was trained on human communication, it "shows the same — or maybe even more exaggerated — racial and sex biases as humans."
"Despite years of training these things to be less terrible, they still reflect many of these more subtle biases," he added. "It still reflects the biases of its training data, which is concerning given what people are using GPT for right now."
And if physicians using GPT-4 do not check for these subtle biases, "it's hard to know whether there might be systemic biases in the response that you give to one patient or another," Alsentzer said, or whether the model could exacerbate existing health disparities.
Although these types of biases are not surprising to AI researchers, Rodman said "it's really, really concerning" to him. "Things are moving quickly, and doctors need to get on top of this," he added.
"Medical students are using GPT-4 to learn right now," Rodman said, which means they could easily reflect or exaggerate existing biases shown to them by the model. "How are they going to second-guess an LLM [large language model] if they use an LLM to train their own brains?"
In general, the researchers say that GPT-4 and similar AI models will need to improve significantly before they can be applied to patient care management. Safeguards will also likely need to be built into the technology before it is used for clinical decision-making.
"No one should be relying on it to make a medical decision at this point," Rodman said. "I hope it hammers home the point that doctors should not be relying on GPT-4 to make management decisions." (Palmer, STAT+ [subscription required], 7/18; Zack et al., medRxiv, 7/17)