ChatGPT was able to generate clinical notes on par with those written by senior internal medicine residents, leaving reviewers often unable to distinguish between the AI- and human-generated notes, according to a new study published in JAMA Internal Medicine.
For the study, researchers from Stanford University asked 30 internal medicine attending physicians to blindly evaluate five sets of history of present illness notes, four of which were written by senior residents and one of which was generated by ChatGPT, powered by the Jan. 9, 2023, release of GPT-3.5. The attending physicians were asked to grade the notes on their level of detail, succinctness, and organization.
The researchers used a prompt engineering method to generate the ChatGPT notes: they input a transcript of a patient-provider interaction, analyzed the output for errors, and used those errors to modify the prompt. The process was repeated twice to ensure the AI produced accurate notes for the final review.
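The iterative loop described above (draft a note, review it for errors, fold the corrections back into the prompt, repeat) can be sketched in a few lines. This is only an illustration of the workflow, not the study's actual code: `draft_note` and `find_errors` are hypothetical stand-ins for the chatbot call and the human error review.

```python
def draft_note(prompt: str, transcript: str) -> str:
    """Hypothetical stand-in for a ChatGPT call that drafts a note."""
    return f"{prompt}\n---\n{transcript}"

def find_errors(note: str) -> list[str]:
    """Hypothetical stand-in for the human review step: list issues found."""
    issues = []
    if "medication list" not in note:
        issues.append("Include the full medication list.")
    if "chief complaint" not in note:
        issues.append("State the chief complaint first.")
    return issues

def refine_prompt(transcript: str, base_prompt: str, rounds: int = 2) -> str:
    """Repeat draft -> review -> amend prompt, mirroring the study's two passes."""
    prompt = base_prompt
    for _ in range(rounds):
        note = draft_note(prompt, transcript)
        errors = find_errors(note)
        if not errors:
            break  # the draft passed review; stop refining
        # Fold each reviewer correction back into the prompt for the next pass.
        prompt += "\n" + "\n".join(errors)
    return prompt

final_prompt = refine_prompt(
    transcript="Patient reports three days of chest pain...",
    base_prompt="Write a history of present illness from this transcript.",
)
print(final_prompt)
```

In this toy run the first pass surfaces two corrections, which are appended to the prompt; the second pass finds no remaining issues and stops.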
The grades given to the clinical notes written by senior residents and ChatGPT differed by less than one point on a 15-point scale, the researchers found. However, notes written by senior residents received higher average grades for level of detail.
The attending physician reviewers were able to correctly determine which notes were written by ChatGPT and which were written by senior residents with just 61% accuracy.
Ashwin Nayak, lead author on the study, said the fact that ChatGPT "seem[s] to be advanced enough to draft clinical notes at a level that we would want as a clinician reviewing the charts and interpreting the clinical situation … is pretty exciting because it opens up a whole lot of doors for ways to automate some of the more menial tasks and the documentation tasks that clinicians don't love to do."
Even though there is a need for prompt engineering, Nayak emphasized the potential of utilizing AI chatbots like ChatGPT for clinical documentation.
"For lots of clinical notes, we don't need things to be perfect. We need them to be above some sort of threshold," he said. "And it seems like, in this synthetic situation, it seemed to do the job."
Nayak added that this study was conducted using GPT-3.5, which means the outcomes would likely be different with GPT-4, which was released on March 13.
"I have no doubt that if this experiment was repeated with GPT-4 the results would be even more significant," he said. "I think the notes would probably be equivalent or maybe even trending towards better on the GPT-4 side. I think physician assessment of whether a note was written by AI or human would be even worse."
However, Nayak cautioned that more research and testing are needed. "More work is needed with real patient data," he said. "More work is needed with different types of notes, different aspects of the note. We just focused on the history of present illness, which is just one section of the note."
In an accompanying editorial, Eric Ward, from the University of California, San Francisco, and Cary Gross, from Yale University, wrote that a new era is unfolding within healthcare with AI innovation and that there is a critical need for evidence-based research regarding the implementation of AI into clinical practice.
"A failure to appreciate the unique aspects of the technology could lead to incorrect or unreproducible evaluations of its performance and premature dissemination into clinical care," they wrote. "The scientific community has embraced this challenge, and health care professionals, educational institutions, and research funders should devote attention and resources to ensuring these tools are used ethically and appropriately."
Ward and Gross also emphasized that studies like this one are necessary to understand how and when AI can be used within medicine.
Alongside the study and editorial, JAMA Internal Medicine also published a research letter on AI performance in healthcare education: a study finding that the GPT-4 version of ChatGPT outperformed first- and second-year medical students at Stanford on clinical reasoning exams.
"Given the abilities of general-purpose chatbot AI systems, medicine should incorporate AI-related topics into clinical training and continuing medical education," the researchers concluded. "As the medical community had to learn online resources and electronic medical records, the next challenge is learning judicious use of generative AI to improve patient care." (DePeau-Wilson, MedPage Today, 7/17; Bruce, Becker's Health IT, 7/17)