Two AI chatbots passed the USMLE, doctors react
Key Takeaways
Two artificial intelligence (AI) programs reportedly passed all steps of the USMLE, but only with scores of 60% and 67%, leaving considerable room for error.
Despite its accuracy issues, AI may be helpful in some applications, potentially saving medical practices time and money. But concerns remain about its potential for bias and its lack of empathy.
Ongoing development may yield improved programs that can better assist healthcare practices over time. Even so, clinicians may continue to exercise caution in adopting them.
It sounds like something out of a science fiction movie, but it really happened: two artificial intelligence (AI) programs, ChatGPT and Flan-PaLM, passed all three steps of the United States Medical Licensing Examination (USMLE), the exams that must be completed to enter a residency program.
Physicians shouldn’t worry about being replaced by AI anytime soon, however; AI's just-passing scores point to the technology’s potential flaws. But AI may be helpful to clinicians in certain aspects of medicine, and the programs could improve over time.
How did the AIs score?
AI is not new to clinical medicine. It has been used to help diagnose and treat conditions such as Parkinson disease, streamline notetaking in electronic health records, generate reminders for patient appointments, prescription medications, and vaccination schedules, and draft letters appealing insurance denials.
Taking on intellectual challenges such as passing physician licensing exams may be the next frontier for AI programs.
Two studies highlighted the differences between the two AI programs that passed all three steps of the USMLE.
The first study, posted to the preprint server medRxiv in December 2022, investigated ChatGPT's performance on the USMLE; the model received no special training or reinforcement before taking the exams.[] ChatGPT performed at above 50% accuracy across all of the exams and answered 60% of the questions correctly overall.
The second paper, posted to the preprint server arXiv the same month, evaluated the performance of another large language model, Flan-PaLM, on USMLE questions.[] The key difference between the two models was that Flan-PaLM was heavily modified in preparation for the exams, using MultiMedQA, a collection of medical question-answering datasets.
Flan-PaLM achieved more than 67% accuracy on the USMLE questions.
Doctors' opinions
USMLE scores are one of the main deciding factors residency programs use to choose candidates. Answering 60% of the 280 questions on the exam correctly is a passing score. So how impressive were the AI programs’ USMLE scores? Not very, according to some doctors.
“One of the differences between human performance and AI performance may be in asking the right questions and/or observations,” stated MDLinx medical contributor Scott Cunningham, MD, PhD. “USMLE questions provide all of the data; the test-taker needs to connect the dots.”
"A 67% on USMLE is nothing to brag about. Would you be OK if your doctor got the diagnosis two out of three times?"
— Scott Cunningham, MD, PhD, MDLinx
Potential pitfalls of AI use in medicine
Beyond AI's potential for error, physicians are concerned that these services may be biased or could compromise data privacy and security, according to a HealthITanalytics.com article.[]
Such considerations are especially important in clinical practice, where mistakes could endanger lives.
One thing that AI cannot learn is empathy, according to an article published by AI & Society.[] For example, it can’t express sympathy when diagnosing a patient with a terminal disease.
In addition, a study published by BMJ Quality & Safety cited concerns that an automated system may find ways to “game” outcomes so it can achieve consistently positive results.[]
Ways AI could help
AI may still have its uses in medical facilities. These programs could save healthcare professionals time by taking over some repetitive, time-consuming tasks.
ChatGPT can produce long-form writing from human prompts and, as a result, has been used to draft letters to medical insurers seeking approval of claims.
Other types of AI software have been used to streamline the entry of clinical notes into the EMR and to communicate with patients, notifying them of upcoming appointments, lab values, and test results.
The bottom line
AI offers potential benefits for medical practitioners, but research indicates it has accuracy flaws, as evidenced by its just-passing performance on the USMLE. Privacy and bias are also concerns, prompting some clinicians to approach these programs cautiously.
As is often the case with emerging technologies, AI may improve over time.
While these programs will never replace people in providing empathetic, well-reasoned care to patients, they could continue to evolve, improving efficiency for medical practices.
What this means for you
AI may be beneficial in some aspects of medicine, helping reduce time constraints and save money. Still, the technology falls short in reliability, accuracy, and the ability to provide compassionate care. As AI moves deeper into clinical medicine, physicians should proceed cautiously and judge for themselves when and where it is helpful in clinical practice.