A new US study suggests that artificial intelligence (AI) can outperform human clinicians at probabilistic diagnostic reasoning.
The research pitted OpenAI’s GPT-4 against 553 medical practitioners across a series of five cases with scientific reference standards.
According to the study, large language models (LLMs) can convincingly solve difficult diagnostic cases, pass licensing examinations and communicate empathetically with patients, suggesting that they have an emergent understanding of clinical reasoning.
Notably, the AI made smaller errors than the clinicians when estimating pre-test probability, and when estimating post-test probability after negative test results.
However, the LLM’s post-test estimates were less accurate after positive test results.
“#AI are better than we are at guessing a negative diagnosis” https://t.co/4NXM8gXfJl pic.twitter.com/sSEU1vr2hx — Australian Science Media Centre (@AusSMC) December 12, 2023
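For readers unfamiliar with the terms, the pre-test probability is a clinician’s estimate of how likely a disease is before a test; the post-test probability updates that estimate with the test result via Bayes’ theorem. The short Python sketch below shows that arithmetic; the sensitivity, specificity, and pre-test values are purely illustrative and not taken from the study.

def post_test_probability(pre_test_prob, sensitivity, specificity, positive_result):
    # Convert probability to odds, apply the test's likelihood ratio, convert back.
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    if positive_result:
        lr = sensitivity / (1 - specificity)   # positive likelihood ratio (LR+)
    else:
        lr = (1 - sensitivity) / specificity   # negative likelihood ratio (LR-)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical test: 90% sensitivity, 80% specificity, 20% pre-test probability.
print(post_test_probability(0.20, 0.90, 0.80, positive_result=True))   # ~0.53
print(post_test_probability(0.20, 0.90, 0.80, positive_result=False))  # ~0.03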
Limitations
Study limitations include the use of a simple input-output prompt design strategy; other approaches may yield better results and deserve investigation.
The cases were also kept deliberately simple so that clear reference standards existed.
Future research will need to investigate LLM performance in more complex cases.
It is not clear why the LLM’s post-test probability estimates were weaker after a positive result.
However, even if imperfect, probabilistic recommendations from LLMs might improve human diagnostic performance through collective intelligence, especially if AI diagnostic aids can combine probabilistic, narrative, and heuristic approaches to diagnosis.
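As a toy illustration of that collective-intelligence idea (our assumption, not the study’s method), one simple rule pools independent probability estimates by averaging them in log-odds space:

import math

def pool_log_odds(probs):
    # Average the estimates in log-odds space, then map back to a probability.
    logits = [math.log(p / (1 - p)) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean_logit))

# Hypothetical example: two clinicians and an LLM estimate the same post-test probability.
print(round(pool_log_odds([0.40, 0.25, 0.10]), 3))   # 0.226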