A recent study posted to the arXiv preprint* server discusses the optimization of large language models (LLMs) for accurate differential diagnosis (DDx).
Study: Towards Accurate Differential Diagnosis with Large Language Models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Background
Accurate diagnosis is the first step in effective medical care. Artificial intelligence (AI)-based models are increasingly seen as tools that could help clinicians diagnose disease accurately.
The real-world diagnostic process is interactive and iterative and involves reasoning about a DDx. A physician weighs different diagnostic possibilities based on varied clinical information, including results obtained from advanced diagnostic procedures.
Deep learning has been applied to the generation of DDx in ophthalmology, dermatology, and radiology. Due to the absence of interactive capabilities, deep learning models cannot assist patients with diagnosis through fluent communication in their native language. This interactive shortcoming can be overcome with the development of LLMs, which can be used to design effective tools for DDx.
LLMs are trained on massive amounts of text, which helps them summarize, recognize, predict, and generate new text. These models can perform complex language comprehension and reasoning tasks.
General-purpose LLMs such as GPT-4 and medical domain-specialized LLMs such as Med-PaLM 2 have performed well on multiple-choice medical questions. However, evaluating LLMs against real-world care delivery scenarios remains a challenge.
It is not well understood how these models can actively assist clinicians in developing a DDx, although recent studies have shown that they can perform complex reasoning on individual cases.
About the study
The current study investigated whether an LLM optimized for clinical diagnostic reasoning can generate a DDx for real-world medical cases. Unlike previous work, the study paired this LLM with an interactive interface and assessed whether it can assist clinicians in generating a DDx.
A set of challenging real-world cases was obtained from the New England Journal of Medicine (NEJM). The study compared clinicians' ability to develop a DDx for these cases when assisted by the newly optimized LLM with their ability when using traditional information retrieval tools, such as books and internet search engines.
A total of twenty United States board-certified clinicians with a median experience of nine years analyzed the case reports. An automated approach was also used to compare the performance of the newly developed LLM with that of a baseline LLM, GPT-4.
Study findings
The optimized LLM performed well both in generating DDx lists that included the correct diagnosis and in identifying the final diagnosis accurately. Compared to GPT-4, the previous state-of-the-art model, the newly developed LLM produced DDx lists of better quality and accuracy. Judged by the quality of their DDx lists, clinicians' diagnostic capacity also improved when they used the new LLM.
The study used semi-structured qualitative interviews to gather information from clinicians on their experience of using the tool. The interviews also covered the risks associated with LLMs in medical diagnosis and the clinicians' views on how the tool could be used in the differential diagnosis process.
These interviews indicated that LLMs could improve the diversity of DDx lists. They also highlighted the potential to speed up the generation of a comprehensive DDx for challenging cases.
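One simple way to quantify whether a DDx list includes the correct diagnosis is top-k accuracy: the fraction of cases in which the final diagnosis appears among the first k entries of the list. The Python sketch below is illustrative only and is not the study's evaluation code. It assumes exact matching of diagnosis names after basic normalization, whereas a real evaluation would need clinical judgment to decide whether two diagnoses are equivalent. The function names and the toy cases are hypothetical.

```python
from typing import Sequence


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def in_top_k(ddx: Sequence[str], final_diagnosis: str, k: int) -> bool:
    """Return True if the final diagnosis is among the first k DDx entries."""
    target = normalize(final_diagnosis)
    return any(normalize(d) == target for d in ddx[:k])


def top_k_accuracy(cases: Sequence[tuple[Sequence[str], str]], k: int) -> float:
    """Fraction of cases whose final diagnosis is in the top k of the DDx list."""
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(in_top_k(ddx, final, k) for ddx, final in cases)
    return hits / len(cases)


# Hypothetical toy data: (ranked DDx list, confirmed final diagnosis).
toy_cases = [
    (["Sarcoidosis", "Tuberculosis", "Lymphoma"], "tuberculosis"),
    (["Migraine", "Tension headache"], "Subarachnoid hemorrhage"),
]

print(top_k_accuracy(toy_cases, k=1))
print(top_k_accuracy(toy_cases, k=3))
```

With these toy cases, no list ranks the final diagnosis first, and one of the two lists contains it within the top three, so a larger k gives a more lenient score.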
The study findings align with previous studies that evaluated the performance of LLMs and a pre-LLM “DDx generator” using smaller subsets of the NEJM Clinicopathological Conference (CPC) cases. These studies indicated the potential of automated technology to generate a correct DDx in challenging cases.
The newly developed LLM can generate DDx lists with a higher degree of appropriateness and comprehensiveness than those produced by physicians. Based on the NEJM CPC data, the LLM can also provide a greater number of relevant diagnoses, with higher accuracy, than clinicians' assessments.
Conclusions
The newly developed LLM generated DDx lists that could play an important role in clinical case management. Nevertheless, future research is needed to explore how LLMs could enhance clinicians’ DDx in scenarios with varying risks and specificity and to validate the LLM’s suitability in clinical settings.
Journal reference:
- Preliminary scientific report.
McDuff, D., Schaekermann, M., Tu, T., et al. (2023) Towards Accurate Differential Diagnosis with Large Language Models. arXiv. doi:10.48550/arXiv.2312.00164