Comparison of CT referral justification using clinical decision support and large language models in a large European cohort

Mor Saban*, Yaniv Alon, Osnat Luxenburg, Clara Singer, Monika Hierath, Alexandra Karoussou Schreiner, Boris Brkljačić, Jacob Sosna

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Ensuring appropriate use of CT scans is critical for patient safety and resource optimization. Decision support tools and artificial intelligence (AI), such as large language models (LLMs), have the potential to improve CT referral justification, yet require rigorous evaluation against established standards and expert assessments.

Aim: To evaluate the performance of LLMs (Generative Pre-trained Transformer 4 (GPT-4) and Claude-3 Haiku) and independent experts in justifying CT referrals, using the ESR iGuide clinical decision support system as the reference standard.

Methods: CT referral data from 6356 patients were retrospectively analyzed. Recommendations were generated by the ESR iGuide, LLMs, and independent experts, and evaluated for accuracy, precision, recall, F1 score, and Cohen’s kappa across medical test, organ, and contrast predictions. Statistical analysis included demographic stratification, confidence intervals, and p-values to ensure robust comparisons.

Results: Independent experts achieved the highest accuracy (92.4%) for medical test justification, surpassing GPT-4 (88.8%) and Claude-3 Haiku (85.2%). For organ predictions, LLMs performed comparably to experts, achieving accuracies of 75.3–77.8% versus 82.6%. For contrast predictions, GPT-4 showed the highest accuracy (57.4%) among models, while Claude-3 Haiku demonstrated poor agreement with guidelines (kappa = 0.006).

Conclusion: Independent experts remain the most reliable, but LLMs show potential for optimization, particularly in organ prediction. A hybrid human-AI approach could enhance CT referral appropriateness and utilization. Further research should focus on improving LLM performance and exploring their integration into clinical workflows.

Key Points:

Question: Can GPT-4 and Claude-3 Haiku justify CT referrals as accurately as independent experts, using the ESR iGuide as the gold standard?

Findings: Independent experts outperformed large language models in test justification. GPT-4 and Claude-3 Haiku showed comparable organ prediction but struggled with contrast selection, limiting full automation.

Clinical relevance: While independent experts remain most reliable, integrating AI with expert oversight may improve CT referral appropriateness, optimizing resource allocation and enhancing clinical decision-making.
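
The abstract reports both accuracy and Cohen’s kappa, and the two can diverge: kappa corrects the observed agreement for the agreement expected by chance, so a rater can match the reference fairly often and still have a kappa near zero. The following is a minimal sketch of the standard textbook formulas, not the authors' analysis code; the function names and the ten toy labels are hypothetical and are not study data.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of cases where the prediction matches the reference label."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (the accuracy) and p_e is the agreement
    expected by chance, computed from each rater's label frequencies.
    """
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: sum over labels of the product of marginal proportions.
    p_e = sum(ref_counts[label] * pred_counts[label] for label in ref_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels (NOT study data): the reference standard calls for
# contrast in 6 of 10 referrals, while the rater always answers "contrast".
reference = ["contrast"] * 6 + ["no_contrast"] * 4
predicted = ["contrast"] * 10

toy_accuracy = accuracy(reference, predicted)
toy_kappa = cohen_kappa(reference, predicted)
print(f"accuracy = {toy_accuracy:.2f}, kappa = {toy_kappa:.3f}")
```

In this toy case the rater agrees with the reference on 6 of 10 referrals, yet all of that agreement is what chance alone would produce given the label frequencies, so kappa is zero. This is one way a model can pair a moderate accuracy with a very low kappa; whether it explains the contrast result reported above cannot be determined from the abstract.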

Original language: English
Article number: 104779
Journal: European Radiology
State: Accepted/In press - 2025

Bibliographical note

Publisher Copyright:
© The Author(s) 2025.

Keywords

  • Artificial intelligence
  • Decision support systems (clinical)
  • Guideline adherence
  • Referral and consultation
  • Tomography (X-ray computed)
