Clinical LLM apps need retrieval-grounded evaluation, not just AUC

Same MPhil exploration period that gave me the Clinical Imaging letter — reading widely across clinical AI, evaluation methodology, and infectious-disease epidemiology. JMIR AI published a study using fine-tuned LLaMA2 / Flan-T5 for pediatric COVID-19 severity risk assessment, deployed as a conversational app. The evaluation reported AUC.

AUC tells you how well the classifier ranks higher-risk cases above lower-risk ones on held-out data. It is a measure of discrimination, not calibration, and it does not tell you whether the natural-language explanations the deployed app produces are grounded in actual clinical guidance. It reflects only whether the underlying yes/no token probabilities discriminate on the dataset.
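A minimal sketch of what AUC is and is not sensitive to, using toy numbers I made up for illustration (not data from the study). AUC depends only on how the scores rank positives against negatives, so distorting the probabilities with any strictly increasing function leaves it unchanged, while a score that looks at the probability values themselves (here the Brier score) moves.

```python
def auc(labels, scores):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, scores):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Toy labels and predicted probabilities (illustrative only).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# A strictly increasing distortion: same ranking, different probability values.
distorted = [p ** 3 for p in probs]

print("AUC   original / distorted:", auc(labels, probs), auc(labels, distorted))
print("Brier original / distorted:", brier(labels, probs), brier(labels, distorted))
```

The two AUC values are identical while the Brier scores differ, which is why a reported AUC on its own cannot vouch for the probabilities, let alone for the text built around them.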

That gap matters. The deployed system is a chat interface giving caregivers fluent risk explanations. Fluent and high-AUC is not the same as grounded: a model can confidently invent a CDC recommendation that doesn’t exist.

I wrote a methodological letter for JMIR AI arguing for a parallel evaluation track. Same pipeline, evaluated twice:

  1. LLM-only, current setup. Score on AUC, accuracy, calibration.
  2. Retrieval-grounded against a fixed clinical corpus (CDC pediatric COVID guidance + WHO + IDSA). Score on citation faithfulness, evidence-grounded correctness, and subgroup robustness.
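To make the second track concrete, here is a rough sketch of one way citation faithfulness could be scored. Everything in it is an assumption for illustration: the `CORPUS` entries are synthetic placeholder strings (not real CDC, WHO, or IDSA text), the `Claim` structure and passage IDs are hypothetical, and the token-overlap check is a crude stand-in for whatever entailment or human judgment a real evaluation would use.

```python
from dataclasses import dataclass

# Hypothetical fixed corpus: passage ID -> text. Synthetic placeholders only.
CORPUS = {
    "doc-a#1": "synthetic placeholder passage about monitoring fever at home",
    "doc-b#4": "synthetic placeholder passage about when to seek urgent care",
}


@dataclass
class Claim:
    text: str             # one statement extracted from the app's explanation
    cited_ids: list[str]  # passage IDs the app cited for that statement


def _tokens(text):
    """Lowercased words longer than three characters, punctuation stripped."""
    return {w.strip(".,;:").lower() for w in text.split() if len(w) > 3}


def is_supported(claim, corpus, min_overlap=0.5):
    """True if any cited passage exists in the corpus and overlaps the claim enough."""
    claim_tokens = _tokens(claim.text)
    if not claim_tokens:
        return False
    for pid in claim.cited_ids:
        passage = corpus.get(pid)
        if passage is None:
            continue  # citation points outside the fixed corpus: treat as invented
        overlap = len(claim_tokens & _tokens(passage)) / len(claim_tokens)
        if overlap >= min_overlap:
            return True
    return False


def citation_faithfulness(claims, corpus):
    """Fraction of claims backed by at least one real, overlapping cited passage."""
    if not claims:
        return 0.0
    return sum(is_supported(c, corpus) for c in claims) / len(claims)


claims = [
    Claim("monitoring fever at home", ["doc-a#1"]),  # cited passage exists and overlaps
    Claim("seek urgent care", ["doc-z#9"]),          # cites a passage not in the corpus
    Claim("monitoring fever at home", []),           # no citation at all
]
score = citation_faithfulness(claims, CORPUS)
print(f"citation faithfulness: {score:.2f}")
```

Only the first claim counts as supported here, so the score is one in three. The point of the fixed corpus is that an invented citation fails the lookup outright instead of being judged on how plausible it sounds.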

This is an architectural shift, not a tuning refinement. Retrieval-augmented generation (RAG) is not just a deployment pattern; it is an evaluation substrate.

Letter in JMIR AI, 2026: link.



