arxiv.org web signal

FaithMed trains LLMs to appraise medical evidence step by step

TL;DR

  • FaithMed uses clinician-designed, automatically refined rubrics to supervise how LLMs appraise evidence during medical reasoning.
  • Across seven medical benchmarks, the paper reports +9% over agentic-search baselines and +5.8% over outcome-only RL.
  • Evidence-based medicine rubric scores rose 15.5% against agentic-search Qwen3 baselines, per the abstract.

Medical LLMs are mostly graded on whether their final answer matches the answer key. A new arXiv preprint from Zhiyun Zhang, Liwen Sun, Xiang Qian and Chenyan Xiong argues that this framing falls short for medicine, where the reasoning behind a recommendation is part of the recommendation. They introduce FaithMed, a framework that supervises how a model appraises evidence at each step, not just what it concludes.

The method combines clinician-designed rubrics, which are then automatically refined, with reinforcement learning that assigns rewards at the process level rather than only at the outcome. The abstract frames the gap plainly: current medical LLMs either lack active access to evidence or use retrieved evidence 'without supervising how it should be appraised and applied during reasoning.' On top of the step-level rewards, FaithMed layers what the authors call an advantage grouping strategy.
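
To make the process-versus-outcome distinction concrete, here is a minimal Python sketch of a rubric-scored, step-level reward. It is an illustration under stated assumptions, not the authors' implementation: the rubric items, their weights, the keyword checks, and the 50/50 blend of process and outcome reward are all hypothetical, and the abstract does not say how FaithMed scores steps or how its advantage grouping works.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One hypothetical rubric criterion applied to a single reasoning step."""
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the step satisfies the criterion


def step_reward(step: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric criteria that this step satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(step))
    return earned / total


def trajectory_reward(
    steps: List[str],
    rubric: List[RubricItem],
    answer_correct: bool,
    process_weight: float = 0.5,  # assumed blend, not taken from the paper
) -> float:
    """Blend the mean step-level rubric score with the final-answer outcome."""
    outcome = 1.0 if answer_correct else 0.0
    if not steps:
        return (1.0 - process_weight) * outcome
    process = sum(step_reward(s, rubric) for s in steps) / len(steps)
    return process_weight * process + (1.0 - process_weight) * outcome


# Toy rubric: keyword checks stand in for a real clinician-designed rubric.
toy_rubric = [
    RubricItem("cites_source", 1.0, lambda s: "source:" in s.lower()),
    RubricItem("names_study_design", 1.0, lambda s: "randomized" in s.lower()),
]

toy_steps = [
    "Source: a randomized trial supports drug A in this population.",
    "So drug A is probably fine.",
]

blended = trajectory_reward(toy_steps, toy_rubric, answer_correct=True)
outcome_only = trajectory_reward(toy_steps, toy_rubric, True, process_weight=0.0)
```

In this toy case both trajectories reach the correct answer, but the blended reward is lower than the outcome-only one because the second step appraises no evidence. That gap is the signal a process-level reward adds; in a real system the keyword checks would be replaced by a learned or clinician-validated grader.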

The reported numbers, from the authors' own evaluation across seven medical benchmarks, are a 9% average gain over agentic-search baselines, 5.8% over outcome-only reinforcement learning, and a 15.5% lift in evidence-based medicine rubric scores against agentic-search Qwen3 baselines. Take those as reported, not settled: they come from a preprint, and the abstract doesn't break the gains down benchmark by benchmark or show where the wins concentrate.

What the reporting doesn't give you is whether working clinicians agree the reasoning traces are actually more faithful, or only that the automatic rubric says so. That is the load-bearing question if the goal is ever to put a system like this in front of a doctor. Still, the framing is the useful export: for domains where showing your work matters as much as the answer, step-level process rewards are becoming the tool of choice, and the code the team released on GitHub should let other groups swap in their own rubrics without rebuilding the training loop.
