Medical vision-language models (VLMs) are increasingly trained with reinforcement learning (RL) on top of supervised fine-tuning (SFT). But when does RL actually help, and how much of the apparent gain comes from the vision encoder, the SFT stage, or RL itself?
This project systematically disentangles these contributions across a range of medical imaging benchmarks, providing clearer guidance for practitioners on when RL is worth the extra complexity.
We study when reinforcement learning provides meaningful gains over supervised fine-tuning for medical vision-language models, disentangling the contributions of vision encoders, SFT, and RL to model performance.
@article{shaban2026medbridgerl,
  title={When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains},
  author={Jeddi, Ahmadreza and Shaban, Kimia and Baghbanzadeh, Negin and Sharan, Natasha and Moturu, Abhishek and Dolatabadi, Elham and Taati, Babak},
  year={2026},
  journal={arXiv preprint},
}