CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

A curriculum-guided RL framework (code-switching-aware SFT + GRPO) for reliable multilingual medical reasoning.

Eric Onyame1*, Akash Ghosh2*, Subhadip Baidya2, Sriparna Saha2, Xiuying Chen3, Chirag Agarwal1
1University of Virginia 2IIT Patna 3MBZUAI
*Equal contribution

Abstract

While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning, hindering their deployment in multilingual healthcare settings. We address this gap by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended queries, each with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) to jointly improve logical correctness and language stability. Across all thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs.

Overview

Figure 1: The CURE-Med training pipeline across 13 languages: Stage 0 curates clinically validated multilingual data, Stage 1 performs code-switching-aware supervised fine-tuning on code-switched reasoning traces, and Stage 2 applies GRPO-guided curriculum reinforcement learning from high- to low-resource languages.

Key Contributions

1. Systematic multilingual medical reasoning evaluation

We evaluate multilingual medical reasoning using verifiable open-ended queries, measuring both logical correctness and language consistency across languages.

2. CUREMED-BENCH dataset (13 languages)

A large-scale benchmark spanning high-, mid-, and low-resource settings, including underrepresented languages such as Amharic, Yoruba, and Swahili.

3. CURE-Med training: code-switching-aware SFT + GRPO curriculum RL

A two-stage training framework that jointly improves reasoning correctness and linguistic fidelity, with curriculum progression from high- to low-resource languages.

4. Strong performance and robustness

Extensive automatic and human evaluation shows consistent gains over strong baselines and improved generalization, including in low-resource settings.

Method

CURE-Med combines a clinically grounded benchmark with curriculum-guided reinforcement learning to improve medical correctness while keeping the final answer in the user’s language.

1. Data Collection & Human Verification

We construct CUREMED-BENCH from clinically validated sources and generate multilingual medical reasoning queries. Native speakers and medical experts verify clinical correctness and language consistency across all languages.
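To make the curation output concrete, the sketch below shows one way a verified benchmark item could be represented: an open-ended query paired with a single verifiable answer, a language code, a resource tier for the curriculum, and provenance of human verification. This is a minimal sketch; the field names are our assumptions, not the released schema.

from dataclasses import dataclass, field

@dataclass
class CureMedItem:
    """Hypothetical record layout for one CUREMED-BENCH query (field names assumed)."""
    query: str             # open-ended medical reasoning question
    answer: str            # single verifiable gold answer used by the reward verifier
    language: str          # ISO 639-1 code of the query language, e.g. "am" for Amharic
    resource_tier: str     # "high" | "mid" | "low", used by the curriculum scheduler
    source: str            # clinically validated source the item was derived from
    verified_by: list[str] = field(default_factory=list)  # native-speaker / expert reviewers

# Example item (contents illustrative only, not medical advice):
item = CureMedItem(
    query="A patient presents with fatigue and pallor. What is the most likely diagnosis?",
    answer="Iron-deficiency anemia",
    language="sw",
    resource_tier="low",
    source="clinically validated QA source",
)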

2. Supervised Fine-Tuning (Warm-start)

We warm-start the model with supervised fine-tuning on a code-switched reasoning dataset across the 13 languages. This stabilizes multi-step reasoning before reinforcement learning.
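The sketch below illustrates one way a code-switched SFT example could be assembled: the reasoning trace may switch languages mid-chain, while the final answer is pinned to the user's language. The <think>/<answer> tags and the helper itself are our assumptions for illustration, not the paper's exact format.

def build_sft_example(question: str, steps: list[tuple[str, str]],
                      final_answer: str, user_lang: str) -> dict:
    """steps: (language_code, reasoning_step) pairs; steps may switch language."""
    trace = "\n".join(f"[{lang}] {text}" for lang, text in steps)
    target = (
        "<think>\n" + trace + "\n</think>\n"
        + f'<answer lang="{user_lang}">{final_answer}</answer>'
    )
    return {"prompt": question, "completion": target}

# Illustrative example (not medical advice): a Swahili query whose trace
# starts in English and switches back to Swahili before the final answer.
example = build_sft_example(
    question="Mgonjwa ana homa na upele. Utambuzi gani unawezekana zaidi?",
    steps=[("en", "Fever plus rash suggests a viral exanthem."),
           ("sw", "Dalili zinaendana na surua.")],
    final_answer="Surua (measles)",
    user_lang="sw",
)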

3. Reward Design

We use a verifier-driven reward that encourages correct clinical conclusions while enforcing target-language fidelity and a clean, structured response format.
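A minimal sketch of such a composite reward, assuming an exact-match verifier, a pluggable language-ID function (e.g., a fastText wrapper), the <think>/<answer> output format from the SFT sketch above, and component weights of our own choosing rather than the paper's values:

import re

def reward(response: str, gold_answer: str, target_lang: str,
           detect_lang=None) -> float:
    """Composite reward: clinical correctness + target-language fidelity + format."""
    # Extract the final answer from the assumed structured format.
    m = re.search(r"<answer[^>]*>(.*?)</answer>", response, re.DOTALL)
    fmt_ok = ("<think>" in response) and (m is not None)
    answer = m.group(1).strip() if m else ""

    # Verifiable correctness: exact match against the single gold answer.
    correct = float(answer.lower() == gold_answer.strip().lower())
    # Language fidelity: the final answer must be in the user's language.
    lang_ok = float(detect_lang(answer) == target_lang) if (detect_lang and answer) else 0.0

    # Correctness dominates; fidelity and clean structure are enforced with smaller weights.
    return 1.0 * correct + 0.5 * lang_ok + 0.2 * float(fmt_ok)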

4. GRPO-Guided Curriculum Learning

After warm-starting, we apply GRPO with a language-resource curriculum, training progressively from higher-resource to lower-resource languages while retaining prior skills to reduce forgetting.
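At its core, GRPO needs no learned critic: for a group of G completions sampled per prompt with verifier rewards r_1, ..., r_G, completion i receives the group-relative advantage (r_i - mean(r)) / (std(r) + eps). The sketch below pairs that computation with a simple high-to-low-resource schedule that replays earlier-tier data to reduce forgetting; the replay ratio and scheduling are our assumptions, and CureMedItem refers to the record sketch above.

import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-4) -> np.ndarray:
    """Core of GRPO: normalize each completion's reward against its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_stages(items):
    """High-to-low-resource curriculum; each later stage replays a slice of
    earlier-tier data (the ~25% replay ratio is our assumption)."""
    seen = []
    for tier in ("high", "mid", "low"):
        current = [x for x in items if x.resource_tier == tier]
        replay = seen[: len(current) // 4]   # keep prior skills in the mix
        yield tier, current + replay
        seen += current

# Example: one prompt, four sampled answers scored by the verifier-driven reward.
adv = group_relative_advantages([1.7, 0.2, 1.2, 0.0])
print(adv)  # positive entries beat the group average; negative entries fall below it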

Live demo (interactive, on this page): code-switching reasoning in action. Illustrative example only, not medical advice.

Results

CURE-Med improves language consistency and logical accuracy across model sizes. Beyond CUREMED-BENCH, it also transfers out-of-domain to other multilingual medical benchmarks and outperforms strong medical LLM baselines. Select a model size to view gains vs. the base model, then inspect the figures.

Results at a glance

Pick a model size (ours vs. the base instruction model); the numbers below reflect the selected size.

Language Consistency: 57.60% (+53.60 pts over base)
Logical Accuracy: 28.32% (+22.02 pts over base)

Takeaway: CURE-Med improves medical reasoning while keeping the final answer in the user's language, and the gains strengthen with scale.
Figure 2: OOD benchmark transfer. CURE-Med maintains strong performance beyond CUREMED-BENCH across multiple multilingual medical QA benchmarks, matching or outperforming specialized medical LLM baselines, especially at larger scales.
Figure 3: Consistency-accuracy trade-off. Compared to baseline families, CURE-Med shifts performance toward the upper right, improving medical reasoning while keeping outputs in the user's language.
Figure 4: Scaling from 1.5B to 32B. CURE-Med stays above the base model on both language consistency and logical accuracy, with gains that persist and strengthen with scale.

Citation

If you use CURE-Med or CUREMED-BENCH, please cite our arXiv paper: arXiv:2601.13262.

@misc{onyame2026curemedcurriculuminformedreinforcementlearning,
  title={CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning},
  author={Eric Onyame and Akash Ghosh and Subhadip Baidya and Sriparna Saha and Xiuying Chen and Chirag Agarwal},
  year={2026},
  eprint={2601.13262},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.13262}
}

Built with curiosity — scaling trustworthy medical reasoning across diverse languages, for the benefit of all.

© CURE-Med · Project page