R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

GSAI, Renmin University of China · Beijing Academy of Artificial Intelligence

Abstract

We introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries drawn from 8 datasets and spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval.

We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance by generating intermediate reasoning before retrieval, the best result still peaks at 41.4 nDCG@10.

These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities.

Paper Overview

[Teaser figure] R2MED advances retrieval from semantic matching to reasoning.

Leaderboard

We report the average nDCG@10 score across the 8 datasets in R2MED (a sketch of the metric follows the leaderboard table). "LLM + Retriever" rows mean that the LLM first generates a hypothetical answer, which is then used as the retrieval query (HyDE); a minimal sketch of this pipeline is shown below. For implementation details, please refer to https://github.com/R2MED/R2MED/tree/main/src .
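The following is a minimal sketch of the generate-then-retrieve (HyDE-style) setup described above, assuming a generic LLM client and a dense encoder. The helper names `llm_generate` and `encode` are hypothetical stand-ins, not the repo's actual API.

```python
# Minimal sketch of the "LLM + Retriever" (HyDE-style) pipeline.
# `llm_generate` and `encode` are hypothetical placeholders.
import numpy as np

def llm_generate(query: str) -> str:
    """Stand-in for an LLM call (e.g., o3-mini) that drafts a hypothetical answer."""
    raise NotImplementedError  # plug in your LLM client here

def encode(texts: list[str]) -> np.ndarray:
    """Stand-in for a dense encoder (e.g., NV-Embed-v2) returning L2-normalized vectors."""
    raise NotImplementedError  # plug in your embedding model here

def hyde_retrieve(query: str, doc_vecs: np.ndarray, top_k: int = 10) -> list[int]:
    # Step 1: reason first -- let the LLM write a hypothetical passage answering the query.
    hypothetical = llm_generate(f"Write a medical passage that answers: {query}")
    # Step 2: retrieve with the generated passage instead of the raw query.
    q = encode([hypothetical])[0]
    scores = doc_vecs @ q                    # cosine similarity on unit vectors
    return np.argsort(-scores)[:top_k].tolist()
```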

🌟 Leaderboard submission 🌟
If you would like to submit your results to the leaderboard, email them to 2021000171@ruc.edu.cn! You are encouraged to include a link to an open-sourced codebase; otherwise, please provide a short description of the models and approaches used (e.g., the size of the retrieval model, and whether LLMs such as GPT-4 are involved).

| Rank | LLM | Retriever | Avg. | Biology | Bioinformatics | Medical Sciences | MedXpertQA-Exam | MedQA-Diag | PMC-Treatment | PMC-Clinical | IIYi-Clinical |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o3-mini | NV-Embed-v2 | 41.35 | 34.01 | 55.90 | 51.28 | 28.99 | 40.30 | 48.97 | 50.86 | 20.47 |
| 2 | o3-mini | BM25 | 41.01 | 59.65 | 46.56 | 47.17 | 34.64 | 55.22 | 41.65 | 35.32 | 7.86 |
| 3 | HuatuoGPT-o1-70B | NV-Embed-v2 | 39.56 | 31.25 | 52.81 | 49.55 | 25.25 | 38.33 | 48.93 | 48.57 | 21.77 |
| 4 | GPT4o | NV-Embed-v2 | 39.37 | 33.61 | 54.15 | 50.83 | 23.08 | 36.09 | 47.35 | 48.51 | 21.30 |
| 5 | o3-mini | Text-embedding-3-large | 39.09 | 31.50 | 48.61 | 48.19 | 31.74 | 39.51 | 51.69 | 38.92 | 22.56 |
| 6 | DeepSeek-R1-Distill-Llama-70B | NV-Embed-v2 | 38.52 | 32.83 | 53.31 | 50.32 | 22.98 | 33.78 | 47.04 | 46.53 | 21.35 |
| 7 | HuatuoGPT-o1-70B | Text-embedding-3-large | 38.24 | 29.77 | 47.70 | 48.40 | 27.46 | 36.79 | 52.93 | 39.26 | 23.63 |
| 8 | Search-O1 (QwQ-32b) | NV-Embed-v2 | 38.22 | 31.82 | 53.33 | 51.32 | 21.68 | 32.80 | 45.93 | 47.37 | 21.52 |
| 9 | Search-O1 (Qwen3-32b) | NV-Embed-v2 | 38.00 | 33.46 | 51.04 | 50.20 | 22.91 | 32.02 | 46.88 | 46.18 | 21.32 |
| 10 | Search-O1 (Qwen3-32b) | Text-embedding-3-large | 37.87 | 34.73 | 45.38 | 47.18 | 26.95 | 33.71 | 50.78 | 40.02 | 24.17 |
| 11 | GPT4o | Text-embedding-3-large | 37.79 | 32.15 | 45.99 | 47.97 | 27.28 | 36.92 | 51.24 | 38.96 | 21.82 |
| 12 | GPT4o | BM25 | 37.70 | 57.34 | 43.02 | 41.22 | 26.55 | 49.60 | 40.61 | 32.80 | 10.42 |
| 13 | DeepSeek-R1-Distill-Llama-70B | Text-embedding-3-large | 37.36 | 30.08 | 47.44 | 48.98 | 24.94 | 33.24 | 50.49 | 41.39 | 22.32 |
| 14 | Search-O1 (QwQ-32b) | Text-embedding-3-large | 37.30 | 32.43 | 46.95 | 47.84 | 25.19 | 31.22 | 51.67 | 36.58 | 26.51 |
| 15 | Llama3.1-70B-Ins | NV-Embed-v2 | 36.82 | 31.21 | 52.27 | 51.19 | 17.48 | 27.53 | 46.96 | 46.90 | 21.05 |
| 16 | QwQ-32B | NV-Embed-v2 | 36.82 | 32.26 | 52.43 | 49.91 | 21.08 | 31.29 | 46.14 | 41.06 | 20.38 |
| 17 | QwQ-32B | Text-embedding-3-large | 36.51 | 32.51 | 45.46 | 46.76 | 24.81 | 31.76 | 52.78 | 35.10 | 22.89 |
| 18 | DeepSeek-R1-Distill-Qwen-32B | NV-Embed-v2 | 36.08 | 33.10 | 51.82 | 49.39 | 18.78 | 27.38 | 45.94 | 42.16 | 20.05 |
| 19 | Llama3.1-70B-Ins | Text-embedding-3-large | 35.76 | 31.31 | 46.83 | 48.14 | 21.42 | 28.32 | 51.22 | 37.11 | 21.70 |
| 20 | Search-O1 (QwQ-32b) | BM25 | 35.57 | 60.78 | 41.58 | 43.93 | 24.07 | 39.94 | 37.52 | 28.38 | 8.34 |
| 21 | Qwen2.5-32B-Ins | NV-Embed-v2 | 35.34 | 31.34 | 52.35 | 49.76 | 16.40 | 22.77 | 45.31 | 43.40 | 21.35 |
| 22 | Search-R1 (Qwen2.5-7b-it-em-ppo) | NV-Embed-v2 | 35.11 | 30.84 | 50.66 | 49.07 | 15.05 | 20.46 | 47.36 | 45.49 | 21.95 |
| 23 | QwQ-32B | BM25 | 35.03 | 58.24 | 42.35 | 42.20 | 23.70 | 38.12 | 41.65 | 26.66 | 7.34 |
| 24 | DeepSeek-R1-Distill-Qwen-32B | Text-embedding-3-large | 34.68 | 30.39 | 44.89 | 47.62 | 20.33 | 27.89 | 49.04 | 36.26 | 21.04 |
| 25 | Qwen2.5-32B-Ins | Text-embedding-3-large | 34.25 | 31.37 | 45.46 | 46.44 | 21.13 | 24.69 | 48.05 | 34.45 | 22.38 |
| 26 | Search-O1 (Qwen3-32b) | BM25 | 33.84 | 56.84 | 39.48 | 43.01 | 22.00 | 34.42 | 41.62 | 26.20 | 7.16 |
| 27 | HuatuoGPT-o1-70B | BM25 | 33.43 | 49.77 | 40.02 | 38.47 | 21.91 | 39.54 | 39.30 | 28.38 | 10.01 |
| 28 | DeepSeek-R1-Distill-Llama-70B | BM25 | 33.29 | 49.52 | 38.86 | 39.48 | 22.12 | 38.40 | 34.95 | 33.86 | 9.13 |
| 29 | Search-R1 (Qwen2.5-7b-it-em-ppo) | Text-embedding-3-large | 32.89 | 28.30 | 43.60 | 48.03 | 15.20 | 18.57 | 48.38 | 38.15 | 22.92 |
| 30 | Qwen2.5-7B-Ins | NV-Embed-v2 | 32.69 | 30.12 | 49.95 | 49.39 | 13.37 | 19.49 | 42.99 | 38.36 | 17.86 |
| 31 | Llama3.1-70B-Ins | BM25 | 32.40 | 52.54 | 39.42 | 41.05 | 16.99 | 33.87 | 37.32 | 28.67 | 9.32 |
| 32 | Search-R1 (Qwen2.5-3b-it-em-ppo) | NV-Embed-v2 | 31.74 | 25.76 | 47.53 | 47.57 | 11.98 | 18.88 | 45.66 | 38.57 | 17.95 |
| 33 | Qwen2.5-7B-Ins | Text-embedding-3-large | 31.53 | 30.15 | 42.33 | 45.79 | 15.45 | 19.73 | 48.64 | 30.39 | 19.79 |
| 34 | - | NV-Embed-v2 | 31.43 | 27.15 | 50.10 | 47.81 | 10.90 | 16.72 | 44.05 | 39.91 | 14.81 |
| 35 | o3-mini | BGE-Large-en-v1.5 | 31.29 | 22.18 | 37.62 | 43.97 | 25.39 | 34.91 | 44.55 | 24.65 | 17.04 |
| 36 | - | GritLM-7B | 31.12 | 24.99 | 43.98 | 45.94 | 12.32 | 19.86 | 39.88 | 37.08 | 24.94 |
| 37 | Qwen2.5-32B-Ins | BM25 | 31.12 | 52.88 | 39.20 | 42.42 | 16.80 | 25.87 | 33.27 | 26.72 | 11.81 |
| 38 | - | SFR-Embedding-Mistral | 30.65 | 19.56 | 45.91 | 46.01 | 11.98 | 17.49 | 44.19 | 36.36 | 23.71 |
| 39 | Search-R1 (Qwen2.5-3b-it-em-ppo) | Text-embedding-3-large | 30.19 | 24.48 | 40.91 | 47.65 | 11.62 | 16.78 | 47.79 | 32.27 | 20.00 |
| 40 | - | BMRetriever-7B | 30.18 | 23.62 | 44.01 | 44.91 | 11.55 | 16.95 | 46.88 | 29.14 | 24.36 |
| 41 | DeepSeek-R1-Distill-Qwen-32B | BM25 | 29.05 | 48.90 | 38.80 | 38.28 | 16.04 | 25.21 | 31.55 | 22.47 | 11.15 |
| 42 | HuatuoGPT-o1-70B | BGE-Large-en-v1.5 | 28.98 | 18.88 | 35.50 | 39.72 | 22.00 | 27.48 | 43.29 | 26.46 | 18.50 |
| 43 | Search-O1 (QwQ-32b) | BGE-Large-en-v1.5 | 28.66 | 21.92 | 32.34 | 42.71 | 20.10 | 24.34 | 42.91 | 26.31 | 18.66 |
| 44 | GPT4o | BGE-Large-en-v1.5 | 28.63 | 22.59 | 32.97 | 41.29 | 19.45 | 27.18 | 45.43 | 23.28 | 16.85 |
| 45 | - | Text-embedding-3-large | 28.57 | 23.82 | 40.51 | 44.05 | 11.78 | 15.01 | 47.43 | 28.87 | 17.12 |
| 46 | DeepSeek-R1-Distill-Llama-70B | BGE-Large-en-v1.5 | 28.41 | 21.68 | 34.79 | 40.33 | 19.06 | 23.93 | 42.74 | 26.44 | 18.28 |
| 47 | Search-O1 (Qwen3-32b) | BGE-Large-en-v1.5 | 28.28 | 23.88 | 30.34 | 42.16 | 18.48 | 25.49 | 42.96 | 25.46 | 17.47 |
| 48 | Llama3.1-70B-Ins | BGE-Large-en-v1.5 | 28.18 | 21.40 | 37.06 | 41.44 | 15.91 | 22.85 | 45.78 | 24.61 | 16.36 |
| 49 | - | Voyage-3 | 27.34 | 25.42 | 38.98 | 41.63 | 8.74 | 9.36 | 45.28 | 28.68 | 20.64 |
| 50 | QwQ-32B | BGE-Large-en-v1.5 | 26.59 | 21.01 | 32.89 | 39.45 | 16.68 | 22.55 | 43.40 | 20.75 | 15.95 |
| 51 | DeepSeek-R1-Distill-Qwen-32B | BGE-Large-en-v1.5 | 26.40 | 19.61 | 33.42 | 40.76 | 14.78 | 19.30 | 42.38 | 23.84 | 17.10 |
| 52 | Qwen2.5-7B-Ins | BM25 | 26.38 | 48.49 | 29.38 | 41.75 | 12.17 | 19.48 | 26.63 | 24.48 | 8.67 |
| 53 | Qwen2.5-32B-Ins | BGE-Large-en-v1.5 | 26.19 | 23.60 | 33.76 | 41.51 | 14.41 | 17.98 | 39.77 | 22.23 | 16.22 |
| 54 | - | E5-mistral-7b-instruct | 24.92 | 18.81 | 42.86 | 41.77 | 6.70 | 11.54 | 23.58 | 31.17 | 22.93 |
| 55 | Search-R1 (Qwen2.5-7b-it-em-ppo) | BGE-Large-en-v1.5 | 24.78 | 19.14 | 30.38 | 37.31 | 12.53 | 14.81 | 38.93 | 24.51 | 20.60 |
| 56 | - | BMRetriever-2B | 24.69 | 19.50 | 33.30 | 39.45 | 9.97 | 9.31 | 38.01 | 25.65 | 22.30 |
| 57 | Search-R1 (Qwen2.5-7b-it-em-ppo) | BM25 | 24.56 | 36.51 | 26.92 | 34.62 | 10.02 | 13.88 | 34.98 | 28.18 | 11.37 |
| 58 | Qwen2.5-7B-Ins | BGE-Large-en-v1.5 | 24.00 | 19.89 | 30.18 | 40.70 | 12.98 | 15.05 | 41.02 | 17.09 | 15.07 |
| 59 | Search-R1 (Qwen2.5-3b-it-em-ppo) | BGE-Large-en-v1.5 | 22.14 | 16.80 | 26.83 | 36.74 | 8.40 | 12.17 | 37.15 | 22.84 | 16.16 |
| 60 | Search-R1 (Qwen2.5-3b-it-em-ppo) | BM25 | 20.03 | 34.38 | 20.31 | 30.04 | 5.47 | 13.18 | 31.66 | 19.69 | 5.53 |
| 61 | - | InstructOR-XL | 18.13 | 21.56 | 32.91 | 36.79 | 4.63 | 4.29 | 14.18 | 14.49 | 16.17 |
| 62 | - | BMRetriever-410M | 18.10 | 12.37 | 29.92 | 31.26 | 4.46 | 6.28 | 25.31 | 17.46 | 17.73 |
| 63 | - | BGE-Large-en-v1.5 | 17.02 | 12.71 | 27.04 | 27.76 | 4.10 | 8.33 | 26.45 | 15.06 | 14.72 |
| 64 | - | InstructOR-L | 16.21 | 15.82 | 29.71 | 36.88 | 3.84 | 4.81 | 15.84 | 9.02 | 13.77 |
| 65 | - | BM25 | 15.13 | 19.19 | 21.55 | 19.68 | 0.66 | 2.55 | 23.69 | 21.66 | 12.02 |
| 66 | - | Contriever | 11.76 | 9.15 | 18.02 | 25.22 | 1.71 | 2.52 | 11.47 | 13.40 | 12.57 |
| 67 | - | MedCPT | 9.02 | 2.15 | 17.57 | 14.74 | 1.68 | 2.02 | 11.33 | 14.62 | 8.03 |
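For reference, here is a minimal sketch of nDCG@10 (one common trec_eval-style formulation: linear gain, log2 discount) and of the Avg. column, which is the unweighted mean over the 8 datasets. The data layouts (`run`, `qrels`) and function names are illustrative, not the repo's actual evaluation code.

```python
# Sketch of nDCG@10 and the Avg. column; assumes graded relevance labels.
import math

def ndcg_at_10(ranked: list[str], rels: dict[str, int]) -> float:
    """nDCG@10 for one query: `ranked` is the retrieved doc-id list, `rels` the graded labels."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:10]))
    ideal = sorted(rels.values(), reverse=True)[:10]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def dataset_score(run: dict[str, list[str]], qrels: dict[str, dict[str, int]]) -> float:
    """Mean nDCG@10 over a dataset's queries, on the 0-100 scale used in the table."""
    return 100 * sum(ndcg_at_10(run[q], qrels.get(q, {})) for q in run) / len(run)

# Avg. column: average the eight per-dataset scores.
# avg = sum(dataset_score(runs[d], qrels[d]) for d in DATASETS) / len(DATASETS)
```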

BibTeX

@article{li2025r2med,
  title={R2MED: A Benchmark for Reasoning-Driven Medical Retrieval},
  author={Li, Lei and Zhou, Xiao and Liu, Zheng},
  journal={arXiv preprint arXiv:2505.14558},
  year={2025}
}