[2026/03] Our paper is available in the proceedings of AAAI2026 here.
[2026/02] Code and data are released.
[2026/01] Check out our video presentation on Underline!
[2025/11] Our paper is accepted by AAAI2026!
We propose RV-Bench, a novel framework for benchmarking LLMs' genuine mathematical reasoning capabilities via random variables. The paper is available here.
The GPU resources used in our study are 2×A100-SXM4-80G with CUDA 12.2; please adjust your GPU allocation according to the model size.
# Clone the repository
git clone https://github.com/Rcrossmeister/RV-Bench.git
cd ./RV-Bench
# Create the conda environment
conda create -n rvbench python=3.10
conda activate rvbench
# Install the required packages
pip install -r requirements.txt

Use generation/generate.py to produce RV questions from the question functions. Each question function generates multiple random variable (RV) instantiations of a mathematical problem.
# Generate all RV questions (115 MATH + 115 LeetCode, 5 per group)
python -m generation.generate --source all --num_per_group 5 --seed 42 --output_dir ./data
# Generate only MATH-RV questions
python -m generation.generate --source math --num_per_group 5 --seed 42
# Generate only LeetCode-RV questions
python -m generation.generate --source leetcode --num_per_group 5 --seed 42
# Generate for a specific question function
python -m generation.generate --qf_id math_001 --num_per_group 10

The generated questions are saved as JSON files in the --output_dir directory (default: ./data).
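The question-function idea can be sketched as follows. This is an illustrative example only: the function name, id scheme, and output fields here are assumptions for exposition, not the repo's actual API.

```python
import random

def sample_triangle_area_question(rng: random.Random) -> dict:
    # Hypothetical question function: draws random variable values,
    # renders the question text, and computes the ground-truth answer.
    base = rng.randint(2, 20)
    height = rng.randint(2, 20)
    question = (
        f"A triangle has base {base} and height {height}. "
        "What is its area?"
    )
    answer = base * height / 2
    return {"id": "math_001", "question": question, "answer": answer}

# Generate five RV instantiations of the same underlying problem,
# mirroring --num_per_group 5 with a fixed seed.
rng = random.Random(42)
group = [sample_triangle_area_question(rng) for _ in range(5)]
```

Because each instantiation redraws the random variables, a model cannot rely on having memorized any single surface form of the problem.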
Use evaluation/evaluate.py to compute the four RV-Bench metrics (Acc, GA@n, CR, OOR) without any external dependencies.
# Basic evaluation (computes Acc, GA@n, CR)
python -m evaluation.evaluate \
--predictions ./results/predictions.json \
--labels ./data/math_rv.json \
--output_dir ./results
# Full evaluation including OOR metric (requires original SP data)
python -m evaluation.evaluate \
--predictions ./results/rv_predictions.json \
--labels ./data/math_rv.json \
--sp_predictions ./results/sp_predictions.json \
--sp_labels ./data/math_sp.json \
--output_dir ./results

The prediction file should be a JSON list in which each entry has "id" and "answer" fields matching the label file format.
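A minimal sketch of writing and sanity-checking a prediction file in the expected shape. The ids and answers are made up for illustration; only the "id"/"answer" field names come from the text above.

```python
import json

# Hypothetical minimal prediction file: a JSON list where
# every entry carries "id" and "answer" fields.
predictions = [
    {"id": "math_001_rv0", "answer": "42"},
    {"id": "math_001_rv1", "answer": "17/3"},
]
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)

# Quick sanity check before running evaluation/evaluate.py:
# confirm every entry has the required fields.
with open("predictions.json") as f:
    loaded = json.load(f)
assert all({"id", "answer"} <= set(entry) for entry in loaded)
```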
We use OpenCompass as the evaluation framework (many thanks to the OpenCompass team for their excellent work!). Please install OpenCompass first:
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .

Then, copy the dataset and configuration files into your OpenCompass installation:
- Copy the custom dataset classes:
cp evaluation/opencompass_config/mathrv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcoderv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/mathsp.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcodesp.py <opencompass_root>/opencompass/datasets/

- Register the datasets in <opencompass_root>/opencompass/datasets/__init__.py:
from .mathrv import MATHRVDataset, MATHRVEvaluator
from .leetcoderv import LeetCodeRVDataset, LeetCodeRVEvaluator
from .mathsp import MATHSpDataset, MATHSpEvaluator
from .leetcodesp import LeetCodeSpDataset, LeetCodeSpEvaluator

- Copy the dataset configuration files:
mkdir -p <opencompass_root>/configs/datasets/mathrv
mkdir -p <opencompass_root>/configs/datasets/leetcoderv
mkdir -p <opencompass_root>/configs/datasets/mathsp
mkdir -p <opencompass_root>/configs/datasets/leetcodesp
cp evaluation/opencompass_config/mathrv_gen.py <opencompass_root>/configs/datasets/mathrv/
cp evaluation/opencompass_config/leetcoderv_gen.py <opencompass_root>/configs/datasets/leetcoderv/
cp evaluation/opencompass_config/mathsp_gen.py <opencompass_root>/configs/datasets/mathsp/
cp evaluation/opencompass_config/leetcodesp_gen.py <opencompass_root>/configs/datasets/leetcodesp/

- Copy the data files:
mkdir -p <opencompass_root>/data/mathrv
mkdir -p <opencompass_root>/data/leetcoderv
mkdir -p <opencompass_root>/data/mathsp
mkdir -p <opencompass_root>/data/leetcodesp
cp data/math_rv.json <opencompass_root>/data/mathrv/mathrv.jsonl
cp data/leetcode_rv.json <opencompass_root>/data/leetcoderv/leetcoderv.jsonl
cp data/math_sp.json <opencompass_root>/data/mathsp/mathsp.jsonl
cp data/leetcode_sp.json <opencompass_root>/data/leetcodesp/leetcodesp.jsonl

- Run evaluation (using vLLM acceleration as an example):
cd <opencompass_root>
# Evaluate open-source models with vLLM
CUDA_VISIBLE_DEVICES=0,1 python run.py \
--models hf_qwen2_5_7b_instruct \
--datasets mathrv_gen leetcoderv_gen mathsp_gen leetcodesp_gen \
--mode infer -a vllm

Rankings are based on RV-Bench Acc (overall accuracy), the overall exact-match accuracy across all RVQs in both MATH-RV and LeetCode-RV. For the proposed metrics Acc, GA@5, and CR, higher is better; for OOR, lower is better. A ~ in the "Size" column indicates a proprietary model whose size is not publicly disclosed.
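To make the group-level metrics concrete, here is an illustrative computation of Acc and GA@n. The `results` dict, its id scheme (`<group>_rv<k>`), and the helper name are assumptions for this sketch; the repo's evaluation/evaluate.py is the authoritative implementation, and CR and OOR are defined in the paper.

```python
from collections import defaultdict

def acc_and_ga(results: dict) -> tuple:
    """Illustrative Acc and GA@n over grouped RV questions.

    `results` maps an RV question id such as "math_001_rv3" to a
    bool (exact-match correctness). GA@n counts a question group as
    correct only when all of its instantiations are answered
    correctly.
    """
    groups = defaultdict(list)
    for qid, correct in results.items():
        group_id = qid.rsplit("_rv", 1)[0]  # assumed id scheme
        groups[group_id].append(correct)
    flat = [c for g in groups.values() for c in g]
    acc = sum(flat) / len(flat)                       # per-question accuracy
    ga = sum(all(g) for g in groups.values()) / len(groups)  # per-group accuracy
    return acc, ga

# Toy example: two groups of five instantiations each;
# one group is fully correct, the other has one miss.
results = {
    "math_001_rv0": True, "math_001_rv1": True, "math_001_rv2": True,
    "math_001_rv3": True, "math_001_rv4": True,
    "math_002_rv0": True, "math_002_rv1": False, "math_002_rv2": True,
    "math_002_rv3": True, "math_002_rv4": True,
}
acc, ga = acc_and_ga(results)
# acc = 9/10 = 0.9, ga = 1/2 = 0.5
```

GA@5 is deliberately stricter than Acc: a single inconsistent answer within a group zeroes out that group, which is why the GA@5 columns below sit well under the Acc columns.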
| # | Model | Size | MATH-Sp Acc | MATH-RV Acc | MATH-RV GA@5 | MATH-RV CR | MATH-RV OOR | LC-Sp Acc | LC-RV Acc | LC-RV GA@5 | LC-RV CR | LC-RV OOR | RV-Bench Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o3-mini | ~ | 97.39 | 92.52 | 82.61 | 87.83 | 6.09 | 82.61 | 77.57 | 61.74 | 67.83 | 6.09 | 85.05 |
| 2 | DeepSeek-R1 | 671B | 100.00 | 92.52 | 85.22 | 88.70 | 6.09 | 80.00 | 72.17 | 52.17 | 57.39 | 5.22 | 82.35 |
| 3 | o1-mini | ~ | 90.43 | 84.00 | 67.83 | 80.87 | 5.22 | 76.52 | 66.09 | 41.74 | 51.30 | 6.09 | 75.05 |
| 4 | Gemini-2.0-Pro | ~ | 92.17 | 84.17 | 71.30 | 78.26 | 8.70 | 72.17 | 60.17 | 34.78 | 42.61 | 8.70 | 72.17 |
| 5 | DeepSeek-v3 | 671B | 89.57 | 85.04 | 72.17 | 76.52 | 5.22 | 66.09 | 58.26 | 34.78 | 37.39 | 12.17 | 71.65 |
| 6 | GLM-Zero-Preview | ~ | 92.17 | 83.13 | 65.22 | 77.39 | 6.09 | 66.96 | 60.00 | 35.65 | 44.35 | 9.57 | 71.57 |
| 7 | QwQ-32B-Preview | 32B | 91.30 | 83.83 | 60.87 | 79.13 | 5.22 | 62.61 | 58.96 | 30.43 | 42.61 | 7.83 | 71.40 |
| 8 | Claude-3.5-Sonnet | ~ | 88.70 | 80.35 | 63.48 | 73.04 | 6.09 | 70.43 | 61.39 | 35.65 | 42.61 | 8.70 | 70.87 |
| 9 | Qwen2.5-Max | ~ | 88.70 | 81.39 | 63.48 | 74.78 | 6.96 | 72.17 | 58.43 | 33.04 | 42.61 | 12.17 | 69.91 |
| 10 | Qwen2.5-72B-It | 72B | 87.83 | 81.04 | 62.61 | 76.52 | 6.09 | 66.09 | 58.43 | 29.57 | 40.00 | 10.43 | 69.74 |
| 11 | Qwen2.5-32B-It | 32B | 90.43 | 80.00 | 61.74 | 73.91 | 4.35 | 69.57 | 55.48 | 26.09 | 39.13 | 12.17 | 67.74 |
| 12 | GLM-4-Plus | ~ | 86.09 | 77.91 | 53.91 | 71.30 | 6.96 | 66.96 | 55.30 | 26.96 | 38.26 | 14.78 | 66.61 |
| 13 | o1-preview | ~ | 80.87 | 75.83 | 42.61 | 59.13 | 6.96 | 66.09 | 54.78 | 32.17 | 40.87 | 9.57 | 65.31 |
| 14 | GPT-4o | ~ | 83.48 | 76.70 | 57.39 | 63.48 | 6.09 | 61.74 | 50.09 | 20.00 | 32.17 | 13.04 | 63.40 |
| 15 | Phi-4 | 14B | 77.39 | 72.00 | 53.04 | 61.74 | 8.70 | 60.00 | 54.78 | 26.96 | 34.78 | 9.57 | 63.39 |
Please cite our paper if you include RV-Bench in your work:
@inproceedings{hong2026benchmarking,
title = {Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions},
author = {Hong, Zijin and Wu, Hao and Dong, Su and Dong, Junnan and Xiao, Yilin and Zhang, Yujing and Wang, Zhu and Huang, Feiran and Li, Linyi and Yang, Hongxia and Huang, Xiao},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026}
}

Feel free to reach out via email if you need any help:
zijin.hong@connect.polyu.hk
