
RV-Bench - Benchmarking LLMs via Random Variables

[2026/03] Our paper is available in the AAAI 2026 proceedings here.

[2026/02] Code and data are released.

[2026/01] Check out our video presentation on Underline!

[2025/11] Our paper has been accepted to AAAI 2026!

We propose RV-Bench, a novel framework for benchmarking LLMs' genuine mathematical reasoning capabilities via random variables. The paper is available here.

Framework

Setup

Environment

The GPU resources used in our study are 2x A100-SXM4-80GB with CUDA 12.2. Please adjust your GPU configuration according to the model size.

# Clone the repository
git clone https://github.com/Rcrossmeister/RV-Bench.git
cd ./RV-Bench

# Create the conda environment
conda create -n rvbench python=3.10
conda activate rvbench

# Install the required packages
pip install -r requirements.txt
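
After installation, it may be worth confirming that the GPUs are visible before launching any inference. A minimal check, assuming PyTorch is installed as part of requirements.txt:

import torch  # assumption: PyTorch is pulled in by requirements.txt

print(torch.cuda.is_available())   # True if at least one CUDA device is visible
print(torch.cuda.device_count())   # e.g. 2 for the 2x A100 setup described above
print(torch.version.cuda)          # CUDA version PyTorch was built against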

Usage

Question Generation

Use generation/generate.py to produce RV questions from the question functions. Each question function generates multiple random variable (RV) instantiations of a mathematical problem.
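
Conceptually, a question function is a parameterized problem template: it draws random values for the variables, renders the problem text, and computes the matching ground-truth answer from the same draw. The sketch below is illustrative only; the function name and field layout (sample_instances, the id scheme, etc.) are hypothetical and do not mirror the actual code in generation/generate.py.

import random

def sample_instances(num_per_group: int, seed: int = 42):
    """Hypothetical question function: a rectangle-area template with random side lengths."""
    rng = random.Random(seed)
    instances = []
    for i in range(num_per_group):
        a, b = rng.randint(2, 30), rng.randint(2, 30)  # the random variables of this template
        instances.append({
            "id": f"math_xxx_{i}",  # placeholder id scheme
            "question": f"A rectangle has side lengths {a} and {b}. What is its area?",
            "answer": str(a * b),   # ground truth computed from the same random draw
        })
    return instances

if __name__ == "__main__":
    for item in sample_instances(num_per_group=5):
        print(item["question"], "->", item["answer"])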

# Generate all RV questions (115 MATH + 115 LeetCode, 5 per group)
python -m generation.generate --source all --num_per_group 5 --seed 42 --output_dir ./data

# Generate only MATH-RV questions
python -m generation.generate --source math --num_per_group 5 --seed 42

# Generate only LeetCode-RV questions
python -m generation.generate --source leetcode --num_per_group 5 --seed 42

# Generate for a specific question function
python -m generation.generate --qf_id math_001 --num_per_group 10

The generated questions will be saved as JSON files in the --output_dir directory (default: ./data).
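
To inspect the output, the generated files can be loaded with the standard json module. The path below assumes the default --output_dir and the math_rv.json file name used later in this README; adjust it to your own settings.

import json

# Load the generated MATH-RV questions (path assumes the defaults above)
with open("./data/math_rv.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

print(f"{len(questions)} generated questions")
print(sorted(questions[0].keys()))  # inspect the fields of the first entry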

Evaluation (Standalone)

Use evaluation/evaluate.py to compute the four RV-Bench metrics (Acc, GA@n, CR, OOR) without any external dependencies.
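
The authoritative metric definitions live in evaluation/evaluate.py and the paper. Purely as an illustration of the grouping idea, the sketch below computes plain exact-match accuracy plus an "all instantiations in a group correct" rate; the assumption that GA@n credits a group only when all n RV instantiations are answered correctly is ours for illustration, not a statement of the repository's exact logic.

from collections import defaultdict

def exact_match_metrics(predictions, labels, group_of):
    """Illustrative only: overall accuracy and a per-group all-correct rate.

    predictions: dict mapping question id -> predicted answer string
    labels:      list of {"id": ..., "answer": ...} entries
    group_of:    function mapping a question id to its RV group id (an assumption
                 about how instantiations are grouped, not the repo's actual logic)
    """
    gold = {item["id"]: item["answer"] for item in labels}
    correct = {qid: predictions.get(qid) == ans for qid, ans in gold.items()}

    acc = sum(correct.values()) / len(correct)

    groups = defaultdict(list)
    for qid, ok in correct.items():
        groups[group_of(qid)].append(ok)
    group_all_correct = sum(all(flags) for flags in groups.values()) / len(groups)

    return acc, group_all_correct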

# Basic evaluation (computes Acc, GA@n, CR)
python -m evaluation.evaluate \
    --predictions ./results/predictions.json \
    --labels ./data/math_rv.json \
    --output_dir ./results

# Full evaluation including OOR metric (requires original SP data)
python -m evaluation.evaluate \
    --predictions ./results/rv_predictions.json \
    --labels ./data/math_rv.json \
    --sp_predictions ./results/sp_predictions.json \
    --sp_labels ./data/math_sp.json \
    --output_dir ./results

The prediction file should be a JSON list where each entry has "id" and "answer" fields matching the label file format.
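
As a rough illustration of that format (the ids and answers below are placeholders, not real RV-Bench entries), a predictions file could be assembled like this:

import json
import os

# Each entry pairs a question id from the label file with the model's final answer.
predictions = [
    {"id": "math_001_0", "answer": "42"},   # placeholder id/answer values
    {"id": "math_001_1", "answer": "3/4"},
]

os.makedirs("./results", exist_ok=True)
with open("./results/predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)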

Evaluation (OpenCompass)

We use OpenCompass as the evaluation framework (many thanks to the OpenCompass team for their excellent work!). Please install OpenCompass first:

git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .

Then, copy the dataset and configuration files into your OpenCompass installation:

  1. Copy the custom dataset classes:
cp evaluation/opencompass_config/mathrv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcoderv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/mathsp.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcodesp.py <opencompass_root>/opencompass/datasets/
  2. Register the datasets in <opencompass_root>/opencompass/datasets/__init__.py:
from .mathrv import MATHRVDataset, MATHRVEvaluator
from .leetcoderv import LeetCodeRVDataset, LeetCodeRVEvaluator
from .mathsp import MATHSpDataset, MATHSpEvaluator
from .leetcodesp import LeetCodeSpDataset, LeetCodeSpEvaluator
  3. Copy the dataset configuration files:
mkdir -p <opencompass_root>/configs/datasets/mathrv
mkdir -p <opencompass_root>/configs/datasets/leetcoderv
mkdir -p <opencompass_root>/configs/datasets/mathsp
mkdir -p <opencompass_root>/configs/datasets/leetcodesp

cp evaluation/opencompass_config/mathrv_gen.py <opencompass_root>/configs/datasets/mathrv/
cp evaluation/opencompass_config/leetcoderv_gen.py <opencompass_root>/configs/datasets/leetcoderv/
cp evaluation/opencompass_config/mathsp_gen.py <opencompass_root>/configs/datasets/mathsp/
cp evaluation/opencompass_config/leetcodesp_gen.py <opencompass_root>/configs/datasets/leetcodesp/
  4. Copy the data files:
mkdir -p <opencompass_root>/data/mathrv
mkdir -p <opencompass_root>/data/leetcoderv
mkdir -p <opencompass_root>/data/mathsp
mkdir -p <opencompass_root>/data/leetcodesp

cp data/math_rv.json <opencompass_root>/data/mathrv/mathrv.jsonl
cp data/leetcode_rv.json <opencompass_root>/data/leetcoderv/leetcoderv.jsonl
cp data/math_sp.json <opencompass_root>/data/mathsp/mathsp.jsonl
cp data/leetcode_sp.json <opencompass_root>/data/leetcodesp/leetcodesp.jsonl
  5. Run evaluation (using vLLM acceleration as an example):
cd <opencompass_root>

# Evaluate open-source models with vLLM
CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --models hf_qwen2_5_7b_instruct \
    --datasets mathrv_gen leetcoderv_gen mathsp_gen leetcodesp_gen \
    --mode infer -a vllm

RV-Bench Leaderboard

Rankings are based on RV-Bench Acc (overall accuracy), i.e., the exact-match accuracy across all RVQs in both MATH-RV and LeetCode-RV. For Acc, GA@5, and CR, higher values are better; for OOR, lower values are better. A ~ in the "Size" column indicates a proprietary model whose size is not publicly disclosed.

| # | Model | Size | MATH-Sp Acc | MATH-RV Acc | MATH-RV GA@5 | MATH-RV CR | MATH-RV OOR | LC-Sp Acc | LC-RV Acc | LC-RV GA@5 | LC-RV CR | LC-RV OOR | RV-Bench Acc |
|---|-------|------|-------------|-------------|--------------|------------|-------------|-----------|-----------|------------|----------|-----------|--------------|
| 1 | o3-mini | ~ | 97.39 | 92.52 | 82.61 | 87.83 | 6.09 | 82.61 | 77.57 | 61.74 | 67.83 | 6.09 | 85.05 |
| 2 | DeepSeek-R1 | 671B | 100.00 | 92.52 | 85.22 | 88.70 | 6.09 | 80.00 | 72.17 | 52.17 | 57.39 | 5.22 | 82.35 |
| 3 | o1-mini | ~ | 90.43 | 84.00 | 67.83 | 80.87 | 5.22 | 76.52 | 66.09 | 41.74 | 51.30 | 6.09 | 75.05 |
| 4 | Gemini-2.0-Pro | ~ | 92.17 | 84.17 | 71.30 | 78.26 | 8.70 | 72.17 | 60.17 | 34.78 | 42.61 | 8.70 | 72.17 |
| 5 | DeepSeek-v3 | 671B | 89.57 | 85.04 | 72.17 | 76.52 | 5.22 | 66.09 | 58.26 | 34.78 | 37.39 | 12.17 | 71.65 |
| 6 | GLM-Zero-Preview | ~ | 92.17 | 83.13 | 65.22 | 77.39 | 6.09 | 66.96 | 60.00 | 35.65 | 44.35 | 9.57 | 71.57 |
| 7 | QwQ-32B-Preview | 32B | 91.30 | 83.83 | 60.87 | 79.13 | 5.22 | 62.61 | 58.96 | 30.43 | 42.61 | 7.83 | 71.40 |
| 8 | Claude-3.5-Sonnet | ~ | 88.70 | 80.35 | 63.48 | 73.04 | 6.09 | 70.43 | 61.39 | 35.65 | 42.61 | 8.70 | 70.87 |
| 9 | Qwen2.5-Max | ~ | 88.70 | 81.39 | 63.48 | 74.78 | 6.96 | 72.17 | 58.43 | 33.04 | 42.61 | 12.17 | 69.91 |
| 10 | Qwen2.5-72B-It | 72B | 87.83 | 81.04 | 62.61 | 76.52 | 6.09 | 66.09 | 58.43 | 29.57 | 40.00 | 10.43 | 69.74 |
| 11 | Qwen2.5-32B-It | 32B | 90.43 | 80.00 | 61.74 | 73.91 | 4.35 | 69.57 | 55.48 | 26.09 | 39.13 | 12.17 | 67.74 |
| 12 | GLM-4-Plus | ~ | 86.09 | 77.91 | 53.91 | 71.30 | 6.96 | 66.96 | 55.30 | 26.96 | 38.26 | 14.78 | 66.61 |
| 13 | o1-preview | ~ | 80.87 | 75.83 | 42.61 | 59.13 | 6.96 | 66.09 | 54.78 | 32.17 | 40.87 | 9.57 | 65.31 |
| 14 | GPT-4o | ~ | 83.48 | 76.70 | 57.39 | 63.48 | 6.09 | 61.74 | 50.09 | 20.00 | 32.17 | 13.04 | 63.40 |
| 15 | Phi-4 | 14B | 77.39 | 72.00 | 53.04 | 61.74 | 8.70 | 60.00 | 54.78 | 26.96 | 34.78 | 9.57 | 63.39 |

Citation

Please cite our paper if you use RV-Bench in your work:

@inproceedings{hong2026benchmarking,
    title = {Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions},
    author = {Hong, Zijin and Wu, Hao and Dong, Su and Dong, Junnan and Xiao, Yilin and Zhang, Yujing and Wang, Zhu and Huang, Feiran and Li, Linyi and Yang, Hongxia and Huang, Xiao},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
    year = {2026}
}

Feel free to reach out via email if you need any help:

zijin.hong@connect.polyu.hk
