
RV-Bench - Benchmarking LLMs via Random Variables

[2026/03] Our paper is available in the AAAI 2026 proceedings here.

[2026/02] Code and data are released.

[2026/01] Check out our video presentation on Underline!

[2025/11] Our paper has been accepted to AAAI 2026!

We propose RV-Bench, a novel framework for benchmarking LLMs' genuine mathematical reasoning capabilities via random variables. The paper is available here.

Framework

Setup

Environment

The GPU resources used in our study are 2x A100-SXM4-80GB with CUDA 12.2. Please adjust your GPU configuration according to the model size.

# Clone the repository
git clone https://github.com/Rcrossmeister/RV-Bench.git
cd ./RV-Bench

# Create the conda environment
conda create -n rvbench python=3.10
conda activate rvbench

# Install the required packages
pip install -r requirements.txt
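
After installation, it may be worth confirming that the GPUs are visible before launching any inference. A minimal check, assuming PyTorch is installed as part of requirements.txt:

import torch  # assumption: PyTorch is pulled in by requirements.txt

print(torch.cuda.is_available())   # True if at least one CUDA device is visible
print(torch.cuda.device_count())   # e.g. 2 for the 2x A100 setup described above
print(torch.version.cuda)          # CUDA version PyTorch was built against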

Usage

Question Generation

Use generation/generate.py to produce RV questions from the question functions. Each question function generates multiple random variable (RV) instantiations of a mathematical problem.
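
Conceptually, a question function is a parameterized problem template: it draws random values for the variables, renders the problem text, and computes the matching ground-truth answer from the same draw. The sketch below is illustrative only; the function name and field layout (sample_instances, the id scheme, etc.) are hypothetical and do not mirror the actual code in generation/generate.py.

import random

def sample_instances(num_per_group: int, seed: int = 42):
    """Hypothetical question function: a rectangle-area template with random side lengths."""
    rng = random.Random(seed)
    instances = []
    for i in range(num_per_group):
        a, b = rng.randint(2, 30), rng.randint(2, 30)  # the random variables of this template
        instances.append({
            "id": f"math_xxx_{i}",  # placeholder id scheme
            "question": f"A rectangle has side lengths {a} and {b}. What is its area?",
            "answer": str(a * b),   # ground truth computed from the same random draw
        })
    return instances

if __name__ == "__main__":
    for item in sample_instances(num_per_group=5):
        print(item["question"], "->", item["answer"])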

# Generate all RV questions (115 MATH + 115 LeetCode, 5 per group)
python -m generation.generate --source all --num_per_group 5 --seed 42 --output_dir ./data

# Generate only MATH-RV questions
python -m generation.generate --source math --num_per_group 5 --seed 42

# Generate only LeetCode-RV questions
python -m generation.generate --source leetcode --num_per_group 5 --seed 42

# Generate for a specific question function
python -m generation.generate --qf_id math_001 --num_per_group 10

The generated questions will be saved as JSON files in the --output_dir directory (default: ./data).
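
To inspect the output, the generated files can be loaded with the standard json module. The path below assumes the default --output_dir and the math_rv.json file name used later in this README; adjust it to your own settings.

import json

# Load the generated MATH-RV questions (path assumes the defaults above)
with open("./data/math_rv.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

print(f"{len(questions)} generated questions")
print(sorted(questions[0].keys()))  # inspect the fields of the first entry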

Evaluation (Standalone)

Use evaluation/evaluate.py to compute the four RV-Bench metrics (Acc, GA@n, CR, OOR) without any external dependencies.
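
The authoritative metric definitions live in evaluation/evaluate.py and the paper. Purely as an illustration of the grouping idea, the sketch below computes plain exact-match accuracy plus an "all instantiations in a group correct" rate; the assumption that GA@n credits a group only when all n RV instantiations are answered correctly is ours for illustration, not a statement of the repository's exact logic.

from collections import defaultdict

def exact_match_metrics(predictions, labels, group_of):
    """Illustrative only: overall accuracy and a per-group all-correct rate.

    predictions: dict mapping question id -> predicted answer string
    labels:      list of {"id": ..., "answer": ...} entries
    group_of:    function mapping a question id to its RV group id (an assumption
                 about how instantiations are grouped, not the repo's actual logic)
    """
    gold = {item["id"]: item["answer"] for item in labels}
    correct = {qid: predictions.get(qid) == ans for qid, ans in gold.items()}

    acc = sum(correct.values()) / len(correct)

    groups = defaultdict(list)
    for qid, ok in correct.items():
        groups[group_of(qid)].append(ok)
    group_all_correct = sum(all(flags) for flags in groups.values()) / len(groups)

    return acc, group_all_correct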

# Basic evaluation (computes Acc, GA@n, CR)
python -m evaluation.evaluate \
    --predictions ./results/predictions.json \
    --labels ./data/math_rv.json \
    --output_dir ./results

# Full evaluation including OOR metric (requires original SP data)
python -m evaluation.evaluate \
    --predictions ./results/rv_predictions.json \
    --labels ./data/math_rv.json \
    --sp_predictions ./results/sp_predictions.json \
    --sp_labels ./data/math_sp.json \
    --output_dir ./results

The prediction file should be a JSON list where each entry has "id" and "answer" fields matching the label file format.
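
As a rough illustration of that format (the ids and answers below are placeholders, not real RV-Bench entries), a predictions file could be assembled like this:

import json
import os

# Each entry pairs a question id from the label file with the model's final answer.
predictions = [
    {"id": "math_001_0", "answer": "42"},   # placeholder id/answer values
    {"id": "math_001_1", "answer": "3/4"},
]

os.makedirs("./results", exist_ok=True)
with open("./results/predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)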

Evaluation (OpenCompass)

We use OpenCompass as the evaluation framework (many thanks to the OpenCompass team for their excellent work!). Please install OpenCompass first:

git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .

Then, copy the dataset and configuration files into your OpenCompass installation:

  1. Copy the custom dataset classes:
cp evaluation/opencompass_config/mathrv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcoderv.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/mathsp.py <opencompass_root>/opencompass/datasets/
cp evaluation/opencompass_config/leetcodesp.py <opencompass_root>/opencompass/datasets/
  2. Register the datasets in <opencompass_root>/opencompass/datasets/__init__.py:
from .mathrv import MATHRVDataset, MATHRVEvaluator
from .leetcoderv import LeetCodeRVDataset, LeetCodeRVEvaluator
from .mathsp import MATHSpDataset, MATHSpEvaluator
from .leetcodesp import LeetCodeSpDataset, LeetCodeSpEvaluator
  3. Copy the dataset configuration files:
mkdir -p <opencompass_root>/configs/datasets/mathrv
mkdir -p <opencompass_root>/configs/datasets/leetcoderv
mkdir -p <opencompass_root>/configs/datasets/mathsp
mkdir -p <opencompass_root>/configs/datasets/leetcodesp

cp evaluation/opencompass_config/mathrv_gen.py <opencompass_root>/configs/datasets/mathrv/
cp evaluation/opencompass_config/leetcoderv_gen.py <opencompass_root>/configs/datasets/leetcoderv/
cp evaluation/opencompass_config/mathsp_gen.py <opencompass_root>/configs/datasets/mathsp/
cp evaluation/opencompass_config/leetcodesp_gen.py <opencompass_root>/configs/datasets/leetcodesp/
  4. Copy the data files:
mkdir -p <opencompass_root>/data/mathrv
mkdir -p <opencompass_root>/data/leetcoderv
mkdir -p <opencompass_root>/data/mathsp
mkdir -p <opencompass_root>/data/leetcodesp

cp data/math_rv.json <opencompass_root>/data/mathrv/mathrv.jsonl
cp data/leetcode_rv.json <opencompass_root>/data/leetcoderv/leetcoderv.jsonl
cp data/math_sp.json <opencompass_root>/data/mathsp/mathsp.jsonl
cp data/leetcode_sp.json <opencompass_root>/data/leetcodesp/leetcodesp.jsonl
  5. Run evaluation (using vLLM acceleration as an example):
cd <opencompass_root>

# Evaluate open-source models with vLLM
CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --models hf_qwen2_5_7b_instruct \
    --datasets mathrv_gen leetcoderv_gen mathsp_gen leetcodesp_gen \
    --mode infer -a vllm

RV-Bench Leaderboard

Rankings are based on RV-Bench Acc (overall accuracy), i.e., the exact-match accuracy across all RVQs in both MATH-RV and LeetCode-RV. For Acc, GA@5, and CR, higher values are better; for OOR, lower values are better. A ~ in the "Size" column indicates a proprietary model whose size is not publicly disclosed.

| # | Model | Size | MATH-Sp Acc | MATH-RV Acc | MATH-RV GA@5 | MATH-RV CR | MATH-RV OOR | LC-Sp Acc | LC-RV Acc | LC-RV GA@5 | LC-RV CR | LC-RV OOR | RV-Bench Acc |
|---|-------|------|-------------|-------------|--------------|------------|-------------|-----------|-----------|------------|----------|-----------|--------------|
| 1 | o3-mini | ~ | 97.39 | 92.52 | 82.61 | 87.83 | 6.09 | 82.61 | 77.57 | 61.74 | 67.83 | 6.09 | 85.05 |
| 2 | DeepSeek-R1 | 671B | 100.00 | 92.52 | 85.22 | 88.70 | 6.09 | 80.00 | 72.17 | 52.17 | 57.39 | 5.22 | 82.35 |
| 3 | o1-mini | ~ | 90.43 | 84.00 | 67.83 | 80.87 | 5.22 | 76.52 | 66.09 | 41.74 | 51.30 | 6.09 | 75.05 |
| 4 | Gemini-2.0-Pro | ~ | 92.17 | 84.17 | 71.30 | 78.26 | 8.70 | 72.17 | 60.17 | 34.78 | 42.61 | 8.70 | 72.17 |
| 5 | DeepSeek-v3 | 671B | 89.57 | 85.04 | 72.17 | 76.52 | 5.22 | 66.09 | 58.26 | 34.78 | 37.39 | 12.17 | 71.65 |
| 6 | GLM-Zero-Preview | ~ | 92.17 | 83.13 | 65.22 | 77.39 | 6.09 | 66.96 | 60.00 | 35.65 | 44.35 | 9.57 | 71.57 |
| 7 | QwQ-32B-Preview | 32B | 91.30 | 83.83 | 60.87 | 79.13 | 5.22 | 62.61 | 58.96 | 30.43 | 42.61 | 7.83 | 71.40 |
| 8 | Claude-3.5-Sonnet | ~ | 88.70 | 80.35 | 63.48 | 73.04 | 6.09 | 70.43 | 61.39 | 35.65 | 42.61 | 8.70 | 70.87 |
| 9 | Qwen2.5-Max | ~ | 88.70 | 81.39 | 63.48 | 74.78 | 6.96 | 72.17 | 58.43 | 33.04 | 42.61 | 12.17 | 69.91 |
| 10 | Qwen2.5-72B-It | 72B | 87.83 | 81.04 | 62.61 | 76.52 | 6.09 | 66.09 | 58.43 | 29.57 | 40.00 | 10.43 | 69.74 |
| 11 | Qwen2.5-32B-It | 32B | 90.43 | 80.00 | 61.74 | 73.91 | 4.35 | 69.57 | 55.48 | 26.09 | 39.13 | 12.17 | 67.74 |
| 12 | GLM-4-Plus | ~ | 86.09 | 77.91 | 53.91 | 71.30 | 6.96 | 66.96 | 55.30 | 26.96 | 38.26 | 14.78 | 66.61 |
| 13 | o1-preview | ~ | 80.87 | 75.83 | 42.61 | 59.13 | 6.96 | 66.09 | 54.78 | 32.17 | 40.87 | 9.57 | 65.31 |
| 14 | GPT-4o | ~ | 83.48 | 76.70 | 57.39 | 63.48 | 6.09 | 61.74 | 50.09 | 20.00 | 32.17 | 13.04 | 63.40 |
| 15 | Phi-4 | 14B | 77.39 | 72.00 | 53.04 | 61.74 | 8.70 | 60.00 | 54.78 | 26.96 | 34.78 | 9.57 | 63.39 |

Citation

Please cite our paper if you use RV-Bench in your work:

@inproceedings{hong2026benchmarking,
    title = {Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions},
    author = {Hong, Zijin and Wu, Hao and Dong, Su and Dong, Junnan and Xiao, Yilin and Zhang, Yujing and Wang, Zhu and Huang, Feiran and Li, Linyi and Yang, Hongxia and Huang, Xiao},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
    year = {2026}
}

Feel free to reach out via email if you need any help:

zijin.hong@connect.polyu.hk
