Vinayak Multistep Recursive Reasoning Benchmark (VMRRB)

A benchmark for evaluating advanced reasoning, recursive dependency resolution, and robustness capabilities of large language models in dynamic, noisy, and structurally challenging environments.


Benchmark Configuration

| Attribute | Specification |
|---|---|
| Version | 01 |
| Difficulty Level | Normal |
| Type | Reduced Evaluation Set |
| Question Count | 1,000 recursively dependent questions |
| Purpose | Public evaluation and baseline benchmarking |

Results: Reduced Evaluation Set (1,000 Questions)

| Rank | Model Name | Score |
|---|---|---|
| 1 | Google Gemini 3.1 Pro | 97.60% |
| 2 | DeepSeek Pro v4 | 79.10% |
| 3 | Kimi K2.6 Thinking | 78.90% |
| 4 | Anthropic Claude Sonnet 4.6 | 75.20% |
| 5 | GPT-5.4 Thinking | 68.00% |

Benchmark File Structure

```
vmrrb-benchmark/
├── README.md              ← methodology, benchmark specification
├── scripts/               ← scoring and evaluation scripts
├── test_prompt/           ← prompt used for testing and question & answer sheets
├── results/               ← detailed benchmark reports
├── raw_data_ai_response/  ← raw AI model outputs
├── future_work/           ← samples of future benchmark question & answer sheets (100K Q&A)
└── Leaderboard.csv        ← leaderboard rankings
```

Detailed benchmark reports, raw model outputs, and evaluation summaries are available in the results/ and raw_data_ai_response/ directories.


Overview

The Vinayak Multistep Recursive Reasoning Benchmark (VMRRB) is designed to evaluate advanced reasoning, recursive dependency resolution, and robustness capabilities in dynamic, noisy, and structurally challenging environments.

The benchmark architecture is designed to scale toward extremely large recursive workloads, including theoretical configurations containing trillions of interdependent questions. It dynamically generates a new recursive dataset for every evaluation run. No static database or fixed question set is used, ensuring each run is unique.

Preliminary evaluations on contemporary frontier language models show substantial performance degradation even on reduced benchmark configurations containing 1,000 recursively dependent questions.

The benchmark is specifically designed to test:

  • Recursive multistep reasoning
  • Dependency resolution across interconnected problems
  • Long-chain consistency
  • Robust semantic parsing
  • Execution efficiency under recursive workloads
  • Reliable computation under challenging input conditions

The benchmark intentionally relies on relatively simple arithmetic primitives. The primary difficulty emerges from recursive dependency resolution, noisy semantic parsing, execution ordering, long-chain consistency, and strict instruction-following constraints rather than advanced mathematical complexity.

The objective is not only to measure reasoning accuracy, but also to evaluate whether an AI system can maintain speed, consistency, correctness, and structural reliability simultaneously.


Table of Contents

  • Introduction
  • Benchmark Objectives
  • Benchmark Design
  • Evaluation and Scoring Methods
  • Dataset
  • Future Work
  • Full Benchmark Configuration
  • Raw Model Outputs

Introduction

Modern frontier AI systems demonstrate strong performance on conventional reasoning benchmarks but frequently degrade under recursive dependency resolution, noisy semantic parsing, and long-chain execution constraints.

VMRRB is designed to evaluate whether a model can maintain correctness, consistency, and structured execution while solving recursively interconnected mathematical tasks embedded within challenging environments.

The benchmark combines:

  • Recursive dependency chains
  • Arithmetic reasoning
  • Semantic noise injection
  • Structured parsing constraints
  • Strict output formatting requirements

Each question may depend on answers from one or more previous questions. Models must correctly resolve dependency chains recursively before computing the final result.

In addition, benchmark prompts intentionally contain random irrelevant tokens and corrupted text fragments. Models are expected to recover the intended mathematical meaning while ignoring semantically meaningless content.


Benchmark Objectives

Below is a structured list of the capabilities the benchmark aims to evaluate:

| Main Category | What Is Being Tested |
|---|---|
| Multistep Recursive Reasoning | Solving recursively dependent, interconnected problems through chained reasoning |
| Dependency & Structural Resolution | Resolving nested references and understanding hierarchical dependency structure in correct evaluation order |
| Mathematical & Compositional Reasoning | Correctly combining arithmetic, symbolic substitution, parsing, memory, and recursion during computation |
| Robust Semantic Parsing | Recovering intended mathematical meaning from noisy, ambiguous, corrupted, or adversarial text while ignoring irrelevant content |
| Context & Long-Chain Consistency | Retaining intermediate results and ensuring consistency across deep reasoning chains |
| Instruction & Rule Following | Strictly adhering to procedural, execution, and output-format constraints |
| Recursive Planning & Execution | Planning dependency resolution strategy and executing computations systematically |

The benchmark is intended to stress-test reasoning systems beyond conventional single-step mathematical evaluation tasks.


Benchmark Design

Mathematical Operations

The benchmark uses a constrained set of arithmetic operations designed to isolate reasoning, dependency resolution, and semantic robustness from advanced mathematical complexity.

Supported operations include:

| Operation | Description |
|---|---|
| Addition | Arithmetic summation |
| Subtraction | Arithmetic difference |
| Multiplication | Arithmetic product |
| Division | Arithmetic quotient |
| Rounding | Numeric rounding transformation |
| Floor Function | Largest integer less than or equal to input |
| Ceiling Function | Smallest integer greater than or equal to input |
| Modulo | Remainder-based arithmetic operation |
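
For concreteness, all eight operations map directly onto standard Python primitives. The mapping below is a sketch for illustration only; the operation names are descriptive labels, not the benchmark's internal identifiers.

```python
import math

# Illustrative mapping of the supported operations onto Python primitives.
OPERATIONS = {
    "add":      lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide":   lambda a, b: a / b,
    "modulo":   lambda a, b: a % b,
    "round":    lambda a: round(a),        # nearest integer
    "floor":    lambda a: math.floor(a),   # largest integer <= a
    "ceil":     lambda a: math.ceil(a),    # smallest integer >= a
}
```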

The benchmark intentionally relies on relatively simple arithmetic primitives. The primary challenge arises from recursive dependency resolution, noisy semantic parsing, long-chain consistency, and strict instruction-following constraints rather than advanced mathematical difficulty.

Recursive Dependency Structure

Questions are recursively interconnected through dependency references.

Example:

Question [10]: Compute the product of [ Answer 4 ] and 3.56

To solve Question 10, a model must first recursively solve Question 4 before computing the final answer.

The benchmark evaluates whether models can:

  • Build dependency chains correctly
  • Resolve nested references in proper order
  • Preserve intermediate state across long reasoning sequences
  • Maintain consistency throughout recursive execution
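
The expected resolution behavior can be sketched as a depth-first traversal with memoization. The question encoding below (an operation paired with operands that are either literals or string references to earlier answers) is illustrative only, not the benchmark's actual schema.

```python
def solve(qid, questions, cache=None):
    """Recursively resolve all dependencies of `qid`, then compute its answer."""
    if cache is None:
        cache = {}
    if qid in cache:                      # reuse previously resolved answers
        return cache[qid]
    op, operands = questions[qid]
    values = [solve(ref, questions, cache) if isinstance(ref, str) else ref
              for ref in operands]        # resolve references depth-first
    cache[qid] = op(*values)
    return cache[qid]

# Mirrors the example above: Q10 = [Answer 4] * 3.56
questions = {
    "Q4":  (lambda: 7.0, []),             # a base question with no dependencies
    "Q10": (lambda a, b: a * b, ["Q4", 3.56]),
}
print(solve("Q10", questions))            # ≈ 24.92
```

The cache matters: without memoization, a question referenced by many others would be recomputed once per reference, and any inconsistency between recomputations would break long-chain consistency.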

Dynamic Dataset Generation

The benchmark uses dynamically generated recursive datasets rather than relying on a fixed static database.

For each benchmark run, a new set of recursively interconnected questions, dependency structures, arithmetic compositions, and semantic noise patterns can be generated automatically. This ensures that individual evaluation runs remain structurally unique and reduces the risk of memorization-based optimization or overfitting to static benchmark content.

The dynamic generation framework enables:

  • Unique recursive dependency graphs for each run
  • Variable noise injection patterns
  • Adjustable dependency depth and structural complexity
  • Scalable automatic benchmark creation
  • Reduced dataset memorization risk
  • More reliable evaluation of genuine reasoning capability

Because benchmark instances can be generated procedurally, VMRRB is designed to support scalable evaluation workloads ranging from small public benchmark subsets to extremely large recursive reasoning stress tests.
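
A minimal sketch of the procedural-generation idea follows, using the same question encoding as the resolver sketch above (operation names keyed to the illustrative OPERATIONS mapping). The parameter names (`n_questions`, `dependency_rate`) are assumptions for illustration, not the benchmark's actual configuration knobs.

```python
import random

def generate_dataset(n_questions, dependency_rate=0.5, seed=None):
    """Procedurally build an acyclic set of recursively dependent questions."""
    rng = random.Random(seed)
    questions = {}
    for i in range(1, n_questions + 1):
        if i == 1 or rng.random() > dependency_rate:
            # Base question: a plain literal with no dependencies.
            questions[f"Q{i}"] = ("value", [round(rng.uniform(1, 100), 2)])
        else:
            # Dependent question: references a randomly chosen *earlier* answer,
            # which keeps the dependency graph acyclic by construction.
            dep = f"Q{rng.randint(1, i - 1)}"
            op = rng.choice(["add", "subtract", "multiply", "divide"])
            questions[f"Q{i}"] = (op, [dep, round(rng.uniform(1, 10), 2)])
    return questions

# A fresh seed yields a structurally unique dependency graph on every run.
print(generate_dataset(5, seed=42))
```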

Noise Injection Strategy

Benchmark prompts intentionally include random irrelevant tokens, malformed fragments, and semantically meaningless text.

Example:

Determine the sum of xQAbc [ Answer 5 ] and two point three yzLm

Models are expected to:

  • Recover the intended mathematical structure
  • Ignore semantically irrelevant noise
  • Preserve valid operators and dependency references
  • Avoid corruption of execution flow due to adversarial text
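
The injection strategy can be sketched as scattering meaningless tokens through a clean prompt while leaving operators and `[ Answer N ]` references intact. The token shape and density below are assumptions for illustration, not the benchmark's actual noise parameters.

```python
import random
import re
import string

def inject_noise(prompt, density=0.3, seed=None):
    """Insert meaningless filler tokens without corrupting dependency references."""
    rng = random.Random(seed)
    # Treat references like "[ Answer 5 ]" as atomic tokens so the injected
    # noise lands beside them, never inside them.
    tokens = re.findall(r"\[\s*Answer\s+\d+\s*\]|\S+", prompt)
    noisy = []
    for tok in tokens:
        noisy.append(tok)
        if rng.random() < density:
            junk = "".join(rng.choices(string.ascii_letters, k=rng.randint(3, 5)))
            noisy.append(junk)            # semantically meaningless filler
    return " ".join(noisy)

print(inject_noise("Determine the sum of [ Answer 5 ] and two point three", seed=7))
```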

Output Constraints

Models must strictly follow predefined output formatting rules.

Outputs are evaluated not only for mathematical correctness, but also for:

  • Correct dependency resolution
  • Proper execution ordering
  • Output structure compliance
  • Consistency across all generated answers

Evaluation and Scoring Methods

Model outputs are evaluated through end-to-end answer correctness against the benchmark ground-truth dataset.

Numerical answers are compared using configurable decimal-place tolerance matching to reduce sensitivity to insignificant floating-point formatting differences.

Current benchmark evaluations use a 1-decimal-place comparison threshold.
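
A minimal sketch of that tolerance check, assuming answers are compared after rounding both sides to the configured precision. The function name is illustrative; the actual scoring code lives in the scripts/ directory.

```python
def answers_match(predicted, expected, decimal_places=1):
    """Compare answers after rounding both to the configured precision."""
    return round(predicted, decimal_places) == round(expected, decimal_places)

print(answers_match(24.92, 24.94))  # True:  both round to 24.9
print(answers_match(24.92, 25.01))  # False: 24.9 vs 25.0
```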

| Metric | Description |
|---|---|
| Accuracy | Percentage of correctly solved questions |

Because benchmark questions are recursively interconnected, recursive reasoning accuracy, dependency resolution, parsing robustness, and execution consistency are evaluated implicitly through final-answer correctness rather than through independently scored sub-metrics.


Dataset

The benchmark dataset is synthetically generated and consists of recursively interconnected mathematical problems containing controlled semantic noise and dependency references.

Difficulty levels scale through:

  • Dependency depth
  • Recursive graph complexity
  • Noise density
  • Total question count
  • Structural dependency length

The benchmark supports scalable configurations ranging from small evaluation subsets to extremely large recursive workloads.


Future Work

Future benchmark expansions may include:

  • Larger recursive dependency graphs
  • Multilingual challenging reasoning tasks
  • Symbolic reasoning extensions
  • Adaptive difficulty scaling
  • Automated benchmark generation pipelines
  • Expanded robustness evaluation methodologies

VMRRB already includes larger-scale benchmark configurations containing 10K, 100K, and 1M recursively interconnected questions. Preliminary evaluations indicate substantial performance and scalability challenges for current frontier language models on these larger recursive workloads.

Sample question and answer sheets for future large-scale benchmark configurations are available in the future_work/ directory, which currently includes 100K-question benchmark samples. These files demonstrate large-scale recursive dependency structures and the scalable generation capabilities planned for future benchmark expansions.

Future development may involve collaboration with frontier AI research labs to evaluate large-scale recursive reasoning capabilities, scalability limits, and robustness under increasingly complex dependency structures.

The benchmark will continue evolving to evaluate increasingly complex recursive reasoning and execution capabilities in large-scale AI systems.


Full Benchmark Configuration

The complete VMRRB framework is designed to support significantly larger recursive dependency graphs.

Full-scale theoretical configuration

| Attribute | Specification |
|---|---|
| Difficulty Level | Hard |
| Type | Full Evaluation Set |
| Question Count | 10 trillion recursively interconnected questions |
| Purpose | Large-scale recursive reasoning stress testing and scalability research |

The 10-trillion-question configuration is intended as an extreme scalability target for evaluating recursive dependency resolution, long-chain consistency, memory robustness, and execution planning under massive recursive workloads.

Due to current practical limitations involving context windows, inference cost, execution time, and memory constraints, full-scale evaluations at this size are currently theoretical and experimental rather than standard public benchmark runs.


Raw Model Outputs

Raw AI model responses used during benchmark evaluation are provided for transparency, reproducibility, and independent analysis.

| Model | Response Link |
|---|---|
| Google Gemini 3.1 Pro | View Response |
| DeepSeek Pro v4 | View Response |
| Kimi K2.6 Thinking | View Response |
| Anthropic Claude Sonnet 4.6 | View Response |
| GPT-5.4 Thinking | View Response |
