ToMEval

Standardized QA evaluation framework for Theory-of-Mind and related benchmarks.

What This Repo Assumes

Every included QA dataset is expected to be normalized into the schema below before evaluation.

Each sample should look like:

{
  "story": "context text",
  "question": "question text",
  "answer": {
    "correct_answers": ["answer text"],
    "wrong_answers": ["wrong option 1", "wrong option 2"]
  },
  "meta": {
    "id": "optional-sample-id"
  }
}

Rules:

correct_answers is always a list.
wrong_answers is empty for open QA.
If wrong_answers is non-empty, the sample is treated as choice QA.
Dataset-specific grouping fields live in meta.
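
The rules above can be checked mechanically. The following is a minimal sketch, not the repo's own loader: the function name classify_sample is hypothetical, and telling single-choice from multi-choice by the number of correct answers is an assumption that the rules do not state.

```python
from typing import Any


def classify_sample(sample: dict[str, Any]) -> str:
    """Return "open", "single_choice" or "multi_choice" for a normalized sample."""
    for key in ("story", "question", "answer"):
        if key not in sample:
            raise ValueError(f"missing required field: {key}")

    answer = sample["answer"]
    correct = answer.get("correct_answers")
    wrong = answer.get("wrong_answers", [])

    # Rule: correct_answers is always a list.
    if not isinstance(correct, list):
        raise ValueError("correct_answers must be a list")
    if not isinstance(wrong, list):
        raise ValueError("wrong_answers must be a list")

    # Rule: an empty wrong_answers list means open QA.
    if not wrong:
        return "open"

    # Assumption: one correct answer means single-choice, several mean multi-choice.
    return "single_choice" if len(correct) == 1 else "multi_choice"
```

Under these assumptions, a sample with one correct answer and two wrong answers is classified as "single_choice", and the same sample with wrong_answers set to [] is classified as "open".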

Evaluation Design

There is one shared evaluation pipeline with three stages:

predict
metric
tables

Shared logic in src/ handles:

loading normalized data
deterministic option shuffle
two unified English prompt templates
free-text prediction calls via ContentClient (create)
structured LLM judge calls via StructureClient (parse)
saving prediction.jsonl and metrics.json
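
To illustrate what a deterministic option shuffle can look like, here is a minimal sketch. It is not the code in src/: the function name shuffle_options, the seed parameter, and the idea of seeding from a hash of the sample id are all assumptions made for illustration.

```python
import hashlib
import random
import string


def shuffle_options(
    sample_id: str, correct: list[str], wrong: list[str], seed: int = 0
) -> tuple[dict[str, str], list[str]]:
    """Shuffle options reproducibly; return (letter -> option text, gold letters)."""
    options = [(text, True) for text in correct] + [(text, False) for text in wrong]
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 options")

    # Seed from a stable hash. Python's built-in hash() is salted per process,
    # so it would not give the same order across runs.
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    rng.shuffle(options)

    letters = string.ascii_uppercase[: len(options)]
    mapping = {letter: text for letter, (text, _) in zip(letters, options)}
    gold = [letter for letter, (_, is_gold) in zip(letters, options) if is_gold]
    return mapping, gold
```

Calling the function twice with the same sample id and seed returns the same mapping and gold letters, which is the property that makes stored predictions comparable across runs.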

Dataset-specific logic stays in tasks/<dataset>/metrics.py.

Included Tasks

BigToM
EmoBench
FanToM
HiToM
SocialIQA
ToMBench

Output Behavior

Open QA: model outputs answer text.
Single-choice QA: model outputs one option letter.
Multi-choice QA: model outputs a list of option letters.
All correctness is decided by the judge stage.

For choice QA, prediction records include the shuffled option mapping and gold letters so results are reproducible.
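
As a rough illustration of such a record, the sketch below builds one line of a JSON Lines file. The field names (id, type, prediction, options, gold_letters) and both helper functions are hypothetical; the actual layout of prediction.jsonl may differ.

```python
import json
from typing import Any, Optional


def build_prediction_record(
    sample_id: str,
    question_type: str,
    model_output: Any,
    option_mapping: Optional[dict[str, str]] = None,
    gold_letters: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Build one prediction record; field names here are illustrative only."""
    record: dict[str, Any] = {
        "id": sample_id,
        "type": question_type,
        "prediction": model_output,
    }
    # Choice QA only: keep the shuffled mapping and gold letters so the
    # judge stage can be re-run without reshuffling.
    if option_mapping is not None:
        record["options"] = option_mapping
        record["gold_letters"] = list(gold_letters or [])
    return record


def append_jsonl(path: str, record: dict[str, Any]) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Open QA records built this way carry only the answer text, while choice QA records also carry the letter-to-option mapping used in the prompt.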

Repo Layout

ToMEval/
|-- experiment_config.yaml
|-- run_all.py
|-- run_feedback.py
|-- run_filter.py
|-- requirements.txt
|-- src/
|   |-- evaluation/
|   |   |-- __init__.py
|   |   |-- pipeline.py
|   |   |-- data.py
|   |   |-- prediction.py
|   |   |-- judge.py
|   |   |-- judge_schema.py
|   |   |-- prompts.py
|   |   |-- storage.py
|   |   |-- paths.py
|   |   |-- metrics.py
|   |   |-- task_metrics.py
|   |   `-- types.py
|   |-- llm/
|   |   |-- client.py
|   |   `-- ...
|   `-- dataloader/
|-- tasks/
|   `-- <dataset>/
|       |-- config.yaml
|       |-- metrics.py
|       `-- run.py
|-- datasets/                  # normalized test datasets
|-- train_datasets/            # synthesized training datasets
|-- feedback/                  # data synthesis module (bad case → diagnosis → synthesis)
|   |-- config.yaml
|   |-- README.md
|   `-- ...
|-- filter/                    # data quality assessment module (V3 flywheel)
|   |-- config.yaml
|   |-- README.md
|   |-- eval/
|   |-- repair/
|   `-- ...
|-- report/                    # report generation scripts
|   |-- config.yaml
|   |-- generate_dataset_tables.py
|   |-- generate_summary.py
|   `-- generate_html_report.py
|-- tables/                    # generated tables and reports
|-- results/                   # evaluation results
|-- docs/                      # documentation
`-- logs/                      # log files

Quick Start

Install dependencies:

pip install -r requirements.txt

Set model and path config in experiment_config.yaml, then run:

python run_all.py

Or run one dataset:

python tasks/BigToM/run.py

Run only prediction:

python run_all.py --stage predict

Re-run only metrics on an existing experiment:

python run_all.py --stage metric --exp-dir 20260515_120000

Generate tables:

python report/generate_dataset_tables.py
python report/generate_summary.py

Adding a New Dataset

Normalize the dataset into the standard schema.
Add tasks/<dataset>/config.yaml.
Add tasks/<dataset>/metrics.py if the dataset needs custom grouped metrics.
Add tasks/<dataset>/run.py.
Register the dataset name in run_all.py.

See docs/add_new_dataset.md.
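
For a sense of what a custom grouped metric might involve, here is a minimal sketch of per-group accuracy. It assumes each judged record carries a boolean correct field and keeps its grouping field under meta; both the record shape and the function name grouped_accuracy are hypothetical, not the interface expected by tasks/<dataset>/metrics.py.

```python
from collections import defaultdict
from typing import Any


def grouped_accuracy(records: list[dict[str, Any]], group_key: str) -> dict[str, float]:
    """Compute accuracy per value of meta[group_key]."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in records:
        # Records without the grouping field are pooled under "unknown".
        group = record.get("meta", {}).get(group_key, "unknown")
        totals[group] += 1
        hits[group] += int(bool(record.get("correct")))
    return {group: hits[group] / totals[group] for group in totals}
```

A dataset with, say, an order field in meta could then report one accuracy per order value alongside the overall score.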
