Standardized QA evaluation framework for Theory-of-Mind and related benchmarks.
All included QA datasets are already normalized before evaluation.
Each sample should look like:
    {
      "story": "context text",
      "question": "question text",
      "answer": {
        "correct_answers": ["answer text"],
        "wrong_answers": ["wrong option 1", "wrong option 2"]
      },
      "meta": {
        "id": "optional-sample-id"
      }
    }
Rules:
- correct_answers is always a list.
- wrong_answers is empty for open QA.
- If wrong_answers is non-empty, the sample is treated as choice QA.
- Dataset-specific grouping fields live in meta.
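The rules above can be checked mechanically. The following is a minimal sketch, not the repository's loader: the function name `classify_sample` is hypothetical, and treating a missing `wrong_answers` key as an empty list is an assumption.

```python
from typing import Any


def classify_sample(sample: dict[str, Any]) -> str:
    """Return "open" or "choice" for a normalized sample; raise on schema errors.

    Hypothetical helper for illustration only.
    """
    for key in ("story", "question", "answer"):
        if key not in sample:
            raise ValueError(f"missing required field: {key}")

    answer = sample["answer"]
    correct = answer.get("correct_answers")
    # Assumption: a missing wrong_answers key is treated like an empty list.
    wrong = answer.get("wrong_answers", [])

    if not isinstance(correct, list):
        raise ValueError("correct_answers must be a list")
    if not isinstance(wrong, list):
        raise ValueError("wrong_answers must be a list")

    # A non-empty wrong_answers list marks the sample as choice QA.
    return "choice" if wrong else "open"
```

Running a check like this once at load time makes a schema violation fail early instead of surfacing later as a confusing metric.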
There is one shared evaluation pipeline:
predict → metric → tables
Shared logic in src/ handles:
- loading normalized data
- deterministic option shuffle
- two unified English prompt templates
- free-text prediction calls via ContentClient (create)
- structured LLM judge calls via StructureClient (parse)
- saving prediction.jsonl and metrics.json
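As an illustration of the last item, the sketch below writes one JSON object per line to prediction.jsonl and a single JSON document to metrics.json. It is an assumption about the layout, not the code in src/evaluation/storage.py; the names `save_outputs` and `load_predictions` are hypothetical.

```python
import json
from pathlib import Path


def save_outputs(exp_dir, predictions, metrics):
    """Write prediction.jsonl (one record per line) and metrics.json into exp_dir."""
    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)

    with open(exp_dir / "prediction.jsonl", "w", encoding="utf-8") as f:
        for record in predictions:
            # ensure_ascii=False keeps non-English text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    with open(exp_dir / "metrics.json", "w", encoding="utf-8") as f:
        json.dump(metrics, f, ensure_ascii=False, indent=2)


def load_predictions(exp_dir):
    """Read prediction.jsonl back into a list of dicts, skipping blank lines."""
    path = Path(exp_dir) / "prediction.jsonl"
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping predictions in a line-oriented file is what allows the metric stage to be re-run on an existing experiment without calling the model again.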
Dataset-specific logic stays in tasks/<dataset>/metrics.py.
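A dataset-specific metric is typically an aggregate over a grouping field stored in meta. The sketch below shows one possible shape for such a function. The function name `grouped_accuracy`, the record field `is_correct`, and the example group key `order` are all hypothetical and are not taken from the repository.

```python
from collections import defaultdict


def grouped_accuracy(records, group_key):
    """Compute accuracy per value of meta[group_key].

    Assumes each judged record carries a boolean "is_correct" field
    (hypothetical name) and a "meta" dict with the grouping field.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        # Records without the grouping field fall into an "unknown" bucket.
        group = rec.get("meta", {}).get(group_key, "unknown")
        totals[group] += 1
        hits[group] += int(bool(rec.get("is_correct")))
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    demo = [
        {"meta": {"order": "first"}, "is_correct": True},
        {"meta": {"order": "first"}, "is_correct": False},
        {"meta": {"order": "second"}, "is_correct": True},
    ]
    print(grouped_accuracy(demo, "order"))
```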
Datasets:
- BigToM
- EmoBench
- FanToM
- HiToM
- SocialIQA
- ToMBench
- Open QA: model outputs answer text.
- Single-choice QA: model outputs one option letter.
- Multi-choice QA: model outputs a list of option letters.
- All correctness is decided by the judge stage.
For choice QA, prediction records include the shuffled option mapping and gold letters so results are reproducible.
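One way to get a deterministic shuffle is to derive the random seed from the sample id, so the same sample always receives the same option order across runs. The sketch below is an assumption about how this could work, not the repository's implementation; `shuffle_options` and its `seed` parameter are hypothetical names.

```python
import hashlib
import random
import string


def shuffle_options(sample_id, correct, wrong, seed=0):
    """Shuffle options reproducibly; return (letter -> text mapping, gold letters).

    Assumes at most 26 options and that option texts are unique.
    """
    options = list(correct) + list(wrong)

    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would change the order between runs.
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    rng.shuffle(options)

    letters = string.ascii_uppercase[: len(options)]
    mapping = dict(zip(letters, options))
    gold = [letter for letter, text in mapping.items() if text in correct]
    return mapping, gold
```

Storing both `mapping` and `gold` in the prediction record means the judge and metric stages never need to re-run the shuffle to interpret a predicted letter.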
ToMEval/
|-- experiment_config.yaml
|-- run_all.py
|-- run_feedback.py
|-- run_filter.py
|-- requirements.txt
|-- src/
| |-- evaluation/
| | |-- __init__.py
| | |-- pipeline.py
| | |-- data.py
| | |-- prediction.py
| | |-- judge.py
| | |-- judge_schema.py
| | |-- prompts.py
| | |-- storage.py
| | |-- paths.py
| | |-- metrics.py
| | |-- task_metrics.py
| | `-- types.py
| |-- llm/
| | |-- client.py
| | `-- ...
| `-- dataloader/
|-- tasks/
| `-- <dataset>/
| |-- config.yaml
| |-- metrics.py
| `-- run.py
|-- datasets/ # normalized test datasets
|-- train_datasets/ # synthesized training datasets
|-- feedback/ # data synthesis module (bad case → diagnosis → synthesis)
| |-- config.yaml
| |-- README.md
| `-- ...
|-- filter/ # data quality evaluation module (V3 flywheel)
| |-- config.yaml
| |-- README.md
| |-- eval/
| |-- repair/
| `-- ...
|-- report/ # report generation scripts
| |-- config.yaml
| |-- generate_dataset_tables.py
| |-- generate_summary.py
| `-- generate_html_report.py
|-- tables/ # generated tables and reports
|-- results/ # evaluation results
|-- docs/ # documentation
`-- logs/ # log files
Install dependencies:
    pip install -r requirements.txt
Set model and path config in experiment_config.yaml, then run:
    python run_all.py
Or run one dataset:
    python tasks/BigToM/run.py
Run only prediction:
    python run_all.py --stage predict
Re-run only metrics on an existing experiment:
    python run_all.py --stage metric --exp-dir 20260515_120000
Generate tables:
    python report/generate_dataset_tables.py
    python report/generate_summary.py
To add a new dataset:
- Normalize the dataset into the standard schema.
- Add tasks/<dataset>/config.yaml.
- Add tasks/<dataset>/metrics.py if the dataset needs custom grouped metrics.
- Add tasks/<dataset>/run.py.
- Register the dataset name in run_all.py.