Source code for PhyGround, a benchmark for the physical plausibility of text+image-to-video (ti2v) generations. This repo contains everything needed to reproduce the benchmark end-to-end: prompt curation, VLM-as-judge evaluation, LoRA judge training, and the Flask web app used to collect human ratings.
Companion artifacts:
| Artifact | Where |
|---|---|
| Dataset | 🤗 NU-World-Model-Embodied-AI/phyground |
| Judge model | 🤗 Phyjudge-9B |
| Paper (rubric, methodology, results) | PhyGround |
Each video is rated on a 1–5 ordinal scale along three families of dimensions:
General (always rated, 3 dims)
| Code | Key | Question |
|---|---|---|
| G1 | persistence | Do objects keep consistent appearance, shape, and existence? |
| G2 | PTV | Is the temporal order of physical events plausible? |
| G3 | SA | Does the video align with the text prompt? |
Physical-law sub-rubric (13 laws, only the laws that apply to the prompt are rated)
| Domain | Laws |
|---|---|
| A. Solid-Body Mechanics | gravity, inertia, momentum, impenetrability, collision, material |
| B. Fluid Dynamics | buoyancy, displacement, flow_dynamics, boundary_interaction, fluid_continuity |
| C. Optics | reflection, shadow |
The single source of truth for these definitions (English + Chinese, plus
sub-question decompositions used for chain-of-thought judging) is
evals/physics_criteria.py and
evals/sub_questions.py.
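
As a rough illustration, the rubric tables above could be represented in code along the following lines. The names `GENERAL_DIMENSIONS`, `LAW_DOMAINS`, `ALL_LAWS`, and `domain_of` are hypothetical and are not taken from evals/physics_criteria.py, whose actual structure may differ.

```python
# Hypothetical encoding of the rubric tables; the authoritative definitions
# live in evals/physics_criteria.py and may be organised differently.
GENERAL_DIMENSIONS = {"G1": "persistence", "G2": "PTV", "G3": "SA"}

LAW_DOMAINS = {
    "solid_body_mechanics": [
        "gravity", "inertia", "momentum",
        "impenetrability", "collision", "material",
    ],
    "fluid_dynamics": [
        "buoyancy", "displacement", "flow_dynamics",
        "boundary_interaction", "fluid_continuity",
    ],
    "optics": ["reflection", "shadow"],
}

# Flat list of all law keys, in table order.
ALL_LAWS = [law for laws in LAW_DOMAINS.values() for law in laws]


def domain_of(law: str) -> str:
    """Return the domain a law belongs to, or raise KeyError if unknown."""
    for domain, laws in LAW_DOMAINS.items():
        if law in laws:
            return domain
    raise KeyError(f"unknown law: {law}")
```

A lookup such as `domain_of("buoyancy")` returns `"fluid_dynamics"`, and `len(ALL_LAWS)` matches the 13 laws listed in the table.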
dataprocessing/
  common/                 # Vertex AI / OpenAI client helpers, video-id utilities,
                          # batched-pipeline runner with quality checks
  refine/                 # Prompt-set construction:
                          #   enhance_prompts_physics.py: Gemini-aided physics-aware rewriting
                          #   gen_humaneval_set.py: sampler for the human-eval subset
                          #                         (disabled stub in this release)
evals/
  vlm_eval.py             # CLI entry point: scores one --video_dir of *.mp4
  vlm_common.py           # Backends: OpenAI-compatible vLLM, Gemini, GPT,
                          # Claude (Vertex AI), plus response parsing +
                          # summary printing
  eval_types.py           # Typed result containers for VLM-as-judge runs
  physics_criteria.py     # 13 physical laws (EN + ZH) + human-eval rubric definitions
  sub_questions.py        # Per-law observational sub-questions for CoT / SubQ prompts
  prompts/                # 5 judge prompt templates + PromptConfig loader:
                          #   default.yaml: direct 1-5 score, JSON-only output
                          #   cotnosubq.yaml: chain-of-thought, no sub-questions
                          #   cot-subq.yaml: CoT + observational sub-questions
                          #   subq+answer.yaml: sub-questions answered yes/no/uncertain
                          #   subq+human.yaml: human-style sub-questions
  human_eval/             # Flask app: assignment, rating UI, coverage reports,
                          # alignment checks, tests, templates, static assets
scripts/
  serve_judge.sh          # Launch phyjudge LoRA on a vLLM OpenAI-compatible server
  score_videos.sh         # One-click: serve_judge.sh + evals.vlm_eval (local LoRA path)
  score_videos_api.sh     # Same, but routes to Gemini / GPT / Claude cloud APIs
judge_training/
  data/                   # Build ms-swift SFT data from raw judgement logs:
                          #   schema.py, sample.py, naming.py, prompt_config.py
                          #   build_records_from_db.py: aggregate human ratings
                          #   build_from_claude_cot.py: convert Claude CoT logs
                          #   build_swift_data.py: write/split/validate JSONL
The full five-stage pipeline (prompt curation → video generation → VLM-as-judge evaluation → human annotation → judge LoRA training) is documented in
PIPELINE.md. The section below covers the two stages most users will run themselves: generating videos for the benchmark prompts, and scoring them.
Use any ti2v model on the curated prompt set;
the dataset card lists the eight models we ran (wan2.2-ti2v-5b,
ltx-2-19b-dev, cosmos-predict2.5-{2b,14b}, veo-3.1,
wan2.2-i2v-a14b, omniweaving, ltx-2.3-22b-dev).
Each prompt in prompts/phyground.json ships with a corresponding first-frame
conditioning image under first_images/ on the HF dataset. Feed
(text_prompt, first_image) to your ti2v model and save the result as
videos/<video_id>.mp4, where <video_id> matches the video field of the
prompts JSON entry. The scorer pairs videos to prompts by that filename stem.
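
Before scoring, it can help to check which prompts still lack a generated video. The sketch below pairs prompt entries with `*.mp4` files by filename stem, as described above. The function name `pair_videos_to_prompts` is hypothetical, and the sketch assumes each entry is a dict with a `video` field; it is not the scorer's own matching code.

```python
from pathlib import Path


def pair_videos_to_prompts(entries, video_dir):
    """Pair prompt entries with <video_dir>/<video_id>.mp4 by filename stem.

    `entries` is assumed to be an iterable of dicts with a "video" field.
    Returns (paired, missing): a dict video_id -> (path, entry) and a list
    of video ids that have no matching file.
    """
    by_stem = {path.stem: path for path in Path(video_dir).glob("*.mp4")}
    paired = {}
    missing = []
    for entry in entries:
        video_id = entry["video"]
        if video_id in by_stem:
            paired[video_id] = (by_stem[video_id], entry)
        else:
            missing.append(video_id)
    return paired, missing
```

If prompts/phyground.json turns out to be a JSON list of such entries (an assumption, not confirmed here), `json.load` on the file followed by `pair_videos_to_prompts(entries, "videos")` would report which ids are still missing.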
Three commands to a scored JSON.
# 1. Install the eval runner (one-stop extra: all backends + HF CLI).
pip install -e ".[eval]"
# Plus a system-level ffmpeg if you'll use the local vLLM judge:
# apt-get install ffmpeg / brew install ffmpeg
# 2. Pull the benchmark prompts (250 entries) and first-frame images.
huggingface-cli download --repo-type dataset \
NU-World-Model-Embodied-AI/phyground \
--include "prompts/phyground.json" "first_images/*" \
--local-dir ./data
# → data/prompts/phyground.json
# → data/first_images/*.png
# 3. Score every videos/*.mp4 with the released phyjudge LoRA via vLLM.
pip install "vllm>=0.6"
bash scripts/score_videos.sh \
--video_dir ./videos \
--save_path ./scores.json

The wrapper starts vLLM in the background (base Qwen/Qwen3.5-9B + LoRA
adapter NU-World-Model-Embodied-AI/phyjudge-9B, as recorded in the
model card), waits for /health, runs python -m evals.vlm_eval against every
*.mp4 under --video_dir, and tears the server down on exit. Override
the base or adapter with PHYJUDGE_BASE=… / PHYJUDGE_LORA=… if you've
mirrored them locally.
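
If you drive the server yourself instead of using the wrapper, the "wait for /health" step can be approximated with a small polling loop. This is a minimal sketch, not the logic in scripts/score_videos.sh: `http_ok` and `wait_until_ready` are hypothetical helpers, and the timeout and interval values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all mean "not ready".
        return False


def wait_until_ready(probe, timeout_s=600.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `timeout_s` elapses.

    Returns True once the probe succeeds, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))` would block until the server answers; the host and port here are assumptions and should match however the server was launched.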
scores.json schema (an example object appears at the end of this README):
To reproduce the closed-source baselines you don't need a GPU: use
scripts/score_videos_api.sh, which talks to a cloud model directly:
# Gemini (AI Studio key, the fastest path)
GEMINI_API_KEY=… bash scripts/score_videos_api.sh \
--video_dir ./videos --save_path ./scores_gemini.json
# OpenAI GPT
JUDGE_BACKEND=gpt OPENAI_API_KEY=… bash scripts/score_videos_api.sh \
--video_dir ./videos --save_path ./scores_gpt.json
# Claude on Vertex AI (uses gcloud default credentials)
JUDGE_BACKEND=claude GCP_PROJECT=my-gcp-project \
bash scripts/score_videos_api.sh \
--video_dir ./videos --save_path ./scores_claude.json

The closed-source backends sample N frames per video (default 32) and
send them as images. Override the model name (JUDGE_MODEL=gpt-5.4,
JUDGE_MODEL=gemini-3.1-pro-preview, …), prompt template
(PROMPT_CONFIG=cotnosubq.yaml), or any other evals.vlm_eval flag by
passing it after the script name; see python -m evals.vlm_eval --help.
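
How the N frames are chosen is not stated here. One common approach is to spread them evenly over the clip, which the sketch below illustrates. The function `sample_frame_indices` is hypothetical, and the actual backends in evals/vlm_common.py may select frames differently.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 32) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Assumes uniform sampling that includes the first and last frame; short
    clips (fewer frames than requested) return every frame once.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if total_frames <= num_samples:
        return list(range(total_frames))
    if num_samples == 1:
        return [total_frames // 2]
    last = total_frames - 1
    # Integer arithmetic keeps the indices exact and strictly increasing.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]
```

For a 120-frame clip this yields 32 distinct indices running from 0 to 119.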
The five prompt templates under evals/prompts/ are A/B-comparable: they
share the same scoring keys but differ in whether they elicit
chain-of-thought reasoning and/or intermediate yes/no answers to per-law
sub-questions. The released phyjudge LoRA was fine-tuned against
default.yaml's training_prompts, which is why scripts/score_videos.sh
passes --use_training_prompts by default.
If you just want to score videos with the released judge, jump to the
VLM-as-judge evaluation section above: scripts/score_videos.sh is the
one-click runner. The transformers/peft path documented on the
model card
remains supported for users who'd rather load the LoRA in-process instead
of going through vLLM.
The repo ships a pyproject.toml with extras grouped by
use case, so you only install what you need.
# Minimum: just the prompt-template loader and project source
pip install -e .
# Pick any subset of the extras:
pip install -e ".[eval]" # One-stop for scripts/score_videos*.sh
# (all backend clients + decord + HF CLI)
pip install -e ".[web]" # Flask annotation app
pip install -e ".[inference]" # In-process judge inference (transformers, peft)
pip install -e ".[training]" # ms-swift + deepspeed for judge fine-tuning
pip install -e ".[gemini,claude,openai]" # Per-API subsets, if you don't want [eval]
pip install -e ".[test]" # pytest for the human-eval tests
# Or grab everything in one go
pip install -e ".[all]"

Reproducing the table from the paper requires the dataset (videos + human ratings) and one of the released judges. The steps are:
- Pull the dataset from Hugging Face (link above) so that data/{prompts/phyground.json,first_images/,annotations/} exist locally (see the VLM-as-judge evaluation section).
- Run the judge of your choice on every (video, prompt-template) pair: scripts/score_videos.sh for the released LoRA, or scripts/score_videos_api.sh for a closed-source baseline. Vary PROMPT_CONFIG=… across default.yaml, cotnosubq.yaml, cot-subq.yaml, subq+answer.yaml, subq+human.yaml to reproduce the prompt-template ablations.
- Aggregate per-model means and compute agreement against the human ratings in data/annotations/annotator_*.json.
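
The last step can be sketched as follows. The agreement metric is not specified in this README, so the example uses Spearman rank correlation as one common choice for ordinal ratings; it is not necessarily the metric reported in the paper. All function names are hypothetical, and loading the judge and human scores into aligned lists is left to the reader because the annotation file layout is not described here.

```python
import math


def model_means(scores_by_model):
    """Mean score per video model, given {model_name: [scores, ...]}."""
    return {m: sum(s) / len(s) for m, s in scores_by_model.items() if s}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(judge_scores, human_scores):
    """Spearman rank correlation between two aligned score lists."""
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

Both lists passed to `spearman` must cover the same videos in the same order; how multiple annotators are combined into one human score per video is a separate decision this sketch does not make.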
The released LoRA adapter targets the default.yaml template (and that
template's training_prompts block, which is why
scripts/score_videos.sh always passes --use_training_prompts).
Chain-of-thought and sub-question variants exist for ablations and
require re-running the relevant tables under a different
PROMPT_CONFIG.
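
One way to organise the template ablation is to build one invocation per template and run them in turn. The sketch below only constructs the environment overrides and argument lists. The helper `ablation_runs` and the `scores_<template>.json` naming are hypothetical, while the script path, flags, and template names are the ones listed above.

```python
PROMPT_CONFIGS = [
    "default.yaml",
    "cotnosubq.yaml",
    "cot-subq.yaml",
    "subq+answer.yaml",
    "subq+human.yaml",
]


def ablation_runs(video_dir="./videos",
                  script="scripts/score_videos_api.sh",
                  out_dir="."):
    """Build (env_overrides, argv) pairs, one per prompt template.

    The output file naming is an illustrative choice, not a repo convention.
    """
    runs = []
    for config in PROMPT_CONFIGS:
        stem = config[: -len(".yaml")]
        env = {"PROMPT_CONFIG": config}
        argv = [
            "bash", script,
            "--video_dir", video_dir,
            "--save_path", f"{out_dir}/scores_{stem}.json",
        ]
        runs.append((env, argv))
    return runs
```

Each pair can then be executed with `subprocess.run(argv, env={**os.environ, **env}, check=True)`, provided the API credentials for the chosen backend are already set in the environment.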
Paper: arXiv:2605.10806
@misc{lin2026phygroundbenchmarkingphysicalreasoning,
title={PhyGround: Benchmarking Physical Reasoning in Generative World Models},
author={Juyi Lin and Arash Akbari and Yumei He and Lin Zhao and Haichao Zhang and Arman Akbari and Xingchen Xu and Zoe Y. Lu and Enfu Nan and Hokin Deng and Edmund Yeh and Sarah Ostadabbas and Yun Fu and Jennifer Dy and Pu Zhao and Yanzhi Wang},
year={2026},
eprint={2605.10806},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.10806},
}
Example scores.json output:
{
  "meta": { "evaluator": "qwen9b", "video_model": "videos", ... },
  "num_videos": 250,
  "general_dimensions": ["SA", "PTV", "persistence"],
  "results": [
    {
      "video": "ball_fall_0001",
      "SA": 4, "PTV": 5, "persistence": 5,
      "general_avg": 4.67,
      "physical": {
        "laws": { "gravity": { "score": 4, "status": "scored" } },
        "avg": 4.0,
        "coverage": 1.0
      },
      "prompt": "...",
      "physical_laws": ["gravity", "collision"]
    }
  ]
}
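
For downstream analysis, the fields in this object can be read back with a few lines of Python. The sketch below recomputes the per-video averages and a per-dimension summary from the keys shown in the example. The helper names are hypothetical, and treating `general_avg` as the rounded mean of the three general dimensions is an assumption that happens to match the example values.

```python
from statistics import mean


def general_avg(result, dims=("SA", "PTV", "persistence")):
    """Mean of the general dimensions for one result, rounded to 2 decimals."""
    return round(mean(result[d] for d in dims), 2)


def physical_avg(result):
    """Mean over laws whose status is "scored"; None if no law was scored."""
    scores = [
        law["score"]
        for law in result["physical"]["laws"].values()
        if law.get("status") == "scored"
    ]
    return round(mean(scores), 2) if scores else None


def summarize(scores):
    """Per-dimension mean across all results in a parsed scores.json object."""
    return {
        dim: round(mean(r[dim] for r in scores["results"]), 2)
        for dim in scores["general_dimensions"]
    }
```

With the parsed file in `scores` (for example via `json.load`), `summarize(scores)` gives one mean per general dimension. It raises `statistics.StatisticsError` when `results` is empty, so guard for that case if partial runs are possible.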