DeltaTok compresses the frame-to-frame change in vision foundation model features into a single delta token, enabling DeltaWorld to efficiently generate diverse plausible futures.
All models operate at 512x512 resolution with a frozen DINOv3 ViT-B backbone. The released DeltaTok and DeltaWorld are trained on Kinetics-700, while the paper uses a larger dataset. See Training & Evaluation and Example Training Resources for reproduction.
Evaluation heads for downstream tasks:
| Task | Dataset | Metric | Download |
|---|---|---|---|
| Segmentation | VSPW | mIoU: 58.4 | |
| Segmentation | Cityscapes | mIoU: 70.5 | |
| Depth | KITTI | RMSE: 2.79 | |
| RGB | ImageNet | visualization only |
ViT-B encoder and decoder trained on Kinetics-700. Reconstruction quality is measured by applying downstream task heads to the reconstructed features.
| Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|---|---|---|---|
| Short (1 frame) | 58.6 | 69.6 | 2.78 |
| Mid (3 frames)* | 58.5 | 67.9 | 2.86 |
*Parallel encoding from ground-truth frames with autoregressive decoding from previous reconstructions.
ViT-B predictor trained on Kinetics-700. Prediction quality is measured by applying downstream task heads to the predicted features. Cells report best-of-20 with mean in parentheses. best selects the sample with lowest DINOv3-feature loss to ground truth; mean averages DINOv3 features across all samples before evaluation.
| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|---|---|---|---|---|
| Copy last (lower bound) | Short (1 frame) | 51.2 | 53.5 | 3.76 |
| DeltaWorld | Short (1 frame) | 56.3 (54.2) | 66.2 (64.2) | 2.95 (3.32) |
| Copy last (lower bound) | Mid (3 frames) | 44.3 | 39.6 | 4.86 |
| DeltaWorld | Mid (3 frames) | 51.5 (46.6) | 55.3 (49.5) | 3.71 (4.74) |
Requires Miniconda (or Anaconda), a Weights & Biases account for logging, and a Hugging Face account. Accept the license at facebook/dinov3-vitb16-pretrain-lvd1689m so the gated DINOv3 ViT-B backbone downloads automatically on first run.
conda create -n deltatok python=3.14.2
conda activate deltatok
pip install -r requirements.txt
wandb login
hf auth login
cp .env.example .envPrepare Kinetics-700 to train from scratch, and any of VSPW, Cityscapes, or KITTI for evaluation metrics and visualizations on that dataset. For each dataset you prepare, set the corresponding *_ROOT path in .env to the absolute path of the downloaded dataset directory.
mkdir -p kinetics/train
wget -i https://s3.amazonaws.com/kinetics/700_2020/train/k700_2020_train_path.txt -P k700_tars/
for f in k700_tars/*.tar.gz; do tar -xzf "$f" -C kinetics/train; donePre-extracted frames (as a directory of frame folders or zip archives) are also supported for faster data loading. See
datasets/kinetics.pyfor details.
pip install gdown
gdown "https://drive.google.com/file/d/14yHWsGneoa1pVdULFk7cah3t-THl7yEz/view?usp=sharing" --fuzzy
tar -xf VSPW_dataset.tar # extracts to VSPW/If
gdownfails due to rate limiting, downloadVSPW_dataset.tarmanually from the Google Drive link.
Requires registration at the Cityscapes website. Set CITYSCAPES_USERNAME and CITYSCAPES_PASSWORD environment variables for headless servers, or csDownload will prompt interactively.
pip install cityscapesscripts
mkdir -p cityscapes
csDownload -d cityscapes gtFine_trainvaltest.zip leftImg8bit_sequence_trainvaltest.zip
cd cityscapes && unzip -q gtFine_trainvaltest.zip && unzip -q leftImg8bit_sequence_trainvaltest.zip && cd ..wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_depth_annotated.zip
unzip data_depth_annotated.zip -d kitti && rm data_depth_annotated.zip
for drive in 2011_09_26_drive_{0002,0009,0013,0020,0023,0027,0029,0036,0046,0048,0052,0056,0059,0064,0084,0086,0093,0096,0101,0106,0117} 2011_09_28_drive_0002 2011_09_29_drive_0071 2011_09_30_drive_{0016,0018,0027} 2011_10_03_drive_{0027,0047}; do
wget -P kitti "https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/${drive}/${drive}_sync.zip"
unzip -o -d kitti "kitti/${drive}_sync.zip" && rm "kitti/${drive}_sync.zip"
doneTraining and evaluation use Lightning CLI. To get evaluation metrics and visualizations on a dataset, download the pre-trained task head for that dataset and set the corresponding *_HEAD_PATH in .env to the absolute path of the downloaded file.
The effective batch size should be 1024 for both DeltaTok and DeltaWorld. It's the product of four parameters:
--data.batch_size × --trainer.devices × --trainer.num_nodes × --trainer.accumulate_grad_batches
The default config reaches this on a single node with 8 GPUs at per-GPU batch size 128 and no gradient accumulation; adjust any of the four parameters to fit your hardware. See Example Training Resources for the configurations we used for each stage.
Stage 1: Pre-train at 256px
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--data.frame_size=256 \
--trainer.max_steps=1000000Stage 2: High-resolution fine-tune at 512px
--model.ckpt_path loads model weights only; optimizer state and step counter reset.
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--model.lr=1e-4 \
--trainer.max_steps=500000 \
--model.ckpt_path=path/to/stage1/last.ckptStage 3-4: LR cooldowns
--ckpt_path resumes full training state (model weights, optimizer state, step counter).
# Stage 3
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--model.lr=1e-5 \
--trainer.max_steps=550000 \
--ckpt_path=path/to/stage2/last.ckpt
# Stage 4
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--model.lr=1e-6 \
--trainer.max_steps=600000 \
--ckpt_path=path/to/stage3/last.ckptRequires a DeltaTok checkpoint: either the released one (pytorch_model.bin) or one from your own training (last.ckpt).
python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
--model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin \
--trainer.max_steps=300000LR cooldown
python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
--model.lr=1e-5 \
--trainer.max_steps=305000 \
--ckpt_path=path/to/deltaworld/last.ckptDeltaTok
python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.binDeltaWorld
Requires both DeltaTok and DeltaWorld checkpoints.
python main.py validate -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
--model.ckpt_path=path/to/deltaworld-kinetics/pytorch_model.bin \
--model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.binTraining times and memory are measured on NVIDIA H200 GPUs. The configurations below are examples; any setup that reaches the target effective batch size works.
| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
|---|---|---|---|---|---|---|---|
| 1. Pre-train | 256 | 1e-3 | 1M | 8 | 128 | 65 GB | 82h |
| 2. Hi-res fine-tune | 512 | 1e-4 | 500k | 16 | 64 | 109 GB | 89h |
| 3. LR cooldown | 512 | 1e-5 | 50k | 16 | 64 | 109 GB | 9h |
| 4. LR cooldown | 512 | 1e-6 | 50k | 16 | 64 | 109 GB | 9h |
| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
|---|---|---|---|---|---|---|---|
| 1. Train | 512 | 1e-4 | 300k | 32 | 32 | 58 GB | 32h |
| 2. LR cooldown | 512 | 1e-5 | 5k | 32 | 32 | 58 GB | <1h |
@inproceedings{kerssies2026deltatok,
title = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
author = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.