
amazon-far/deltatok

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

CVPR 2026 Highlight · Paper · Models

DeltaTok compresses the frame-to-frame change in vision foundation model features into a single delta token, enabling DeltaWorld to efficiently generate diverse plausible futures.
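The core idea can be pictured with a toy sketch (illustrative only: the real DeltaTok compresses DINOv3 feature maps with a learned encoder/decoder, not raw elementwise subtraction — `encode_deltas` and `decode_deltas` here are hypothetical stand-ins):

```python
# Toy delta tokenization: one full "token" for the first frame, then one
# delta per subsequent frame. Features are plain lists of floats here.

def encode_deltas(features):
    """Frame 0 is kept as-is; every later frame becomes a delta."""
    deltas = [features[0]]
    for prev, cur in zip(features, features[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas

def decode_deltas(deltas):
    """Reconstruct per-frame features by accumulating deltas."""
    frames = [deltas[0]]
    for d in deltas[1:]:
        frames.append([f + x for f, x in zip(frames[-1], d)])
    return frames

feats = [[0.0, 1.0], [0.5, 1.5], [0.5, 2.0]]
assert decode_deltas(encode_deltas(feats)) == feats  # lossless round trip
```

The learned version trades this lossless-but-uncompressed difference for a single compact token per frame, which is what makes long rollouts cheap.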

Model Zoo

All models operate at 512x512 resolution with a frozen DINOv3 ViT-B backbone. The released DeltaTok and DeltaWorld are trained on Kinetics-700, while the paper uses a larger dataset. See Training & Evaluation and Example Training Resources for reproduction.

Task Heads

Evaluation heads for downstream tasks:

| Task | Dataset | Metric | Download |
| --- | --- | --- | --- |
| Segmentation | VSPW | mIoU: 58.4 | Download |
| Segmentation | Cityscapes | mIoU: 70.5 | Download |
| Depth | KITTI | RMSE: 2.79 | Download |
| RGB | ImageNet | visualization only | Download |
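For reference, toy versions of the two metrics in the table (illustrative only — the released heads follow the standard VSPW/Cityscapes/KITTI evaluation protocols, which include details like ignore labels that this sketch omits):

```python
# Toy mIoU over flattened label lists and RMSE over flattened depth values.

def miou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes that appear."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

def rmse(pred, gt):
    """Root-mean-square error, as used for depth evaluation."""
    return (sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)) ** 0.5
```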

DeltaTok (Tokenizer) Download

ViT-B encoder and decoder trained on Kinetics-700. Reconstruction quality is measured by applying downstream task heads to the reconstructed features.

| Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
| --- | --- | --- | --- |
| Short (1 frame) | 58.6 | 69.6 | 2.78 |
| Mid (3 frames)* | 58.5 | 67.9 | 2.86 |

*Parallel encoding from ground-truth frames with autoregressive decoding from previous reconstructions.
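The footnote's protocol can be sketched as follows (a simplified sketch — `encode` and `decode` are hypothetical stand-ins for DeltaTok's networks; the point is that decoding conditions on the previous reconstruction, so errors can compound over the horizon):

```python
# Mid-horizon evaluation sketch: tokens come from ground-truth frame pairs
# in parallel, but each frame is decoded from the previous reconstruction.

def rollout(gt_frames, encode, decode):
    tokens = [encode(prev, cur) for prev, cur in zip(gt_frames, gt_frames[1:])]
    recon = [gt_frames[0]]                    # start from the real first frame
    for tok in tokens:
        recon.append(decode(recon[-1], tok))  # autoregressive decoding
    return recon
```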

DeltaWorld (Predictor) Download

ViT-B predictor trained on Kinetics-700. Prediction quality is measured by applying downstream task heads to the predicted features. Each cell reports the best-of-20 score with the mean in parentheses: best selects the sample with the lowest DINOv3-feature loss to the ground truth, while mean averages DINOv3 features across all 20 samples before evaluation.

| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
| --- | --- | --- | --- | --- |
| Copy last (lower bound) | Short (1 frame) | 51.2 | 53.5 | 3.76 |
| DeltaWorld | Short (1 frame) | 56.3 (54.2) | 66.2 (64.2) | 2.95 (3.32) |
| Copy last (lower bound) | Mid (3 frames) | 44.3 | 39.6 | 4.86 |
| DeltaWorld | Mid (3 frames) | 51.5 (46.6) | 55.3 (49.5) | 3.71 (4.74) |
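The two reporting modes can be sketched like this (a toy sketch where samples are plain feature vectors and `feature_loss` stands in for the DINOv3-feature distance to ground truth):

```python
# Best-of-N vs. mean-of-N reporting for multiple sampled futures.

def feature_loss(a, b):
    """Toy squared-error stand-in for the DINOv3-feature loss."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_of_n(samples, gt):
    """Pick the sample whose features are closest to ground truth."""
    return min(samples, key=lambda s: feature_loss(s, gt))

def mean_of_n(samples):
    """Average features elementwise across samples before evaluation."""
    return [sum(vals) / len(vals) for vals in zip(*samples)]
```

Best-of-N measures whether any sampled future is accurate; mean-of-N measures how well the sample distribution as a whole is centered on the ground truth.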

Setup

Requires Miniconda (or Anaconda), a Weights & Biases account for logging, and a Hugging Face account. Accept the license at facebook/dinov3-vitb16-pretrain-lvd1689m so the gated DINOv3 ViT-B backbone downloads automatically on first run.

conda create -n deltatok python=3.14.2
conda activate deltatok
pip install -r requirements.txt
wandb login
hf auth login
cp .env.example .env

Data Preparation

Prepare Kinetics-700 to train from scratch, and any of VSPW, Cityscapes, or KITTI for evaluation metrics and visualizations on that dataset. For each dataset you prepare, set the corresponding *_ROOT path in .env to the absolute path of the downloaded dataset directory.
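Before launching, it can save a failed run to check that every `*_ROOT` entry in `.env` points at an existing directory. A minimal sketch (the repo loads `.env` itself; `check_env_roots` is a hypothetical helper, not part of the codebase):

```python
# Return the *_ROOT keys in a .env-style string whose paths do not exist.
import os

def check_env_roots(env_text):
    missing = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.endswith("_ROOT") and not os.path.isdir(value):
            missing.append(key)
    return missing
```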

Kinetics-700 (training, ~1.2 TB)

mkdir -p kinetics/train
wget -i https://s3.amazonaws.com/kinetics/700_2020/train/k700_2020_train_path.txt -P k700_tars/
for f in k700_tars/*.tar.gz; do tar -xzf "$f" -C kinetics/train; done

Pre-extracted frames (as a directory of frame folders or zip archives) are also supported for faster data loading. See datasets/kinetics.py for details.

VSPW (evaluation, ~43 GB)

pip install gdown
gdown "https://drive.google.com/file/d/14yHWsGneoa1pVdULFk7cah3t-THl7yEz/view?usp=sharing" --fuzzy
tar -xf VSPW_dataset.tar  # extracts to VSPW/

If gdown fails due to rate limiting, download VSPW_dataset.tar manually from the Google Drive link.

Cityscapes (evaluation, ~325 GB)

Requires registration at the Cityscapes website. Set CITYSCAPES_USERNAME and CITYSCAPES_PASSWORD environment variables for headless servers, or csDownload will prompt interactively.

pip install cityscapesscripts
mkdir -p cityscapes
csDownload -d cityscapes gtFine_trainvaltest.zip leftImg8bit_sequence_trainvaltest.zip
cd cityscapes && unzip -q gtFine_trainvaltest.zip && unzip -q leftImg8bit_sequence_trainvaltest.zip && cd ..

KITTI (evaluation, ~44 GB)

wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_depth_annotated.zip
unzip data_depth_annotated.zip -d kitti && rm data_depth_annotated.zip
for drive in 2011_09_26_drive_{0002,0009,0013,0020,0023,0027,0029,0036,0046,0048,0052,0056,0059,0064,0084,0086,0093,0096,0101,0106,0117} 2011_09_28_drive_0002 2011_09_29_drive_0071 2011_09_30_drive_{0016,0018,0027} 2011_10_03_drive_{0027,0047}; do
  wget -P kitti "https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/${drive}/${drive}_sync.zip"
  unzip -o -d kitti "kitti/${drive}_sync.zip" && rm "kitti/${drive}_sync.zip"
done

Training & Evaluation

Training and evaluation use Lightning CLI. To get evaluation metrics and visualizations on a dataset, download the pre-trained task head for that dataset and set the corresponding *_HEAD_PATH in .env to the absolute path of the downloaded file.

The effective batch size should be 1024 for both DeltaTok and DeltaWorld. It's the product of four parameters:

--data.batch_size × --trainer.devices × --trainer.num_nodes × --trainer.accumulate_grad_batches

The default config reaches this on a single node with 8 GPUs at per-GPU batch size 128 and no gradient accumulation; adjust any of the four parameters to fit your hardware. See Example Training Resources for the configurations we used for each stage.
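The arithmetic above as a quick pre-launch sanity check (parameter names mirror the four CLI flags; the default values are the ones described in the text):

```python
# Effective batch size = batch_size x devices x num_nodes x grad accumulation.

def effective_batch_size(batch_size, devices, num_nodes=1,
                         accumulate_grad_batches=1):
    return batch_size * devices * num_nodes * accumulate_grad_batches

assert effective_batch_size(128, 8) == 1024                       # default config
assert effective_batch_size(64, 8, num_nodes=2) == 1024           # 16 GPUs at 64
assert effective_batch_size(32, 8, accumulate_grad_batches=4) == 1024
```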

Training DeltaTok (Tokenizer)

Stage 1: Pre-train at 256px

python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --data.frame_size=256 \
  --trainer.max_steps=1000000

Stage 2: High-resolution fine-tune at 512px

--model.ckpt_path loads model weights only; optimizer state and step counter reset.

python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-4 \
  --trainer.max_steps=500000 \
  --model.ckpt_path=path/to/stage1/last.ckpt

Stages 3 and 4: LR cooldowns

--ckpt_path resumes full training state (model weights, optimizer state, step counter).

# Stage 3
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-5 \
  --trainer.max_steps=550000 \
  --ckpt_path=path/to/stage2/last.ckpt

# Stage 4
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-6 \
  --trainer.max_steps=600000 \
  --ckpt_path=path/to/stage3/last.ckpt
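The difference between the two flags can be pictured with a simplified layout of a Lightning-style checkpoint (a sketch under that assumption; real checkpoints contain more keys):

```python
# --model.ckpt_path reads only "state_dict" (weights); --ckpt_path restores
# the whole dict, so training resumes at the saved step with the saved
# optimizer state. That is why stage 2 restarts its step counter but
# stages 3-4 continue counting from 500k.
ckpt = {
    "state_dict": {"encoder.weight": "..."},  # model weights only
    "optimizer_states": [{"lr": 1e-4}],       # restored only by --ckpt_path
    "global_step": 500_000,                   # restored only by --ckpt_path
}

weights_only = ckpt["state_dict"]  # what --model.ckpt_path loads
full_resume = ckpt                 # what --ckpt_path restores
```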

Training DeltaWorld (Predictor)

Requires a DeltaTok checkpoint: either the released one (pytorch_model.bin) or one from your own training (last.ckpt).

python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin \
  --trainer.max_steps=300000

LR cooldown

python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-5 \
  --trainer.max_steps=305000 \
  --ckpt_path=path/to/deltaworld/last.ckpt

Evaluation

DeltaTok

python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin

DeltaWorld

Requires both DeltaTok and DeltaWorld checkpoints.

python main.py validate -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltaworld-kinetics/pytorch_model.bin \
  --model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin

Example Training Resources

Training times and memory are measured on NVIDIA H200 GPUs. The configurations below are examples; any setup that reaches the target effective batch size works.

DeltaTok

| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. Pre-train | 256 | 1e-3 | 1M | 8 | 128 | 65 GB | 82h |
| 2. Hi-res fine-tune | 512 | 1e-4 | 500k | 16 | 64 | 109 GB | 89h |
| 3. LR cooldown | 512 | 1e-5 | 50k | 16 | 64 | 109 GB | 9h |
| 4. LR cooldown | 512 | 1e-6 | 50k | 16 | 64 | 109 GB | 9h |

DeltaWorld

| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. Train | 512 | 1e-4 | 300k | 32 | 32 | 58 GB | 32h |
| 2. LR cooldown | 512 | 1e-5 | 5k | 32 | 32 | 58 GB | <1h |

Citation

@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.