In order to verify the correctness of PRs, we should have a test / benchmark suite which runs on a variety of hardware / OS configs. GCP seems like the ideal provider here, as they offer RTX PRO 6000s and Windows Server, and we already have systems on GCP.
We should eventually extend this to a variety of consumer hardware via Vast, but that is out of scope for the CI/CD MVP. We should ensure there's coverage for the following machines:
- Windows Server + L4 GPU (G2)
- Linux + L4 GPU (G2)
- Linux + RTX PRO 6000 (G4)
The test / benchmark suite should run when commits are pushed to a "ready for review" PR.
Engine creation parameters should be defined using the signature of WorldEngine.__init__. Here are some good starter WorldEngine configs:
[
{"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
{"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null},
]
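As a rough illustration, these configs could be expanded into engine instances by splatting each dict into the constructor. This is only a sketch: the import path is hypothetical, and it assumes the JSON keys map 1:1 onto WorldEngine.__init__ keyword arguments.

```python
import json

from world_engine import WorldEngine  # hypothetical import path


def build_engines(config_path: str) -> list[WorldEngine]:
    """Instantiate one WorldEngine per config dict.

    Assumes each JSON key (model_uri, quant, model_config_overrides)
    corresponds directly to a WorldEngine.__init__ keyword argument.
    """
    with open(config_path) as f:
        configs = json.load(f)
    return [WorldEngine(**cfg) for cfg in configs]
```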
Running
For each (machine, WorldEngine config), run the following for both main and the HEAD of the PR branch:
- Performance: Run the benchmarks. This should be used to create a table comparing the performance of all machines / configs for main and the PR. examples/benchmark.py can be adapted into a script which calculates LFPS (see the sketch after this list). Run a 256-frame rollout for now. Note: any failed runs should be marked as such in the benchmark table rather than excluded.
- Consistency: Run a forward pass with a fully populated KV cache. You can use WorldEngine.get_state(...) and WorldEngine.load_state(...) to create a shared state across all runs, then calculate the MSE between the latent output of main and this PR (see the sketch after this list). Note: use torch.use_deterministic_algorithms(True) for this step only.
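A minimal sketch of the LFPS measurement, assuming examples/benchmark.py is adapted along these lines; engine.step() is a hypothetical stand-in for whatever per-frame call the benchmark actually makes, and failed runs return None so the table can mark them rather than drop them.

```python
import time


def measure_lfps(engine, num_frames: int = 256) -> float | None:
    """Roll out num_frames latent frames and return latent frames per second.

    engine.step() is a hypothetical stand-in for the per-frame call in
    examples/benchmark.py. Returns None on failure so the benchmark table
    can mark the run as failed instead of excluding it.
    """
    try:
        start = time.perf_counter()
        for _ in range(num_frames):
            engine.step()
        elapsed = time.perf_counter() - start
        return num_frames / elapsed
    except Exception:
        return None
```

And a sketch of the consistency check, assuming the shared state was produced earlier via WorldEngine.get_state(...) and that a single forward pass exposing the latent output exists (the forward() call below is hypothetical). Determinism is enabled only around this step, per the note above.

```python
import torch


def consistency_mse(engine_main, engine_pr, shared_state) -> float:
    """MSE between the latent outputs of main and the PR from a shared KV-cache state.

    shared_state is assumed to come from WorldEngine.get_state(...) on a
    reference rollout with a fully populated KV cache.
    """
    torch.use_deterministic_algorithms(True)
    try:
        engine_main.load_state(shared_state)
        engine_pr.load_state(shared_state)

        latent_main = engine_main.forward()  # hypothetical forward pass returning a latent tensor
        latent_pr = engine_pr.forward()
        return torch.mean((latent_main - latent_pr) ** 2).item()
    finally:
        # Determinism is only required for this step.
        torch.use_deterministic_algorithms(False)
```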
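Both sketches are illustrations of the intended shape of these checks, not the final scripts; the real runner should reuse whatever engine API examples/benchmark.py already exercises.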
Misc
Heuristic: Prefer fewer changes, fewer added files, fewer lines of code.
Per Mithun: "if possible, we should design it so that it's provider-agnostic and that it's easy to add additional tasks, so that we can onboard vast later and/or add more tests if required (e.g. producing samples that we can look at)"
- Caveat: if Mithun's suggestion significantly complicates things or increases scope, it can be skipped for now; otherwise it's preferable.