
GitHub Actions Test / Benchmark Run #48

@lapp0

Description

In order to verify the correctness of PRs, we should have a test / benchmark suite that runs on a variety of hardware / OS configs. GCP seems like the ideal provider here, as it offers RTX PRO 6000s and Windows Server, and we already have systems using GCP.

While we should eventually extend this to a variety of consumer hardware via Vast, that is out of scope for the CI/CD MVP. We should ensure there's coverage for the following machines:

  • Windows Server + L4 GPU (G2)
  • Linux + L4 GPU (G2)
  • Linux + RTX PRO 6000 (G4)

The test / benchmark suite should run when commits are pushed to a "ready for review" PR.

Engine creation parameters should be defined using the signature of WorldEngine.__init__. Here are some good starter WorldEngine configs:

[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null}
]
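Parsed into keyword arguments, each entry above maps directly onto the WorldEngine.__init__ parameters. A minimal sketch of loading them (the EngineConfig dataclass and loader are hypothetical; only the three keys come from the configs above):

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical container mirroring the three WorldEngine.__init__
# parameters used in the starter configs above.
@dataclass
class EngineConfig:
    model_uri: str
    quant: Optional[str] = None
    model_config_overrides: Optional[dict] = None

STARTER_CONFIGS = """
[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null}
]
"""

configs = [EngineConfig(**entry) for entry in json.loads(STARTER_CONFIGS)]
# Each engine would then be constructed as WorldEngine(**asdict(cfg)).
```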

Running

For each (machine, WorldEngine Config), run the following for both main and the HEAD of the PR branch:
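Spelled out, the run matrix is the cross product of machines, engine configs, and branches; a quick sketch (the machine and config labels here are illustrative, not real runner names):

```python
from itertools import product

MACHINES = [
    "windows-l4-g2",        # Windows Server + L4 GPU (G2)
    "linux-l4-g2",          # Linux + L4 GPU (G2)
    "linux-rtxpro6000-g4",  # Linux + RTX PRO 6000 (G4)
]
CONFIGS = ["Waypoint-1.5-1B", "Waypoint-1.5-1B-intw8a8"]  # the two starter configs
BRANCHES = ["main", "pr-head"]

# 3 machines x 2 configs x 2 branches = 12 runs per PR
runs = list(product(MACHINES, CONFIGS, BRANCHES))
```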

    1. Performance: Run the benchmarks. This should be used to create a table comparing the performance of all machines / configs for main and the PR. examples/benchmark.py can be adapted into a script that calculates LFPS. It should run a 256-frame rollout for now. Note: any failed runs should be marked as such in the benchmark table rather than excluded.
    2. Consistency: Run a forward pass with a fully populated KV cache. You can use WorldEngine.get_state(...) and WorldEngine.load_state(...) to create a shared state across all runs. Then calculate the MSE between the latent output of main and this PR. Note: use torch.use_deterministic_algorithms(True) for this step only.
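The consistency step could look roughly like this sketch. The engine handles, their forward method, and the state plumbing are placeholders; only load_state, the deterministic flag, and the MSE comparison come from the steps above:

```python
import torch

def latent_mse(a: torch.Tensor, b: torch.Tensor) -> float:
    """MSE between the latent outputs of main and the PR branch."""
    return ((a - b) ** 2).mean().item()

def consistency_check(engine_main, engine_pr, shared_state, frame) -> float:
    # Determinism is enabled for this step only, as noted above.
    torch.use_deterministic_algorithms(True)
    try:
        # Hypothetical API: load the identical fully populated KV-cache
        # state into both engines, then run one forward pass each.
        engine_main.load_state(shared_state)
        engine_pr.load_state(shared_state)
        return latent_mse(engine_main.forward(frame), engine_pr.forward(frame))
    finally:
        torch.use_deterministic_algorithms(False)
```

A threshold on the returned MSE (e.g. exact zero for pure refactors, a small tolerance for quantized configs) would then decide pass/fail.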

Misc

Heuristic: Prefer fewer changes, fewer added files, fewer lines of code.

Per Mithun: "if possible, we should design it so that it's provider-agnostic and that it's easy to add additional tasks, so that we can onboard vast later and/or add more tests if required (e.g. producing samples that we can look at)"

  • Caveat: if Mithun's suggestion significantly complicates things / increases scope, it can be skipped for now; otherwise it's preferable.
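One lightweight way to honor the provider-agnostic suggestion without much extra scope is a pair of small interfaces, so Vast or new tasks slot in later. All names here are hypothetical; only the 256-frame benchmark via examples/benchmark.py comes from the issue:

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Something to run: benchmark, consistency check, sample generation later."""
    name: str

    @abstractmethod
    def command(self) -> list[str]: ...

class Provider(ABC):
    """Where a task runs: GCP now, Vast later."""
    @abstractmethod
    def run(self, task: Task) -> dict: ...

class BenchmarkTask(Task):
    name = "benchmark"

    def command(self) -> list[str]:
        # Adapted from examples/benchmark.py; 256-frame rollout per the issue.
        return ["python", "examples/benchmark.py", "--frames", "256"]

class GCPProvider(Provider):
    def run(self, task: Task) -> dict:
        # Placeholder: would provision the machine and execute the command.
        return {"provider": "gcp", "task": task.name, "command": task.command()}

result = GCPProvider().run(BenchmarkTask())
```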
