
feat(api): add make run-gpu for native CUDA serving #52

Merged
Chouffe merged 3 commits into main from arthur/api-gpu-run on Jun 12, 2026

Chouffe commented Jun 11, 2026

Summary

  • Adds two targets to api/Makefile for serving the API natively on a CUDA GPU (the published Docker image intentionally stays CPU-only; there is no second image variant):
    • gpu-setup (one-time): re-syncs the venv, then swaps torch/torchvision for cu130 wheels pinned to the uv.lock releases (2.12.0/0.27.0) so the GPU venv can't silently drift from the tested versions.
    • run-gpu: asserts that CUDA is actually available, then serves with uv run --no-sync. It refuses to start (rather than silently serving on CPU) when a plain uv run or uv sync has reverted the venv to the locked CPU wheels. Restarts do no reinstall work and need no network.
  • Documents the flow in api/README.md: native runs must set TEMPORAL_API_MODEL_PATH (the /models/model.zip default only exists in the container), make test/lint/format revert the CUDA wheels, and, because run-gpu binds 0.0.0.0:8000, TEMPORAL_API_TOKEN should be set on shared networks.
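
A minimal Python sketch of the kind of check run-gpu performs before serving, assuming it runs inside the venv (for example through uv run --no-sync python). The function name cuda_guard is hypothetical; only the message text and exit status 2 come from the test plan, and the actual recipe in api/Makefile may be written differently.

```python
import sys
from typing import Any, Optional

# Message and exit status as reported in the PR's test plan; the real
# api/Makefile recipe may implement the check differently.
GUARD_MESSAGE = "CUDA torch not available - run: make gpu-setup"
GUARD_EXIT_CODE = 2


def cuda_guard(torch_module: Optional[Any] = None) -> int:
    """Return 0 if torch in this venv sees a CUDA device, else GUARD_EXIT_CODE.

    torch_module lets a test inject a stand-in; by default the venv's own
    torch is imported.
    """
    if torch_module is None:
        try:
            import torch as torch_module  # CPU wheels import fine, so this rarely fails
        except ImportError:
            print(GUARD_MESSAGE, file=sys.stderr)
            return GUARD_EXIT_CODE
    # CPU-only wheels report False here instead of raising.
    if not torch_module.cuda.is_available():
        print(GUARD_MESSAGE, file=sys.stderr)
        return GUARD_EXIT_CODE
    return 0


if __name__ == "__main__":
    # A non-zero exit stops the make recipe before it reaches uvicorn.
    sys.exit(cuda_guard())
```

Because CPU wheels import cleanly and simply return False from torch.cuda.is_available(), the guard has to test CUDA availability explicitly rather than rely on the import failing.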

Test Plan

  • make gpu-setup + make run-gpu end-to-end on an RTX 4070 Ti SUPER (driver 580.159.03): the server holds GPU memory (1172 MiB), and POST /predict on a real 7-frame MinIO-backed sequence returns is_smoke=true, probability=0.87009, identical to CPU serving.
  • Pin verified: gpu-setup installs exactly torch==2.12.0+cu130 / torchvision==0.27.0+cu130, with numpy staying at the locked 1.26.4 (the unpinned form had bumped it to 2.4.4).
  • Guard verified: make run-gpu on a CPU venv exits with status 2 and the message "CUDA torch not available - run: make gpu-setup".
  • Restart path verified: run-gpu goes straight to uvicorn, with no sync/reinstall churn.
  • pytest (122 passed, 1 skipped) and ruff check clean.
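
The pin check from the test plan can be reproduced with importlib.metadata. This is a sketch, not the project's tooling: pin_mismatches is a hypothetical helper, and the expected versions are simply the ones quoted in the test plan (a real check would more likely read them from uv.lock).

```python
from importlib import metadata
from typing import Callable, Dict, List

# Versions quoted in the test plan; hard-coded here only for illustration.
EXPECTED_PINS: Dict[str, str] = {
    "torch": "2.12.0+cu130",
    "torchvision": "0.27.0+cu130",
    "numpy": "1.26.4",
}


def pin_mismatches(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one readable line per package whose installed version differs."""
    problems: List[str] = []
    for package, want in expected.items():
        try:
            have = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {want})")
            continue
        if have != want:
            problems.append(f"{package}: {have} installed, expected {want}")
    return problems


if __name__ == "__main__":
    for line in pin_mismatches(EXPECTED_PINS):
        print(line)
```

Run inside the GPU venv, an empty result means the swap left torch, torchvision and numpy exactly where the lock expects them; a numpy bump like the 2.4.4 one described in the test plan would surface as a single mismatch line.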

The released Docker image stays CPU-only on purpose (torch pinned to the
pytorch-cpu wheel index keeps it small). For GPU benchmarking/dev, run-gpu
swaps the venv's torch/torchvision for cu130 wheels and runs uvicorn via
uv run --no-sync so the CUDA wheels survive startup.
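
Why --no-sync matters: by default uv run syncs the project environment against uv.lock before executing the command, which would put the locked CPU wheels back. A hypothetical helper that assembles the serve command makes the flag's position explicit; the ASGI import path app.main:app is a placeholder, not the project's real module.

```python
import shlex
from typing import List


def run_gpu_argv(app: str, host: str = "0.0.0.0", port: int = 8000) -> List[str]:
    """Build the serve command as an argv list.

    --no-sync is a uv flag, so it must come before the command name; it stops
    uv from syncing the venv back to the locked CPU wheels.
    """
    return ["uv", "run", "--no-sync", "uvicorn", app, "--host", host, "--port", str(port)]


if __name__ == "__main__":
    # "app.main:app" is a placeholder import path.
    print(shlex.join(run_gpu_argv("app.main:app")))
```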

Verified end-to-end on an RTX 4070 Ti SUPER: /predict on a real 7-frame
sequence returns the same prediction as CPU serving, with the server
process on the GPU.
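
Parity with CPU serving can be checked by comparing the two /predict responses. This sketch assumes each response is a JSON object with top-level is_smoke and probability fields (the names reported in the test plan). It uses an absolute tolerance rather than exact equality, since GPU and CPU kernels are not guaranteed to produce bit-identical floats; the 1e-4 default is an arbitrary choice.

```python
import math
from typing import Any, Mapping


def same_prediction(
    cpu: Mapping[str, Any], gpu: Mapping[str, Any], abs_tol: float = 1e-4
) -> bool:
    """True if two /predict responses agree on is_smoke and, within abs_tol, probability."""
    if cpu["is_smoke"] != gpu["is_smoke"]:
        return False
    # Exact float equality is too strict across devices; compare with a tolerance.
    return math.isclose(cpu["probability"], gpu["probability"], rel_tol=0.0, abs_tol=abs_tol)


if __name__ == "__main__":
    reported = {"is_smoke": True, "probability": 0.87009}  # values from the test plan
    print(same_prediction(reported, dict(reported)))
```
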
Chouffe requested a review from MateoLostanlen on June 11, 2026 at 16:15
Chouffe added 2 commits June 11, 2026 18:24
…llback

Review follow-ups on the GPU recipe:

- Pin the cu130 install to the uv.lock releases (torch==2.12.0,
  torchvision==0.27.0) so the GPU venv cannot silently drift to a newer
  torch than the one CI tests; the pin also keeps shared deps (numpy)
  at their locked versions.
- Split the one-time wheel swap (gpu-setup) from serving (run-gpu) so
  restarts skip the multi-GB venv churn and need no network.
- run-gpu now asserts CUDA is available before serving instead of
  silently falling back to CPU after a plain uv run/uv sync reverted
  the wheels.
- README: document that TEMPORAL_API_MODEL_PATH must be set natively
  (the /models/model.zip default is container-only), that make
  test/lint/format revert the wheels, and to set TEMPORAL_API_TOKEN
  on shared networks since run-gpu binds 0.0.0.0.
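
The two README requirements for native runs lend themselves to a preflight check. native_preflight is hypothetical (it is not part of the PR) and only encodes the rules stated in the commit message: a native run needs TEMPORAL_API_MODEL_PATH pointing at a real file, and binding 0.0.0.0 calls for TEMPORAL_API_TOKEN on shared networks.

```python
import os
from typing import List, Mapping

# The default model path only exists inside the Docker image.
CONTAINER_DEFAULT_MODEL_PATH = "/models/model.zip"


def native_preflight(env: Mapping[str, str], host: str = "0.0.0.0") -> List[str]:
    """Return a list of problems with the environment for a native run."""
    problems: List[str] = []
    model_path = env.get("TEMPORAL_API_MODEL_PATH")
    if not model_path:
        problems.append(
            "TEMPORAL_API_MODEL_PATH is unset; the default "
            f"{CONTAINER_DEFAULT_MODEL_PATH} only exists in the container"
        )
    elif not os.path.isfile(model_path):
        problems.append(f"TEMPORAL_API_MODEL_PATH={model_path} is not a file")
    # Binding all interfaces exposes the API to the network; the README asks
    # for a token in that case.
    if host == "0.0.0.0" and not env.get("TEMPORAL_API_TOKEN"):
        problems.append("binding 0.0.0.0 without TEMPORAL_API_TOKEN set")
    return problems


if __name__ == "__main__":
    for problem in native_preflight(os.environ):
        print(problem)
```
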

Remaining review follow-ups:

- gpu-setup's help text now points at the README instead of repeating
  the driver floor, so the requirement lives in one place.
- README: note that the compose api service publishes the same port
  8000 as run-gpu; start only minio/createbuckets when compose is
  just providing S3.
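
Because the compose api service and run-gpu both want port 8000, a quick bind test shows whether the port is already taken before uvicorn starts. port_is_free is a hypothetical helper, and the check is inherently racy (another process can grab the port between the check and uvicorn's own bind), so it only buys a clearer error message. When compose is only providing S3, naming the services explicitly, e.g. docker compose up -d minio createbuckets (assuming those are the service names in the compose file), leaves 8000 free.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Try to bind host:port; False if another process (e.g. compose's api) holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # permission errors etc. are real failures, not "in use"
    return True


if __name__ == "__main__":
    print(port_is_free("0.0.0.0", 8000))
```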

Kept --reinstall-package (not the shorter --reinstall): testing the
swap showed that --reinstall force-reinstalls the whole resolution,
bumping numpy past its locked 1.26.4, which is exactly the drift the
version pins exist to prevent.
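
The difference between the two flags is easiest to see in the command itself. Below is a sketch of how gpu-setup's swap could be assembled, assuming uv pip install with the PyTorch cu130 wheel index at https://download.pytorch.org/whl/cu130 (an assumption; the PR does not show the recipe). Under PEP 440, a specifier without a local label such as torch==2.12.0 also matches the 2.12.0+cu130 build, so the uv.lock versions carry over unchanged. That is presumably also why a forced reinstall is needed at all: the installed CPU build of the same version already satisfies the pin.

```python
from typing import Dict, List

# Versions pinned in the PR; the index URL is an assumption, not taken from it.
CUDA_PINS: Dict[str, str] = {"torch": "2.12.0", "torchvision": "0.27.0"}
CU130_INDEX = "https://download.pytorch.org/whl/cu130"


def gpu_swap_argv(pins: Dict[str, str], index_url: str) -> List[str]:
    """Build a uv pip install command that swaps only the pinned packages."""
    argv = ["uv", "pip", "install", "--index-url", index_url]
    for package in pins:
        # Force-reinstall just this package. A bare --reinstall would reinstall
        # the whole resolution and can move numpy off its locked version.
        argv += ["--reinstall-package", package]
    # Exact pins keep the swap on the versions CI tests against.
    argv += [f"{package}=={version}" for package, version in pins.items()]
    return argv


if __name__ == "__main__":
    print(" ".join(gpu_swap_argv(CUDA_PINS, CU130_INDEX)))
```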

MateoLostanlen left a comment

LGTM

Chouffe merged commit 327a951 into main on Jun 12, 2026
5 checks passed