Report backend GPUs and bundle GPU benchmarks #509

Merged
i386 merged 13 commits into main from codex/skippy-device-gpu-survey
May 12, 2026

@i386
Collaborator

@i386 i386 commented May 10, 2026

Summary

  • reports GPUs from the active skippy/llama backend device enumeration path instead of platform CLI tools, so CUDA reports CUDA devices, Vulkan reports Vulkan devices, Metal reports Metal devices, and ROCm/HIP stays backend-specific
  • adds the llama.cpp skippy/devices.h ABI plus shared skippy/common.h status/error types, then exposes backend device data through skippy-ffi and skippy-runtime
  • enriches NVIDIA facts through CUDA/NVML SDK libraries instead of nvidia-smi; Linux skippy-enabled survey can still discover NVIDIA GPUs through SDK libraries when skippy reports no GPU devices
  • moves GPU benchmark ownership into a new mesh-llm-gpu-bench crate and compiles benchmark backends into mesh-llm instead of discovering membench-fingerprint* helper executables
  • builds Metal benchmark support into macOS binaries by default, and wires CUDA/ROCm/Intel benchmark backends behind build-flavor features for SDK-backed builds
  • fixes Jetson/Orin pinned GPU startup compatibility by ignoring placeholder PCI IDs like 00000000:00:00.0, resolving UUID aliases, and accepting the single available pinnable GPU for legacy single-GPU pins
  • splits the large hardware implementation into focused modules: hardware/mod.rs, hardware/parsers.rs, hardware/tests.rs, hardware/skippy_devices.rs, and hardware/enrichers.rs
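The pinned-GPU compatibility rules in the summary (skip the placeholder PCI ID, resolve aliases, accept the only pinnable GPU for a legacy pin) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PinnableGpu`, `stable_id`, `resolve_pin`, and the `uuid:` prefix are hypothetical names, and the sketch simplifies by treating any unresolved pin on a single-GPU host as a legacy pin.

```rust
/// Hypothetical view of one GPU that a config may pin to.
#[derive(Debug, Clone, PartialEq)]
pub struct PinnableGpu {
    /// Preferred identity, e.g. "pci:00000000:01:00.0".
    pub stable_id: String,
    /// Alternate spellings that should keep resolving, e.g. a UUID form.
    pub aliases: Vec<String>,
}

/// Placeholder address quoted in the summary; it identifies no real slot.
const PLACEHOLDER_PCI: &str = "00000000:00:00.0";

/// Build a stable ID, skipping the placeholder PCI address.
/// The "uuid:" prefix is an assumption; only "pci:" appears in the PR output.
pub fn stable_id(pci_bdf: Option<&str>, uuid: Option<&str>) -> Option<String> {
    match pci_bdf.map(str::trim) {
        Some(bdf) if !bdf.is_empty() && bdf != PLACEHOLDER_PCI => {
            Some(format!("pci:{bdf}"))
        }
        _ => uuid.map(|u| format!("uuid:{u}")),
    }
}

/// Resolve a configured pin: exact stable ID first, then aliases,
/// then fall back to the only pinnable GPU if exactly one exists.
pub fn resolve_pin<'a>(
    configured: &str,
    gpus: &'a [PinnableGpu],
) -> Option<&'a PinnableGpu> {
    gpus.iter()
        .find(|g| g.stable_id.as_str() == configured)
        .or_else(|| {
            gpus.iter()
                .find(|g| g.aliases.iter().any(|a| a.as_str() == configured))
        })
        .or_else(|| if gpus.len() == 1 { gpus.first() } else { None })
}
```

With this ordering, a multi-GPU host never guesses: an unknown pin resolves to `None` rather than to an arbitrary device.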

Architecture

  • mesh-llm-system keeps survey, cache, fingerprint, and pinned-GPU policy.
  • mesh-llm-gpu-bench owns native benchmark backend selection and execution.
  • CUDA/HIP/Intel benchmark code is compiled into the Rust crate for matching build flavors; no runtime CLI benchmark fallback remains.
  • The active backend identity is preserved rather than overlaid by Vulkan identities.
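As a rough illustration of "no runtime CLI benchmark fallback", backend selection could be keyed on the backend device label and the set of benchmark backends compiled into the binary. Everything named here is hypothetical (`BenchBackend`, `family_for_label`, `select_backend`); the `CUDA`, `MTL`, and `Vulkan` labels come from the output in this PR, while the `ROCm` and `SYCL` labels are assumptions about what those backends report. In a real crate the compiled-in list would presumably be derived from cargo features rather than passed in.

```rust
/// Benchmark backends that may be compiled into the binary (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BenchBackend {
    Metal,
    Cuda,
    Hip,
    Intel,
}

/// Map a backend device label such as "CUDA0" to a benchmark family.
fn family_for_label(label: &str) -> Option<BenchBackend> {
    // Strip the trailing device index: "CUDA0" -> "CUDA".
    let family = label.trim_end_matches(|c: char| c.is_ascii_digit());
    match family {
        "MTL" => Some(BenchBackend::Metal),
        "CUDA" => Some(BenchBackend::Cuda),
        "ROCm" => Some(BenchBackend::Hip),   // assumed label
        "SYCL" => Some(BenchBackend::Intel), // assumed label
        _ => None, // e.g. "Vulkan0": no benchmark backend in this sketch
    }
}

/// Pick a benchmark backend only if it matches the active device and is
/// compiled in; otherwise return None instead of shelling out to a helper.
pub fn select_backend(
    label: &str,
    compiled_in: &[BenchBackend],
) -> Option<BenchBackend> {
    family_for_label(label).filter(|b| compiled_in.contains(b))
}
```

Returning `None` leaves the caller to skip the benchmark, which matches the stated design of not searching for helper executables.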

Protocol

  • Mesh gossip/protobuf compatibility is unchanged; no wire fields are removed or repurposed.
  • Existing pinned GPU configs continue to work across the UUID/PCI identity transition, including the Jetson single-GPU case.

Testing

  • cargo fmt --all -- --check
  • cargo check -p mesh-llm-gpu-bench
  • cargo check -p mesh-llm-system
  • cargo check -p mesh-llm
  • cargo test -p mesh-llm-system benchmark --lib
  • cargo test -p mesh-llm-system hardware --lib
  • just build-dev
  • local Metal before/after mesh-llm gpus --json: byte-for-byte identical
  • local Metal mesh-llm gpus --json after log suppression: empty stderr and valid GPU JSON
  • local compiled-in Metal benchmark: target/debug/mesh-llm gpus benchmark refreshed one GPU fingerprint without any membench-fingerprint* helper binary present
  • white.local CUDA before/after: hardware fields match; backend label intentionally changes from the old incorrect Vulkan0 overlay to CUDA0
  • white.local CUDA benchmark crate compile: cargo check -p mesh-llm-gpu-bench --features cuda
  • white.local CUDA runtime build: PATH=/usr/lib/llvm-21/bin:$PATH just build-runtime cuda
  • white.local compiled-in CUDA benchmark: target/debug/mesh-llm gpus benchmark refreshed one GPU fingerprint at 908.7 GB/s with no membench-fingerprint* helper binary present
  • white.local CUDA mesh-llm gpus --json: reports backend_device: CUDA0, stable PCI ID, vendor UUID, and no helper binaries in target
  • white.local Vulkan before/after: byte-for-byte identical
  • white.local CPU-linked release build: mesh-llm gpus --json reports NVIDIA GPU via CUDA/NVML SDK discovery and emits empty stderr; scratch/build work used $HOME/tmp/mesh-llm-pr509-review

Notes

  • The new llama.cpp upstream pin is 389ff61d77b5c71cec0cf92fe4e5d01ace80b797.
  • CUDA compiled-in benchmark backend was validated on white.local; HIP and Intel still need runtime validation on machines with those SDK toolchains available.

@i386 i386 force-pushed the codex/skippy-device-gpu-survey branch from 5924b3a to 185f5c2 Compare May 10, 2026 21:07
@i386 i386 changed the title [codex] Report GPUs from skippy backend devices Report GPUs from skippy backend devices May 10, 2026
@i386 i386 marked this pull request as ready for review May 10, 2026 21:09
@i386 i386 requested a review from ndizazzo May 10, 2026 21:09

pub(crate) fn run_gpus(json_output: bool) -> Result<()> {
    let mut hw = hardware::survey();
    hardware::augment_gpu_facts_with_vulkan_devices(&mut hw.gpus);
Collaborator Author

@ndizazzo augmenting GPU facts like this was wrong, I think?

Collaborator

Not sure, seems like an okay idea to me. Why do you figure it was wrong?

Collaborator

@ndizazzo ndizazzo left a comment

This PR prevents me from starting mesh-llm on my dual-GPU system with the error:

./target/release/mesh-llm serve
configured gpu_id 'pci:00000000:01:00.0' could not be resolved because this host has no pinnable GPUs; available pinnable GPU IDs: none: startup model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' failed pinned GPU preflight

Despite the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   29C    P8              8W /  500W |    3877MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   25C    P8             10W /  300W |    8774MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

The release build detects no GPUs, but a debug build works:

./target/release/mesh-llm gpu
⚠️ No GPUs detected on this node.

./target/debug/mesh-llm gpu
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 41954 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32077 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 9877 MiB
🖥️ GPU 0
  Name: NVIDIA GeForce RTX 5090
  Stable ID: pci:00000000:01:00.0
  Backend device: CUDA0
  VRAM: 34.2 GB
  Bandwidth: 1661.1 GB/s
  Unified memory: no
  PCI BDF: 00000000:01:00.0
  Vendor UUID: GPU-80ded6bd-1a89-2628-3d94-902187dbab1d

🖥️ GPU 1
  Name: NVIDIA GeForce RTX 3080
  Stable ID: pci:00000000:06:00.0
  Backend device: CUDA1
  VRAM: 10.7 GB
  Bandwidth: 720.2 GB/s
  Unified memory: no
  PCI BDF: 00000000:06:00.0
  Vendor UUID: GPU-6b7fe24c-5f15-4ac5-88d6-c8934135a4ea
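The VRAM figures in this output (34.2 GB and 10.7 GB) appear consistent with the nvidia-smi totals above (32607MiB and 10240MiB) expressed in decimal gigabytes. That reading is an inference from the numbers, not something the PR states, and the helper names below are hypothetical. A quick arithmetic check:

```rust
/// Convert mebibytes (2^20 bytes) to decimal gigabytes (10^9 bytes).
pub fn mib_to_decimal_gb(mib: u64) -> f64 {
    (mib as f64) * 1024.0 * 1024.0 / 1e9
}

/// Format the way the GPU listing appears to, with one decimal place.
pub fn format_gb(mib: u64) -> String {
    format!("{:.1} GB", mib_to_decimal_gb(mib))
}
```

For example, `format_gb(32607)` gives `"34.2 GB"` and `format_gb(10240)` gives `"10.7 GB"`. The smaller per-device figures printed by `ggml_cuda_init` (32077 MiB and 9877 MiB) are a different measurement and would not reproduce these values.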

Debug C++ output still appears during device detection on Apple:

Release build:

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.024 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (Apple M4 Pro)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 40200.90 MB
🖥️ GPU 0
  Name: Apple M4 Pro
  Stable ID: metal:0
  Backend device: MTL0
  VRAM: 51.5 GB
  Bandwidth: 199.5 GB/s
  Unified memory: yes

Debug C++ output still appears during device detection on Linux:

Release build:

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 41954 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32077 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 9877 MiB
...

use std::ffi::CStr;

#[derive(Clone, Debug, Default)]
struct NvidiaDeviceInfo {
Collaborator

Possible to obtain the CUDA identifier here? CUDA0, or CUDA1?
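One possible approach, sketched under assumptions: match the SDK-reported PCI address against the address the backend reports for each device. Elsewhere in this conversation the backend log prints a 4-digit PCI domain (`0000:01:00.0`) while the GPU listing uses 8 digits (`00000000:01:00.0`), so the two would need normalizing first. `normalize_bdf` and `backend_label_for` are hypothetical helpers, and the device list is illustrative; this is not how the PR resolves the label.

```rust
/// Normalize "domain:bus:device.function" to an 8-digit-domain,
/// lowercase-hex form such as "00000000:01:00.0".
pub fn normalize_bdf(raw: &str) -> Option<String> {
    let mut parts = raw.trim().split(':');
    let (domain, bus, devfn) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() {
        return None; // too many fields
    }
    let (dev, func) = devfn.split_once('.')?;
    let domain = u32::from_str_radix(domain, 16).ok()?;
    let bus = u8::from_str_radix(bus, 16).ok()?;
    let dev = u8::from_str_radix(dev, 16).ok()?;
    let func = u8::from_str_radix(func, 16).ok()?;
    Some(format!("{domain:08x}:{bus:02x}:{dev:02x}.{func:x}"))
}

/// Find the backend label (e.g. "CUDA0") whose PCI address matches the
/// SDK-reported one. `backend` holds (label, pci_bdf) pairs.
pub fn backend_label_for<'a>(
    sdk_bdf: &str,
    backend: &'a [(&'a str, &'a str)],
) -> Option<&'a str> {
    let want = normalize_bdf(sdk_bdf)?;
    backend
        .iter()
        .find(|(_, bdf)| normalize_bdf(bdf).as_deref() == Some(want.as_str()))
        .map(|(label, _)| *label)
}
```

Matching on the PCI address rather than on enumeration order avoids assuming that the SDK and the backend list devices in the same order.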

@i386 i386 force-pushed the codex/skippy-device-gpu-survey branch from 185f5c2 to cad1bd0 Compare May 10, 2026 22:23
@i386 i386 changed the title Report GPUs from skippy backend devices Report backend GPUs and bundle GPU benchmarks May 10, 2026
@ndizazzo ndizazzo assigned i386 and unassigned ndizazzo May 10, 2026
@ndizazzo ndizazzo self-requested a review May 11, 2026 18:29
@ndizazzo ndizazzo assigned ndizazzo and unassigned i386 May 11, 2026
@ndizazzo ndizazzo force-pushed the codex/skippy-device-gpu-survey branch 2 times, most recently from 630f8e4 to 2255199 Compare May 12, 2026 03:33
@ndizazzo
Collaborator

Added some debugging tools to this PR while I was tracing model loading errors; they seem worth keeping:

./target/debug/mesh-llm serve --listen-all --log-format json --debug | npx node scripts/console-format.js

2026-05-12 03:32:34.767 - INFO: invite token ready for mesh 692f2608fecc1c01775d840bd9b011a4
  ↳ token=eyJpZCI6ImM0NWY2M2VjMmUyZDVlOWQxMWFhODY2ZTE3NmI2ZmYxZGIxOWFhMGM1NDIxN2VhNWNhY2EyYWE2Mjg5NGI5Y2EiLCJhZGRycyI6W3siUmVsYXkiOiJodHRwczovL3VzdzEtMi5yZWxheS5taWNoYWVsbmVhbGUubWVzaC1sbG0uaXJvaC5saW5rLi8ifSx7IklwIjoiMTAuNC4wLjEwOjUyNjQyIn0seyJJcCI6IjEwLjQyLjEuMDo1MjY0MiJ9LHsiSXAiOiIxMC40Mi4xLjE6NTI2NDIifSx7IklwIjoiNjQuMTM3LjE1Ny4xMzk6MCJ9LHsiSXAiOiIxNzIuMTcuMC4xOjUyNjQyIn0seyJJcCI6IjE3Mi4xOS4wLjE6NTI2NDIifV19 | mesh=692f2608fecc1c01775d840bd9b011a4
2026-05-12 03:32:34.767 - INFO: waiting for peers
2026-05-12 03:32:34.767 - INFO: startup plan ready (1 process(es), 2 endpoint(s), 1 model(s))
2026-05-12 03:32:34.767 - INFO: api ready at http://0.0.0.0:9337
2026-05-12 03:32:34.767 - INFO: web console ready at http://0.0.0.0:3131
2026-05-12 03:32:34.981 - INFO: loading model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:43.968 - DEBUG: model load plan: metadata rows=51, tensor rows=851
2026-05-12 03:32:43.968 - DEBUG: metadata 10% (6/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 20% (11/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 30% (16/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 40% (21/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 50% (26/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 60% (31/51 rows)
2026-05-12 03:32:43.969 - DEBUG: metadata 70% (36/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 80% (41/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 90% (46/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 100% (51/51 rows)
2026-05-12 03:32:44.003 - DEBUG: Reading model metadata...
  ↳ architecture=qwen35 | blocks=64 | ctx=262144 | embed=5120 | ffn=17408 | heads=24 | kv_heads=4 | name=Qwen3.6-27B | size=27B | tokenizer=gpt2 | tokenizer_pre=qwen35 | type=model
2026-05-12 03:32:44.003 - DEBUG: tensor types 10% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 20% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 30% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 40% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 50% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 60% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 70% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 80% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 90% (774/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 100% (851/851 types)
2026-05-12 03:32:44.003 - DEBUG: Reading tensor groups...
  ↳ f32=449 | iq4_xs=12 | q4_K=207 | q5_K=70 | q6_K=65 | q8_0=48
2026-05-12 03:32:44.005 - DEBUG: llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 29684 MiB free
2026-05-12 03:32:44.068 - DEBUG: init_tokenizer: initializing tokenizer for type 2
2026-05-12 03:32:44.105 - DEBUG: load: special tokens cache size = 33
2026-05-12 03:32:44.148 - DEBUG: load: token to piece cache size = 1.7581 MB
2026-05-12 03:32:44.148 - DEBUG: load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
2026-05-12 03:32:44.149 - DEBUG: layers 10% (7/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 20% (13/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 30% (20/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 40% (26/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 50% (32/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 60% (39/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 70% (45/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 80% (52/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 90% (58/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 100% (64/64 layers)
2026-05-12 03:32:44.464 - DEBUG: load_tensors:   CPU_Mapped model buffer size =   682.03 MiB
2026-05-12 03:32:45.577 - DEBUG: llama_kv_cache:      CUDA0 KV buffer size =  5984.00 MiB
2026-05-12 03:32:45.581 - DEBUG: kv cache plan: layer rows=16
2026-05-12 03:32:45.581 - DEBUG: llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 256
2026-05-12 03:32:45.581 - DEBUG: llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 256
2026-05-12 03:32:45.687 - DEBUG: sched_reserve:      CUDA0 compute buffer size =  1192.30 MiB
2026-05-12 03:32:45.687 - DEBUG: sched_reserve:  CUDA_Host compute buffer size =   744.30 MiB
2026-05-12 03:32:45.938 - INFO: loaded model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:45.938 - INFO: mesh-llm runtime ready
  ↳ api=http://0.0.0.0:9337
2026-05-12 03:32:45.938 - INFO: model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL ready on port 43877
  ↳ port=43877 | internal_port=43877 | model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:45.938 - INFO: Startup-loaded model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' on :43877
^C2026-05-12 03:32:48.461 - INFO: shutdown requested (SIGINT)
  ↳ signal=SIGINT
2026-05-12 03:32:48.462 - INFO: mesh-llm shutting down
2026-05-12 03:32:48.663 - INFO: unloading model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:48.668 - DEBUG: ~llama_context:      CUDA0 compute buffer size is 1192.3009 MiB, matches expectation of 1192.3009 MiB
2026-05-12 03:32:48.668 - DEBUG: ~llama_context:  CUDA_Host compute buffer size is 744.3047 MiB, matches expectation of 744.3047 MiB
2026-05-12 03:32:48.772 - INFO: Stopped startup model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' from :43877
2026-05-12 03:32:48.772 - INFO: unloaded model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL

@ndizazzo
Collaborator

Merging after #513

@ndizazzo ndizazzo force-pushed the codex/skippy-device-gpu-survey branch from 0b0f98e to 3352e60 Compare May 12, 2026 08:45
@i386 i386 merged commit 57cdac9 into main May 12, 2026
21 checks passed
@i386 i386 deleted the codex/skippy-device-gpu-survey branch May 12, 2026 11:19
@i386
Collaborator Author

i386 commented May 12, 2026

@ndizazzo thanks for picking this up!!

michaelneale added a commit that referenced this pull request May 13, 2026
* origin/main:
  Trim skippy decode overhead (#537)
  chore(docs): update docs to match planned updates
  Split skippy-server frontend module (#536)
  Add Skippy WAN Docker lab (#528)
  Instrument Skippy binary transport timing (#533)
  Report backend GPUs and bundle GPU benchmarks (#509)
  Fuse warm chat prefix restore with first decode (#527)
  Improve skippy prompt layer-package tokenizer handling (#530)
  Use SQLite for metrics server storage (#529)
2 participants