Report backend GPUs and bundle GPU benchmarks by i386 · Pull Request #509 · Mesh-LLM/mesh-llm

i386 · 2026-05-10T21:02:11Z

Summary

reports GPUs from the active skippy/llama backend device enumeration path instead of platform CLI tools, so CUDA reports CUDA devices, Vulkan reports Vulkan devices, Metal reports Metal devices, and ROCm/HIP stays backend-specific
adds the llama.cpp skippy/devices.h ABI plus shared skippy/common.h status/error types, then exposes backend device data through skippy-ffi and skippy-runtime
enriches NVIDIA facts through CUDA/NVML SDK libraries instead of nvidia-smi; Linux skippy-enabled survey can still discover NVIDIA GPUs through SDK libraries when skippy reports no GPU devices
moves GPU benchmark ownership into a new mesh-llm-gpu-bench crate and compiles benchmark backends into mesh-llm instead of discovering membench-fingerprint* helper executables
builds Metal benchmark support into macOS binaries by default, and wires CUDA/ROCm/Intel benchmark backends behind build-flavor features for SDK-backed builds
fixes Jetson/Orin pinned GPU startup compatibility by ignoring placeholder PCI IDs like 00000000:00:00.0, resolving UUID aliases, and accepting the single available pinnable GPU for legacy single-GPU pins
splits the large hardware implementation into focused modules: hardware/mod.rs, hardware/parsers.rs, hardware/tests.rs, hardware/skippy_devices.rs, and hardware/enrichers.rs
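
The pinned-GPU compatibility rules in the summary (ignore placeholder PCI IDs, resolve UUID aliases, accept the single available GPU for legacy pins) could be sketched roughly as below. This is a minimal illustration, not the code in this PR: the `Gpu` struct, its fields, and `resolve_pin` are hypothetical names, and it assumes the fallback applies to any unresolved pin on a host with exactly one pinnable GPU.

```rust
/// Hypothetical summary of one surveyed GPU; the field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Gpu {
    /// Primary identifier, e.g. "pci:00000000:01:00.0" or "uuid:GPU-...".
    stable_id: String,
    /// Other identifiers that refer to the same device (UUID or PCI forms).
    aliases: Vec<String>,
}

/// All-zero PCI address that the PR summary describes as a placeholder.
const PLACEHOLDER_PCI: &str = "00000000:00:00.0";

/// True when the ID (with or without a "pci:" prefix) is the placeholder.
fn is_placeholder_pci(id: &str) -> bool {
    id.strip_prefix("pci:").unwrap_or(id) == PLACEHOLDER_PCI
}

/// Resolve a configured pin against the surveyed GPUs.
fn resolve_pin<'a>(configured: &str, gpus: &'a [Gpu]) -> Option<&'a Gpu> {
    // Placeholder PCI IDs carry no identity, so never match on them directly.
    if !is_placeholder_pci(configured) {
        let hit = gpus.iter().find(|g| {
            g.stable_id == configured || g.aliases.iter().any(|a| a == configured)
        });
        if hit.is_some() {
            return hit;
        }
    }
    // Assumed legacy rule: with exactly one pinnable GPU, accept it.
    if gpus.len() == 1 {
        gpus.first()
    } else {
        None
    }
}
```

On a multi-GPU host the sketch refuses to guess, so an unresolved pin still fails preflight instead of silently landing on the wrong device.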

Architecture

mesh-llm-system keeps survey, cache, fingerprint, and pinned-GPU policy.
mesh-llm-gpu-bench owns native benchmark backend selection and execution.
CUDA/HIP/Intel benchmark code is compiled into the Rust crate for matching build flavors; no runtime CLI benchmark fallback remains.
The active backend identity is preserved rather than overlaid by Vulkan identities.
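
As a rough illustration of compiled-in backend selection with no CLI fallback, the sketch below picks a benchmark backend from the active backend device label. Only the `cuda` feature name appears in this PR's testing notes; the `rocm` and `intel` feature names, the `ROCm` and `SYCL` label prefixes, and all type and function names are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BenchBackend {
    Metal,
    Cuda,
    Hip,
    Intel,
}

/// Backends compiled into this binary. Feature names other than "cuda"
/// are assumed for illustration.
fn compiled_backends() -> Vec<BenchBackend> {
    let mut out = Vec::new();
    if cfg!(target_os = "macos") {
        out.push(BenchBackend::Metal);
    }
    if cfg!(feature = "cuda") {
        out.push(BenchBackend::Cuda);
    }
    if cfg!(feature = "rocm") {
        out.push(BenchBackend::Hip);
    }
    if cfg!(feature = "intel") {
        out.push(BenchBackend::Intel);
    }
    out
}

/// Map a backend device label such as "CUDA0" or "MTL0" to a compiled-in
/// benchmark backend. Returns None when nothing matches: there is no
/// helper-executable fallback in this sketch.
fn select_backend(backend_device: &str, compiled: &[BenchBackend]) -> Option<BenchBackend> {
    let wanted = if backend_device.starts_with("CUDA") {
        BenchBackend::Cuda
    } else if backend_device.starts_with("MTL") {
        BenchBackend::Metal
    } else if backend_device.starts_with("ROCm") {
        BenchBackend::Hip // assumed label prefix
    } else if backend_device.starts_with("SYCL") {
        BenchBackend::Intel // assumed label prefix
    } else {
        return None;
    };
    compiled.iter().copied().find(|b| *b == wanted)
}
```

A caller would pass `&compiled_backends()` as the second argument; keeping the selection function pure makes it testable without any SDK present.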

Protocol

Mesh gossip/protobuf compatibility is unchanged; no wire fields are removed or repurposed.
Existing pinned GPU configs continue to work across the UUID/PCI identity transition, including the Jetson single-GPU case.

Testing

cargo fmt --all -- --check
cargo check -p mesh-llm-gpu-bench
cargo check -p mesh-llm-system
cargo check -p mesh-llm
cargo test -p mesh-llm-system benchmark --lib
cargo test -p mesh-llm-system hardware --lib
just build-dev
local Metal before/after mesh-llm gpus --json: byte-for-byte identical
local Metal mesh-llm gpus --json after log suppression: empty stderr and valid GPU JSON
local compiled-in Metal benchmark: target/debug/mesh-llm gpus benchmark refreshed one GPU fingerprint without any membench-fingerprint* helper binary present
white.local CUDA before/after: hardware fields match; backend label intentionally changes from the old incorrect Vulkan0 overlay to CUDA0
white.local CUDA benchmark crate compile: cargo check -p mesh-llm-gpu-bench --features cuda
white.local CUDA runtime build: PATH=/usr/lib/llvm-21/bin:$PATH just build-runtime cuda
white.local compiled-in CUDA benchmark: target/debug/mesh-llm gpus benchmark refreshed one GPU fingerprint at 908.7 GB/s with no membench-fingerprint* helper binary present
white.local CUDA mesh-llm gpus --json: reports backend_device: CUDA0, stable PCI ID, vendor UUID, and no helper binaries in target
white.local Vulkan before/after: byte-for-byte identical
white.local CPU-linked release build: mesh-llm gpus --json reports NVIDIA GPU via CUDA/NVML SDK discovery and emits empty stderr; scratch/build work used $HOME/tmp/mesh-llm-pr509-review

Notes

The new llama.cpp upstream pin is 389ff61d77b5c71cec0cf92fe4e5d01ace80b797.
CUDA compiled-in benchmark backend was validated on white.local; HIP and Intel still need runtime validation on machines with those SDK toolchains available.

i386 · 2026-05-10T21:10:56Z


 pub(crate) fn run_gpus(json_output: bool) -> Result<()> {
    let mut hw = hardware::survey();
-    hardware::augment_gpu_facts_with_vulkan_devices(&mut hw.gpus);


@ndizazzo augmenting gpu facts like this was wrong I think?

Not sure, seems like an okay idea to me. Why do you figure it was wrong?

ndizazzo

This PR prevents me from starting mesh-llm on my dual-GPU system with the error:

./target/release/mesh-llm serve
configured gpu_id 'pci:00000000:01:00.0' could not be resolved because this host has no pinnable GPUs; available pinnable GPU IDs: none: startup model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' failed pinned GPU preflight

Despite the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   29C    P8              8W /  500W |    3877MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   25C    P8             10W /  300W |    8774MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

The release build detects no GPUs, but a debug build works:

./target/release/mesh-llm gpu
⚠️ No GPUs detected on this node.

./target/debug/mesh-llm gpu
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 41954 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32077 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 9877 MiB
🖥️ GPU 0
  Name: NVIDIA GeForce RTX 5090
  Stable ID: pci:00000000:01:00.0
  Backend device: CUDA0
  VRAM: 34.2 GB
  Bandwidth: 1661.1 GB/s
  Unified memory: no
  PCI BDF: 00000000:01:00.0
  Vendor UUID: GPU-80ded6bd-1a89-2628-3d94-902187dbab1d

🖥️ GPU 1
  Name: NVIDIA GeForce RTX 3080
  Stable ID: pci:00000000:06:00.0
  Backend device: CUDA1
  VRAM: 10.7 GB
  Bandwidth: 720.2 GB/s
  Unified memory: no
  PCI BDF: 00000000:06:00.0
  Vendor UUID: GPU-6b7fe24c-5f15-4ac5-88d6-c8934135a4ea
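
The VRAM figures above differ from the nvidia-smi table (34.2 GB vs 32607MiB, 10.7 GB vs 10240MiB). The numbers are consistent if the tool reports decimal gigabytes; that is an inference from the output, not something stated in the PR. A small sketch of the conversion, with a hypothetical helper name:

```rust
/// Convert mebibytes (2^20 bytes) to decimal gigabytes (10^9 bytes).
/// Hypothetical helper, used only to check the figures quoted above.
fn mib_to_gb(mib: u64) -> f64 {
    (mib * 1_048_576) as f64 / 1e9
}

/// Format with one decimal place, matching the style of the GPU listing.
fn format_gb(mib: u64) -> String {
    format!("{:.1} GB", mib_to_gb(mib))
}
```

With these definitions, `format_gb(32607)` yields "34.2 GB" and `format_gb(10240)` yields "10.7 GB", matching the two devices listed.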

Debug C++ output still appears during device detection on Apple:

Release build:

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.024 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (Apple M4 Pro)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 40200.90 MB
🖥️ GPU 0
  Name: Apple M4 Pro
  Stable ID: metal:0
  Backend device: MTL0
  VRAM: 51.5 GB
  Bandwidth: 199.5 GB/s
  Unified memory: yes

Debug C++ output still appears during device detection on Linux:

Release build:

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 41954 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32077 MiB
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 9877 MiB
...

ndizazzo · 2026-05-10T21:12:47Z

+    use std::ffi::CStr;
+
+    #[derive(Clone, Debug, Default)]
+    struct NvidiaDeviceInfo {


Possible to obtain the CUDA identifier here? CUDA0, or CUDA1?

ndizazzo · 2026-05-12T03:34:44Z

While tracing model loading errors, I added some debugging tools to this PR that are worth keeping:

./target/debug/mesh-llm serve --listen-all --log-format json --debug | npx node scripts/console-format.js

2026-05-12 03:32:34.767 - INFO: invite token ready for mesh 692f2608fecc1c01775d840bd9b011a4
  ↳ token=eyJpZCI6ImM0NWY2M2VjMmUyZDVlOWQxMWFhODY2ZTE3NmI2ZmYxZGIxOWFhMGM1NDIxN2VhNWNhY2EyYWE2Mjg5NGI5Y2EiLCJhZGRycyI6W3siUmVsYXkiOiJodHRwczovL3VzdzEtMi5yZWxheS5taWNoYWVsbmVhbGUubWVzaC1sbG0uaXJvaC5saW5rLi8ifSx7IklwIjoiMTAuNC4wLjEwOjUyNjQyIn0seyJJcCI6IjEwLjQyLjEuMDo1MjY0MiJ9LHsiSXAiOiIxMC40Mi4xLjE6NTI2NDIifSx7IklwIjoiNjQuMTM3LjE1Ny4xMzk6MCJ9LHsiSXAiOiIxNzIuMTcuMC4xOjUyNjQyIn0seyJJcCI6IjE3Mi4xOS4wLjE6NTI2NDIifV19 | mesh=692f2608fecc1c01775d840bd9b011a4
2026-05-12 03:32:34.767 - INFO: waiting for peers
2026-05-12 03:32:34.767 - INFO: startup plan ready (1 process(es), 2 endpoint(s), 1 model(s))
2026-05-12 03:32:34.767 - INFO: api ready at http://0.0.0.0:9337
2026-05-12 03:32:34.767 - INFO: web console ready at http://0.0.0.0:3131
2026-05-12 03:32:34.981 - INFO: loading model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:43.968 - DEBUG: model load plan: metadata rows=51, tensor rows=851
2026-05-12 03:32:43.968 - DEBUG: metadata 10% (6/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 20% (11/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 30% (16/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 40% (21/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 50% (26/51 rows)
2026-05-12 03:32:43.968 - DEBUG: metadata 60% (31/51 rows)
2026-05-12 03:32:43.969 - DEBUG: metadata 70% (36/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 80% (41/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 90% (46/51 rows)
2026-05-12 03:32:44.003 - DEBUG: metadata 100% (51/51 rows)
2026-05-12 03:32:44.003 - DEBUG: Reading model metadata...
  ↳ architecture=qwen35 | blocks=64 | ctx=262144 | embed=5120 | ffn=17408 | heads=24 | kv_heads=4 | name=Qwen3.6-27B | size=27B | tokenizer=gpt2 | tokenizer_pre=qwen35 | type=model
2026-05-12 03:32:44.003 - DEBUG: tensor types 10% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 20% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 30% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 40% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 50% (449/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 60% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 70% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 80% (704/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 90% (774/851 types)
2026-05-12 03:32:44.003 - DEBUG: tensor types 100% (851/851 types)
2026-05-12 03:32:44.003 - DEBUG: Reading tensor groups...
  ↳ f32=449 | iq4_xs=12 | q4_K=207 | q5_K=70 | q6_K=65 | q8_0=48
2026-05-12 03:32:44.005 - DEBUG: llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 29684 MiB free
2026-05-12 03:32:44.068 - DEBUG: init_tokenizer: initializing tokenizer for type 2
2026-05-12 03:32:44.105 - DEBUG: load: special tokens cache size = 33
2026-05-12 03:32:44.148 - DEBUG: load: token to piece cache size = 1.7581 MB
2026-05-12 03:32:44.148 - DEBUG: load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
2026-05-12 03:32:44.149 - DEBUG: layers 10% (7/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 20% (13/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 30% (20/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 40% (26/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 50% (32/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 60% (39/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 70% (45/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 80% (52/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 90% (58/64 layers)
2026-05-12 03:32:44.149 - DEBUG: layers 100% (64/64 layers)
2026-05-12 03:32:44.464 - DEBUG: load_tensors:   CPU_Mapped model buffer size =   682.03 MiB
2026-05-12 03:32:45.577 - DEBUG: llama_kv_cache:      CUDA0 KV buffer size =  5984.00 MiB
2026-05-12 03:32:45.581 - DEBUG: kv cache plan: layer rows=16
2026-05-12 03:32:45.581 - DEBUG: llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 256
2026-05-12 03:32:45.581 - DEBUG: llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 256
2026-05-12 03:32:45.687 - DEBUG: sched_reserve:      CUDA0 compute buffer size =  1192.30 MiB
2026-05-12 03:32:45.687 - DEBUG: sched_reserve:  CUDA_Host compute buffer size =   744.30 MiB
2026-05-12 03:32:45.938 - INFO: loaded model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:45.938 - INFO: mesh-llm runtime ready
  ↳ api=http://0.0.0.0:9337
2026-05-12 03:32:45.938 - INFO: model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL ready on port 43877
  ↳ port=43877 | internal_port=43877 | model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:45.938 - INFO: Startup-loaded model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' on :43877
^C2026-05-12 03:32:48.461 - INFO: shutdown requested (SIGINT)
  ↳ signal=SIGINT
2026-05-12 03:32:48.462 - INFO: mesh-llm shutting down
2026-05-12 03:32:48.663 - INFO: unloading model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
2026-05-12 03:32:48.668 - DEBUG: ~llama_context:      CUDA0 compute buffer size is 1192.3009 MiB, matches expectation of 1192.3009 MiB
2026-05-12 03:32:48.668 - DEBUG: ~llama_context:  CUDA_Host compute buffer size is 744.3047 MiB, matches expectation of 744.3047 MiB
2026-05-12 03:32:48.772 - INFO: Stopped startup model 'unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL' from :43877
2026-05-12 03:32:48.772 - INFO: unloaded model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
  ↳ model=unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL
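
The log above prints the device address as 0000:01:00.0, while the stable IDs earlier in this thread use the eight-digit domain form 00000000:01:00.0. If the two forms ever need to be compared, a normalization step along these lines could be used. This is a hedged sketch with a hypothetical function name, not code taken from this PR.

```rust
/// Normalize a PCI address such as "0000:01:00.0", "00000000:01:00.0", or
/// "pci:00000000:01:00.0" into the eight-digit-domain lowercase form.
/// Returns None when the input is not a domain:bus:device.function address.
fn normalize_pci_bdf(raw: &str) -> Option<String> {
    let raw = raw.trim();
    let raw = raw.strip_prefix("pci:").unwrap_or(raw);

    let mut parts = raw.split(':');
    let domain = parts.next()?;
    let bus = parts.next()?;
    let dev_fn = parts.next()?;
    if parts.next().is_some() {
        return None; // too many ':'-separated fields
    }
    let (dev, func) = dev_fn.split_once('.')?;

    // Parse every field as hexadecimal so that padding differences vanish.
    let domain = u32::from_str_radix(domain, 16).ok()?;
    let bus = u8::from_str_radix(bus, 16).ok()?;
    let dev = u8::from_str_radix(dev, 16).ok()?;
    let func = u8::from_str_radix(func, 16).ok()?;

    Some(format!("{domain:08x}:{bus:02x}:{dev:02x}.{func:x}"))
}
```

Parsing to integers and re-formatting, rather than left-padding the string, also rejects labels such as `CUDA0` that are not PCI addresses at all.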

ndizazzo · 2026-05-12T06:51:38Z

Merging after #513

i386 · 2026-05-12T11:19:54Z

@ndizazzo thanks for picking this up!!

* origin/main:
  Trim skippy decode overhead (#537)
  chore(docs): update docs to match planned updates
  Split skippy-server frontend module (#536)
  Add Skippy WAN Docker lab (#528)
  Instrument Skippy binary transport timing (#533)
  Report backend GPUs and bundle GPU benchmarks (#509)
  Fuse warm chat prefix restore with first decode (#527)
  Improve skippy prompt layer-package tokenizer handling (#530)
  Use SQLite for metrics server storage (#529)

i386 force-pushed the codex/skippy-device-gpu-survey branch from 5924b3a to 185f5c2 Compare May 10, 2026 21:07

i386 changed the title ~~[codex] Report GPUs from skippy backend devices~~ Report GPUs from skippy backend devices May 10, 2026

i386 marked this pull request as ready for review May 10, 2026 21:09

i386 requested a review from ndizazzo May 10, 2026 21:09

i386 assigned ndizazzo May 10, 2026

i386 commented May 10, 2026

ndizazzo requested changes May 10, 2026

i386 force-pushed the codex/skippy-device-gpu-survey branch from 185f5c2 to cad1bd0 Compare May 10, 2026 22:23

i386 changed the title ~~Report GPUs from skippy backend devices~~ Report backend GPUs and bundle GPU benchmarks May 10, 2026

ndizazzo assigned i386 and unassigned ndizazzo May 10, 2026

ndizazzo added the Do not merge label May 11, 2026

ndizazzo self-requested a review May 11, 2026 18:29

ndizazzo assigned ndizazzo and unassigned i386 May 11, 2026

ndizazzo force-pushed the codex/skippy-device-gpu-survey branch 2 times, most recently from 630f8e4 to 2255199 Compare May 12, 2026 03:33

ndizazzo approved these changes May 12, 2026

ndizazzo removed the Do not merge label May 12, 2026

i386 and others added 9 commits May 12, 2026 04:41

Expose skippy backend device survey

7e0338c

Bundle GPU benchmark backends

cb52664

Fix CI benchmark wiring

7457cb5

Allow manual Docker CI runs

31e6866

Restore CPU-only runtime capacity

8110fdb

Fix CUDA loading issue on Linux

73ae627

add debug information to help trace issues on CUDA devices with two GPUs

a30ab6a

fix linters / tests

f7afa05

fixups for remaining tests via mutex

e7030f8

ndizazzo added 3 commits May 12, 2026 04:41

fix ROCm build

c65da16

coupel more fixes for windows

ba05974

debugging on ci is cool

3352e60

ndizazzo force-pushed the codex/skippy-device-gpu-survey branch from 0b0f98e to 3352e60 Compare May 12, 2026 08:45

fix remaining red tests after rebase

8063176

i386 merged commit 57cdac9 into main May 12, 2026
21 checks passed

i386 deleted the codex/skippy-device-gpu-survey branch May 12, 2026 11:19

i386 merged 13 commits into main from codex/skippy-device-gpu-survey

2 participants