Adds a fourth "local" provider alongside llama.cpp / Ollama / LM Studio.
Unlike those, this one needs nothing installed — model weights download
from HuggingFace on first use (~500MB for q4 Qwen 3 0.6B), cached in
IndexedDB by @huggingface/transformers, inference runs on the user's
GPU in the extension's offscreen document.
Architecture:
  service worker
    └─ WebGPUProvider.chat() ──┐
                               ▼ chrome.runtime.sendMessage
  offscreen document
    └─ @huggingface/transformers
         pipeline('text-generation', 'onnx-community/Qwen3-0.6B-ONNX',
                  { device: 'webgpu', dtype: 'q4' })
Service workers have no WebGPU; the offscreen document does. We reuse
the existing offscreen doc (already hosting the local-network fetch
proxy) and add new message handlers `webgpu-chat` and `webgpu-probe`.
Tool use: enabled. Qwen 3's chat template renders `tools=[...]` into
the system prompt and the model emits `<tool_call>{...}</tool_call>`
blocks; offscreen.js parses them into OpenAI-format tool_calls so the
agent's loop detector / dispatch see WebGPU exactly like any other
provider. Reliability at 0.6B is mixed — the settings card will nudge
users toward Ask mode in a follow-up.
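The tool-call parsing described above could look roughly like the following. This is a minimal sketch, not the actual offscreen.js code: `parseToolCalls` and the `call_N` id scheme are hypothetical, and it assumes each `<tool_call>` block holds one JSON object with `name` and `arguments` fields.

```javascript
// Hypothetical helper (not the real offscreen.js implementation).
// Extracts <tool_call>{...}</tool_call> blocks from model output and
// converts them to OpenAI-style tool_calls entries.
function parseToolCalls(text) {
  const toolCalls = [];
  const re = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let match;
  while ((match = re.exec(text)) !== null) {
    try {
      const parsed = JSON.parse(match[1]);
      if (typeof parsed.name !== 'string') continue;
      toolCalls.push({
        id: `call_${toolCalls.length}`, // assumed id scheme
        type: 'function',
        function: {
          name: parsed.name,
          // OpenAI format carries arguments as a JSON string.
          arguments: JSON.stringify(parsed.arguments ?? {}),
        },
      });
    } catch {
      // Small models sometimes emit malformed JSON; skip that block.
    }
  }
  // Whatever remains outside the blocks is ordinary assistant text.
  const content = text.replace(re, '').trim();
  return { content, toolCalls: toolCalls.length ? toolCalls : undefined };
}
```

Malformed blocks are dropped rather than raised, which seems the safer default given the mixed reliability at 0.6B.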
Streaming: v1 returns the full response (no per-token streaming yet).
The 0.6B model finishes a normal turn in seconds; the round-trip-and-
yield simplification let us ship the provider without first solving
the background↔offscreen chunked-message router. Comment in
webgpu.js's chatStream() flags the upgrade target.
Default-disabled (`enabled:false`) because the first-run download is
substantial and the library has to be vendored locally — see
src/chrome/vendor/transformers/README.md for the one-command vendoring
flow. The provider returns a clear "library not vendored" error when
the file is missing, so the failure mode is obvious.
Firefox: stub that fails fast with "not yet supported on Firefox".
Firefox doesn't have browser.offscreen and its extension-context WebGPU
exposure is its own can of worms — wiring those is its own future PR.
Stub stays so the categorization parity test stays green.
Tests: 4 new (130 total, all passing). They cover: webgpu provider is
present, local, and disabled by default; no network fields (truly
in-browser); _createProvider wires the right class; chrome/firefox
provider sets stay in sync. The actual chat() path can't be exercised
in Node (no chrome.offscreen, no WebGPU), but the wiring and
classification can.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7ed0192e5
    type: result.toolCalls ? 'tool_call' : 'text',
    content: result.content,
    toolCalls: result.toolCalls,
Emit tool calls in streaming chunk content
When chatStream() returns a tool call, this chunk sets content to result.content (text) instead of the tool-call array. The streaming agent path (processMessageStream) iterates chunk.content as tool-call deltas, so WebGPU tool calls are dropped/misparsed and Act-mode tool execution fails whenever the chat_stream workflow is used.
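One way to address this, sketched under the reviewer's description of the consumer (that `processMessageStream` reads tool calls from `chunk.content`): `toStreamChunk` is a hypothetical helper name, and the exact delta shape the agent expects is an assumption.

```javascript
// Hypothetical helper: build the single chunk that chatStream() yields.
// Assumes the streaming consumer reads tool calls from chunk.content
// when chunk.type is 'tool_call', as the review comment describes.
function toStreamChunk(result) {
  const hasToolCalls =
    Array.isArray(result.toolCalls) && result.toolCalls.length > 0;
  return hasToolCalls
    ? { type: 'tool_call', content: result.toolCalls }
    : { type: 'text', content: result.content ?? '' };
}
```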
    }

    async function getPipeline(modelId, dtype, device) {
      if (_activePipeline && _activeModelId === modelId) return _activePipeline;
Rebuild cached pipeline when dtype or device changes
The pipeline cache key only checks modelId, but users can edit dtype and device in provider settings. After one successful load, changing quantization/backend (for example q4→q8 or webgpu→wasm) will silently keep using the old pipeline, so configuration changes do not take effect until the offscreen document is recreated.
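A cache keyed on all three settings might look like the sketch below. The injected `load` parameter is hypothetical (in the extension it would presumably wrap the transformers.js `pipeline()` call), and the `dispose()` call is guarded because the sketch does not assume the cached object provides it.

```javascript
// Sketch: cache the pipeline under a composite key so that changing
// dtype or device forces a rebuild. `load` is an injected, hypothetical
// loader with the shape (modelId, { dtype, device }) => Promise<pipeline>.
let _activePipeline = null;
let _activeKey = null;

async function getPipeline(modelId, dtype, device, load) {
  const key = JSON.stringify([modelId, dtype, device]);
  if (_activePipeline && _activeKey === key) return _activePipeline;

  // Release the previous pipeline before replacing it, if it supports that.
  if (_activePipeline && typeof _activePipeline.dispose === 'function') {
    await _activePipeline.dispose();
  }
  _activePipeline = await load(modelId, { dtype, device });
  _activeKey = key;
  return _activePipeline;
}
```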
The first time a user picks the WebGPU provider, ~500MB of Qwen 3
weights pull from HF Hub, a ~30-60s wait that the existing UI gives
no hint of. It renders as a frozen "thinking…" spinner, which is
indistinguishable from a hang.
Add a progress card at the top of the messages container:
┌──────────────────────────────────────────────────┐
│ Downloading onnx-community/Qwen3-0.6B-ONNX — │
│ 142 / 487 MB │
│ █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ onnx/model_q4.onnx │
└──────────────────────────────────────────────────┘
- Aggregates loaded/total across files (model has ~8 parallel
downloads — weights, tokenizer, config, etc.).
- Bar fills to 100% + flips green on the 'ready' event, then
auto-dismisses ~1.8s later so the user sees confirmation.
- Throttled to one progress update per file per 200ms so the
message channel doesn't drown in callbacks.
- Fire-and-forget broadcast from offscreen → sidepanel (.catch
swallows "no listener" errors when no panel is open).
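The aggregation and throttling in the list above can be sketched as follows. This is an illustration rather than the shipped code: `createProgressAggregator` is a hypothetical name, the event shape `{ status: 'progress', file, loaded, total }` is assumed, and the clock is injectable only to make the throttle easy to exercise.

```javascript
// Sketch: sum loaded/total across files and forward at most one update
// per file per interval. `send` is whatever broadcasts to the sidepanel.
function createProgressAggregator(send, { intervalMs = 200, now = Date.now } = {}) {
  const files = new Map();    // file -> { loaded, total }
  const lastSent = new Map(); // file -> timestamp of last forwarded update

  return function onProgress(event) {
    if (event.status !== 'progress') return;
    // Always record the latest numbers, even if this update is throttled.
    files.set(event.file, { loaded: event.loaded, total: event.total });

    const t = now();
    const last = lastSent.get(event.file);
    if (last !== undefined && t - last < intervalMs) return;
    lastSent.set(event.file, t);

    let loaded = 0;
    let total = 0;
    for (const f of files.values()) {
      loaded += f.loaded;
      total += f.total;
    }
    send({ file: event.file, loaded, total });
  };
}
```

Recording the byte counts before the throttle check means a skipped update still contributes to the next aggregate that does get sent.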
Firefox side has the same listener + renderer for parity, even
though the Firefox WebGPU provider itself is still stubbed — once
the Firefox path is wired up, the progress UI is ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commits the three files the WebGPU provider needs:
src/chrome/vendor/transformers/
├─ transformers.web.min.js (422 KB — browser ESM bundle)
├─ ort-wasm-simd-threaded.jsep.mjs ( 46 KB — WASM loader shim)
└─ ort-wasm-simd-threaded.jsep.wasm ( 25 MB — WebGPU ONNX runtime)
Yes, the .wasm is 25MB. It's the cost of shipping a real local LLM
runtime — there's no smaller variant that does WebGPU. The trade-off:
the extension grows from ~2MB to ~28MB on disk, in exchange for a
provider that works straight from `git clone` with zero per-developer
setup. Previous behaviour was "vendor the library yourself per the
README" which is realistic for a 1-person team and friction for any
larger group.
Implementation details:
- We vendor the .web.min.js variant (not the dual ESM/CJS
transformers.min.js or the Node builds). Smaller, browser-only,
matches our actual import path.
- env.backends.onnx.wasm.wasmPaths is pinned to the vendor dir's
chrome-extension:// URL. Without this the loader resolves the
WASM path relative to transformers.web.min.js's URL — which
happens to work today because they're siblings, but only by
accident. Setting it explicitly makes the wiring obvious and
survives future re-vendoring at different paths. Wrapped in
try/catch so library shape changes between versions fall back
to default resolution.
- The CPU-fallback WASM variants (.wasm / .asyncify.wasm /
.jspi.wasm) are intentionally NOT vendored: systems without
WebGPU get a clear "WebGPU not available" error instead. Saves
~40MB of WASM we don't use. Add them later if CPU fallback
becomes a real ask.
- Firefox vendor dir stays empty (gitignored) — the Firefox WebGPU
provider is still a stub; no point shipping 25MB of WASM it
doesn't reach. Comment in .gitignore flags this for whoever
wires the Firefox path next.
- package.json now lists @huggingface/transformers as a regular
dep (not devDep) — semantically wrong for an ESM file we commit,
but useful: `npm install` keeps the version pinned for whoever
needs to update the vendored files later. The README documents
the update flow.
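The wasmPaths pinning from the list above might be wired like this. It is a sketch with assumptions: `pinWasmPaths` is a hypothetical helper, `env` stands for the object exported by the vendored library, `getURL` stands for `chrome.runtime.getURL`, and the `vendor/transformers/` path assumes the extension root is `src/chrome/`.

```javascript
// Sketch: point the ONNX WASM loader at the vendored directory.
// Returns false instead of throwing if the library's env shape differs
// between versions, so the loader falls back to default resolution.
function pinWasmPaths(env, getURL) {
  try {
    // Assumed path relative to the extension root.
    env.backends.onnx.wasm.wasmPaths = getURL('vendor/transformers/');
    return true;
  } catch (err) {
    return false;
  }
}

// In the offscreen document this would be called roughly as:
//   pinWasmPaths(env, (p) => chrome.runtime.getURL(p));
```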
The README in the vendor dir reflects the new "it's checked in"
reality and explains the update procedure for next time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
specifier)
Real bug shipped with the previous "vendor the library" commit:
transformers.web.min.js contains a dynamic
`import("onnxruntime-web/webgpu")` — a bare module specifier. The
browser can't resolve bare specifiers without a build step or an
import map, and the WebGPU provider failed on first chat with:
Failed to resolve module specifier "onnxruntime-web/webgpu".
Relative references must start with either "/", "./", or "../".
Two-line fix:
1. Vendor onnxruntime-web/dist/ort.webgpu.bundle.min.mjs (111KB,
fully self-bundled — no further bare imports inside it).
2. Rewrite the bare specifier in our vendored transformers.web.min.js
to "./ort.webgpu.bundle.min.mjs" so it resolves as a relative
URL against the patched file's own location. One sed replace,
verified the count goes 1→0.
Why not an import map: MV3's CSP `script-src 'self'` can block inline
`<script type="importmap">` on some Chrome versions. Patching the
specifier sidesteps the CSP question entirely.
The webgpu bundle is self-contained (the bundled variant inlines all
ONNX-runtime dependencies it needs at WebGPU-init time), so no
external WASM fetch happens during normal WebGPU inference. The
existing jsep.wasm + jsep.mjs files stay vendored as a defensive
fallback path in case env.backends.onnx.wasm.wasmPaths ever gets
hit at runtime — they're never loaded for WebGPU, but cost nothing
since they're already there.
Vendor README updated with the sed step + verification command so
re-vendoring a future library version doesn't reintroduce the bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes folded into one:
1. CHROME WEB STORE / AMO compatibility. The previous commit vendored
.min.js builds; both stores want readable source for review and can
reject or stall reviews of minified blobs. Switch to the unminified
variants:
transformers.web.min.js → transformers.web.js (~422K → 1.1M)
ort.webgpu.bundle.min.mjs → ort.webgpu.mjs (~111K → 662K)
Total vendor dir grows from ~26MB to ~27MB. Negligible at runtime
(the JS still parses in milliseconds), worth a lot for the review
process. The 25MB WASM stays where it was; it's already
not-text-readable by nature.
2. THE BARE-SPECIFIER FIX, BUT AGAINST THE RIGHT FILE. The previous
commit sed-patched transformers.web.MIN.js — but offscreen.js
actually loads transformers.web.js after this commit. The minified
sibling never loaded, so the fix never ran. Reported as "still the
same error" by the user.
In the unminified .web.js the bare import is a STATIC import (not
the dynamic form the minifier emits):
import * as ONNX_WEB from "onnxruntime-web/webgpu"; // line 7547
sed -i 's|"onnxruntime-web/webgpu"|"./ort.webgpu.mjs"|' \
src/chrome/vendor/transformers/transformers.web.js
One occurrence, replaced, verified count goes 1→0 with grep.
Why not the "bundle" variant of onnxruntime-web/webgpu (the .bundle
.min.mjs that inlines everything)? It's only available minified.
The plain ort.webgpu.mjs is unminified and has no bare imports of
its own (only Node-specific `node:fs` / `node:os` requires that
never fire in browsers).
Vendor README updated end-to-end:
- "What's here" table reflects the new file names + sizes
- Adds a "Why unminified" callout pointing at store policy
- Update procedure has the new cp + sed lines
- "Files NOT vendored" explains why we skip the .bundle variants
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported same error message but a different specifier:
"Failed to resolve module specifier 'onnxruntime-common'".
Root cause: transformers.web.js has TWO bare specifiers, not one.
The first fix (`onnxruntime-web/webgpu` → relative path) resolved
one, but at line 7605 there's a second:
import { Tensor } from "onnxruntime-common";
onnxruntime-common is a separate npm package providing Tensor +
session types. It's a transitive dep of @huggingface/transformers
(via onnxruntime-web).
Fix: wholesale-vendor its ESM tree, sed-patch the import.
- Copy node_modules/onnxruntime-common/dist/esm/*.js (21 small
files, ~85KB total) into vendor/transformers/onnxruntime-common/.
The ESM tree is self-contained — all inter-file imports are
already relative, no further patches needed.
- sed: "onnxruntime-common" → "./onnxruntime-common/index.js" in
transformers.web.js. One occurrence, replaced, verified.
Also added a defensive whole-tree bare-specifier sweep to the
vendor README's verification step — catches future versions that
introduce a THIRD bare import without needing a debug-runtime
round-trip.
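For illustration, a bare-specifier sweep could be written as below. This is not the README's verification command, just a sketch: `findBareSpecifiers` is a hypothetical name, the regex covers only common static and dynamic import forms, and it does not strip comments, so an import shown inside a JSDoc example would also be reported.

```javascript
// Sketch: list import specifiers that a browser cannot resolve without
// an import map (anything not starting with "/", "./" or "../" and not
// carrying a URL scheme such as "https:" or "node:").
function findBareSpecifiers(source) {
  const re = /(?:\bfrom\s*|\bimport\s*\(\s*|\bimport\s+)(["'])([^"']+)\1/g;
  const found = new Set();
  let match;
  while ((match = re.exec(source)) !== null) {
    const spec = match[2];
    const isRelative =
      spec.startsWith('/') || spec.startsWith('./') || spec.startsWith('../');
    const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(spec);
    if (!isRelative && !hasScheme) found.add(spec);
  }
  return [...found];
}
```

Run from Node over each vendored file's text (for example via `fs.readFileSync(path, 'utf8')`), an empty result would mean no further patches are needed for that file.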
The remaining "@huggingface/transformers" hit at line ~10667 is a
JSDoc example string inside a comment block, not a real import.
README documents this so future maintainers don't get spooked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ONNX Runtime Web dynamically picks a WASM variant at load time. For
Qwen 3 on WebGPU, ops that can't run on the GPU fall back to CPU,
which needs ort-wasm-simd-threaded.asyncify.{mjs,wasm} — without
this pair the runtime errors with "no available backend found,
Failed to fetch dynamically imported module .../asyncify.mjs".
Add both files (~23MB wasm + 47KB loader) and document why .jspi
and the plain variant are still skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'q4' uses 4-bit weights with fp32 activations. The activation buffers for Qwen 3 0.6B mid-inference overrun the WASM 2GB heap, producing 'std::bad_alloc' out of OrtRun on most laptops. 'q4f16' (4-bit weights + fp16 activations) cuts the activation footprint in half and is the dtype the transformers.js team recommends for Qwen on WebGPU.
Update the default in WebGPUProvider, the seed config in providers/manager.js, and the placeholder text in the dtype settings field, in both the chrome and firefox builds.
NOTE: existing users with a stored dtype:'q4' need to either remove and re-add the WebGPU provider, or edit the dtype field in Settings. The first run after switching will re-download ~500MB (q4f16 weights); the old q4 weights stay in IndexedDB but go unused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some Chrome/GPU combos hit 'Integer overflow' from safeint.h during OrtRun on Qwen 3 + q4f16. The mixed-precision quantization kernels take an int32-shape code path that overflows for the model's attention buffer math. fp16 uses a single precision (half precision) throughout and sidesteps the issue at the cost of a ~1.2GB download (vs ~500MB).
- Note the workaround in offscreen.js's pipeline-load comment.
- Add a Troubleshooting table to the vendor README covering the full error cascade we've walked through: bare-specifier, asyncify-mjs, bad_alloc, integer-overflow, no-backend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If WebGPU silently falls back to a software adapter (SwiftShader on Windows when discrete GPU is power-saved, Lavapipe on Linux without a Vulkan driver, etc.), inference burns 500MB on a download then OOMs the WASM heap with std::bad_alloc on first token. From the user's side this looks like dtype/model bugs.
Make the offscreen probe call requestAdapter() and report isFallbackAdapter. webgpu.js#testConnection turns that into a specific error message naming chrome://flags.
The pipeline loader also logs adapter info + onnx backend keys to the offscreen DevTools console so we can diagnose future "all dtypes OOM" reports without another round-trip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
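A probe along these lines could be sketched as follows. `probeWebGPU` and its return shape are hypothetical, and the `gpu` argument stands for `navigator.gpu`. Because browsers have exposed the fallback flag in different places over time, the sketch checks both `adapter.info.isFallbackAdapter` and `adapter.isFallbackAdapter`; which one a given Chrome version provides is an assumption to verify.

```javascript
// Sketch: report whether WebGPU is usable and whether the adapter is a
// software fallback. Pass navigator.gpu (or undefined if it is absent).
async function probeWebGPU(gpu) {
  if (!gpu) {
    return { available: false, reason: 'navigator.gpu is undefined' };
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { available: false, reason: 'requestAdapter() returned null' };
  }
  // Check both locations where the flag may live.
  const isFallbackAdapter = Boolean(
    adapter.info?.isFallbackAdapter ?? adapter.isFallbackAdapter
  );
  return { available: true, isFallbackAdapter };
}
```

A caller such as testConnection could then refuse to start the download when `isFallbackAdapter` is true.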
transformers.js's init code auto-sets wasmPaths to the .asyncify
variant for non-Safari browsers (line ~7786 of transformers.web.js).
The asyncify wasm has Asyncify stack-switching support but NO JSEP
(JavaScript Execution Provider) exports — and the WebGPU EP is
plumbed THROUGH JSEP.
ort.webgpu.mjs calls things like `wasm2.jsepOnCreateSession?.()`
with optional chaining; when those exports are undefined, WebGPU
initialization SILENTLY no-ops. The runtime then runs the entire
model on the WASM CPU backend, blowing the 2GB heap on any sub-1B
model. From the user's side this looks like 'std::bad_alloc on every
dtype' even though chrome://gpu shows WebGPU is hardware-accelerated.
Fix: set wasmPaths to the {mjs, wasm} object form pointing at the
.jsep files. The urlOverride path in ort.webgpu.mjs uses them
directly, bypassing the asyncify default. .jsep.wasm exports the
jsep* functions the WebGPU EP needs.
Add hasWebgpuBackend + wasmPaths to the diagnostic log so a future
regression is one line to spot. Update the troubleshooting table.
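The object form mentioned above takes separate URLs for the loader and the binary. A small sketch of building it, where `wasmPathsFor` is a hypothetical helper and the file-name pattern follows the vendored files named earlier in this PR:

```javascript
// Sketch: build the { mjs, wasm } object form of wasmPaths for one
// runtime variant (e.g. 'jsep' or 'asyncify') under a base URL.
function wasmPathsFor(baseUrl, variant) {
  const base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/';
  const stem = `ort-wasm-simd-threaded.${variant}`;
  return { mjs: `${base}${stem}.mjs`, wasm: `${base}${stem}.wasm` };
}

// Assumed usage inside the offscreen document:
//   env.backends.onnx.wasm.wasmPaths =
//     wasmPathsFor(chrome.runtime.getURL('vendor/transformers/'), 'jsep');
```

Keeping the variant as a parameter makes it a one-word change if a different build turns out to be the right one.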
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The asyncify wasm (which is the WebGPU-capable build in
onnxruntime-web 1.20+ — webgpuInit / webgpuRegisterDevice live there,
NOT in the jsep wasm) uses threading for its heap allocator. Threading
needs SharedArrayBuffer. SharedArrayBuffer needs crossOriginIsolated.
That needs cross_origin_embedder_policy + cross_origin_opener_policy
in the manifest.
Without isolation, the wasm falls back to a plain ArrayBuffer heap
that's tiny — and inference std::bad_allocs on any 100MB+ allocation
even when chrome://gpu shows WebGPU is hardware-accelerated and
navigator.gpu hands out a real adapter. Confusing because the surface
error looks like model-too-big rather than configuration.
Add COOP/COEP to the chrome manifest. Also revert the wasmPaths
override to point at .asyncify.{mjs,wasm} (the previous commit
mistakenly pointed at .jsep, which lacks the webgpu* exports and gave
us 'webgpuInit is not a function' instead). Add crossOriginIsolated +
SharedArrayBuffer presence to the diagnostic log so the manifest
change is verifiable without DevTools spelunking.
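The isolation check added to the diagnostic log could be as small as the sketch below. `isolationDiagnostics` is a hypothetical name; `crossOriginIsolated` and `SharedArrayBuffer` are the standard globals the paragraph above refers to.

```javascript
// Sketch: report whether the current context is cross-origin isolated
// and whether SharedArrayBuffer is exposed. `scope` defaults to the
// global object so the function works in a document or a worker.
function isolationDiagnostics(scope = globalThis) {
  return {
    crossOriginIsolated: scope.crossOriginIsolated === true,
    hasSharedArrayBuffer: typeof scope.SharedArrayBuffer === 'function',
  };
}

// Assumed usage in the pipeline loader:
//   console.log('[webgpu] isolation', isolationDiagnostics());
```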
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Past the WASM-heap saga: WebGPU is now actually running the model (input log shows past_conv.0 / past_recurrent.0 state tensors, confirming Qwen 3.5 0.8B's hybrid Mamba+Transformer architecture is executing). The new error is a Dawn-side "Failed to allocate memory for buffer mapping" during mapAsync, which reads GPU buffers back to CPU.
Hybrid/vision models like Qwen 3.5 have past_conv, past_recurrent, AND transformer KV cache. That's a lot of buffers to map back and forth between GPU and CPU on every forward pass. Setting preferredOutputLocation: 'gpu-buffer' keeps the outputs as GPU buffers, so the next forward pass can feed them directly without the round-trip and Dawn doesn't run out of mapping staging memory.
transformers.js attempts this automatically for kv-cache outputs when the model config provides cache_config, but the wiring doesn't always populate the right names for hybrid/VL models. Setting it globally is the safe override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#   manifest.json
#   package-lock.json
#   package.json
#   src/chrome/ARCHITECTURE.md
#   src/chrome/manifest.json
#   src/chrome/src/ui/settings.js
#   src/firefox/ARCHITECTURE.md
#   src/firefox/manifest.json
#   src/firefox/src/ui/settings.js
#   test/run.js
@codex what's wrong here? I'm getting: Error: webgpu: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly.
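One plausible reading, not confirmed anywhere in this thread: once outputs are kept on the GPU (as with `preferredOutputLocation: 'gpu-buffer'`), any code that reads a tensor's `data` synchronously hits this error, and the message itself points at `getData()`. A hedged sketch of a location-aware read, where `readTensorData` is a hypothetical helper and the tensor is assumed to expose `location`, `data`, and an async `getData()`:

```javascript
// Sketch: read tensor contents regardless of where they live.
// CPU-resident tensors expose `data` directly; anything else is
// downloaded with the async getData() the error message recommends.
async function readTensorData(tensor) {
  const onCpu = tensor.location === undefined || tensor.location === 'cpu';
  if (onCpu) return tensor.data;
  return await tensor.getData();
}
```

Only outputs that are actually consumed on the CPU (such as logits for sampling) would need this; cache tensors fed back into the next forward pass can stay on the GPU.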
Summary
Adds a new `webgpu` provider type — a fourth "local" provider that runs Qwen 3 0.6B entirely in the browser via WebGPU + ONNX (`@huggingface/transformers`). Unlike llama.cpp / Ollama / LM Studio, it needs nothing installed: model weights download from HuggingFace on first use (~500MB q4), cached in IndexedDB, inference runs in the extension's offscreen document on the user's GPU.
Version bump 7.3.1 → 7.4.0.
Why
The other "local" providers all require the user to install + run a separate server. That's a real onboarding cliff. WebGPU + ONNX gives us a zero-install local option — useful for trying out webbrain without committing to a heavier setup, and as a privacy-preserving fallback when a user just wants to ask quick questions about a page without their data hitting any third party.
Architecture
```
service worker (background)
└─ WebGPUProvider.chat() ──→ chrome.runtime.sendMessage
│
offscreen document ←──────────────────────┘
└─ @huggingface/transformers
pipeline('text-generation', 'onnx-community/Qwen3-0.6B-ONNX',
{ device: 'webgpu', dtype: 'q4' })
```
Tool use
Enabled. Qwen 3's chat template knows how to render `tools=[...]` into the system prompt; the model emits `<tool_call>{...}</tool_call>` blocks. Offscreen.js parses them back into OpenAI-format `tool_calls` so webbrain's loop detector + dispatch treat WebGPU exactly like any other provider.
Reliability at 0.6B is mixed — this is small-model territory. A follow-up UI hint should nudge users toward Ask mode (similar to how we handle the existing small-model warnings).
Streaming
Deferred. v1 returns the full response in one shot. The 0.6B model finishes a normal turn in seconds, so this is acceptable; the background↔offscreen chunked-message router is the kind of plumbing that wants its own PR. A comment in `chatStream()` marks the upgrade target.
Library vendoring
`@huggingface/transformers` is ~5MB JS + ~30MB ONNX-runtime-web WASM. Too big to commit. `src/chrome/vendor/transformers/README.md` documents the one-command vendoring flow:
```bash
npm install @huggingface/transformers
cp node_modules/@huggingface/transformers/dist/transformers.min.js \
   src/chrome/vendor/transformers/
# also copy the matching ort-wasm-simd-threaded.* files
```
The provider fails fast with a clear "library not vendored" message if the file is missing, so the failure mode is obvious to anyone testing.
`.gitignore` excludes `*.js`/`*.wasm`/`*.mjs` inside the vendor dirs so an accidental `git add .` doesn't commit 30MB of WASM. The README stays tracked.