[BUG] Generator deadlocks after a client disconnects mid-stream — all subsequent requests hang forever while /v1/models keeps returning 200

### OS

Windows

### GPU Library

CUDA 12.x

### Python version

3.11

### Describe the bug

> **Env note:** "Windows" selected only because the dropdown lacks an option for it — actually running under **WSL2 (Ubuntu) on Windows 11**, not native Windows. Python dropdown has no 3.11, but the **actual version is 3.11.15** (please do not read this as 3.12). exllamav3 **0.0.42** (source commit `595d6c4`), tabbyAPI `master` commit `c1655d1`, GPU 2× (12GB + 8GB, tensor split), model `Qwopus3.6-27B-v2-exl3-3.08bpw`.

When an SSE/streaming chat request is **disconnected by the client mid-generation**, the generator appears to deadlock. From that moment on **no request ever completes again** (no further `tokens generated` lines), even though new requests are still accepted and logged with HTTP 200.

Critically, **`GET /v1/models` keeps returning `200`**, so port/HTTP liveness checks and reverse proxies do **not** detect the outage — the server looks healthy but produces zero tokens, with GPU memory still fully allocated and GPU utilization at ~0%. Only a full restart recovers it.

This makes the failure hard to detect and recurrent behind a proxy / Open WebUI, where clients routinely cancel or navigate away mid-generation.

---

### Reproduction steps

1. Run with continuous batching (`max_batch_size: 6`) and a slow-ish model (~14 tok/s here). Relevant `config.yml`:
   ```yaml
   model_name: Qwopus3.6-27B-v2-exl3-3.08bpw
   max_seq_len: 81920
   cache_size: 327680
   cache_mode: 2,2
   max_batch_size: 6
   gpu_split: [12, 8]
2. Send streaming ("stream": true) chat completion requests.
3. Have a client disconnect while a generation is in progress (client read timeout, user hits "stop", or browser tab closed). Server logs ERROR: Request disconnected: /v1/chat/completions.
4. Send further chat completion requests → they are accepted (HTTP 200, Received chat completion streaming request) but never produce a single token.

---

### Expected behavior

```markdown
The disconnected job should be aborted and reaped, other in-flight jobs should continue, and subsequent requests should generate normally. A mid-stream client disconnect must not wedge the whole generator.

### Logs

Last successful generation, then a mid-stream disconnect, then permanent silence (JST):

10:04:23  ...: 210 tokens generated in 15.42 seconds (... Generate: 14.6 T/s, Context: 22203 tokens)   # last completion ever
10:04:43  INFO:  127.0.0.1 - "POST /v1/chat/completions"
10:04:53  ERROR: Request disconnected: /v1/chat/completions                                            # <-- trigger
10:04:53  INFO:  127.0.0.1 - "POST /v1/chat/completions" 200
10:05:03  INFO:  Received chat completion streaming request 44d2d440ddfc49888593d752b66e2ba6           # never completes
10:05:04  INFO:  Received chat completion streaming request 0bcfc967ba73400b8d5f1df5b8092c40           # never completes
... every later POST is accepted but NO "tokens generated" line ever appears again

After this point GPU memory stays fully allocated (model loaded) but GPU utilization sits at ~0%, i.e. the generator is idle/wedged rather than busy.

### Additional context

- exllamav3 commit `a03f0ff` ("AsyncGenerator: Ensure cancel request is forwarded but don't crash if frontend breaks contract", 2026-06-12) **is already included in 0.0.42**, yet this still occurs — so that fix does not fully cover this path.
- exllamav3 `0.0.43` contains only quantization/MTP/conversion changes; nothing in the generator/cancel path.
- Looks related to (but not fixed by) the older closed issues #81 "Error when aborting mid-generation with SSE enabled" and #98 "generate_chat_completion not set abort_event".
- The proxy in front of tabbyAPI uses no client timeout (`ClientTimeout(total=None)`), so the disconnect originates from end clients, not the proxy.
- Reproduced on Python **3.11.15** specifically (not 3.12 — cf. #416). The hang is in the generator abort path, not version-specific.
- **Workaround in use:** a watchdog that sends a tiny real chat-completion probe and restarts the service when generation stops responding for two consecutive checks (port checks are insufficient because `/v1/models` stays up).

### Acknowledgements

- [x] I have looked for similar issues before submitting this one.
- [x] I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- [x] I understand that the developers have lives and my issue will be answered when possible.
- [x] I understand the developers of this program are human, and I will ask my questions politely.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] Generator deadlocks after a client disconnects mid-stream — all subsequent requests hang forever while /v1/models keeps returning 200 #428

OS

GPU Library

Python version

Describe the bug

Reproduction steps

Expected behavior

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] Generator deadlocks after a client disconnects mid-stream — all subsequent requests hang forever while /v1/models keeps returning 200 #428

Description

OS

GPU Library

Python version

Describe the bug

Reproduction steps

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions