OS
Windows
GPU Library
CUDA 12.x
Python version
3.11
Describe the bug
Env note: "Windows" selected only because the dropdown lacks an option for it — actually running under WSL2 (Ubuntu) on Windows 11, not native Windows. Python dropdown has no 3.11, but the actual version is 3.11.15 (please do not read this as 3.12). exllamav3 0.0.42 (source commit 595d6c4), tabbyAPI master commit c1655d1, GPU 2× (12GB + 8GB, tensor split), model Qwopus3.6-27B-v2-exl3-3.08bpw.
When an SSE/streaming chat request is disconnected by the client mid-generation, the generator appears to deadlock. From that moment on no request ever completes again (no further tokens generated lines), even though new requests are still accepted and logged with HTTP 200.
Critically, GET /v1/models keeps returning 200, so port/HTTP liveness checks and reverse proxies do not detect the outage — the server looks healthy but produces zero tokens, with GPU memory still fully allocated and GPU utilization at ~0%. Only a full restart recovers it.
This makes the failure hard to detect and recurrent behind a proxy / Open WebUI, where clients routinely cancel or navigate away mid-generation.
Reproduction steps
- Run with continuous batching (
max_batch_size: 6) and a slow-ish model (~14 tok/s here). Relevant config.yml:
model_name: Qwopus3.6-27B-v2-exl3-3.08bpw
max_seq_len: 81920
cache_size: 327680
cache_mode: 2,2
max_batch_size: 6
gpu_split: [12, 8]
- Send streaming ("stream": true) chat completion requests.
- Have a client disconnect while a generation is in progress (client read timeout, user hits "stop", or browser tab closed). Server logs ERROR: Request disconnected: /v1/chat/completions.
- Send further chat completion requests → they are accepted (HTTP 200, Received chat completion streaming request) but never produce a single token.
Expected behavior
The disconnected job should be aborted and reaped, other in-flight jobs should continue, and subsequent requests should generate normally. A mid-stream client disconnect must not wedge the whole generator.
### Logs
Last successful generation, then a mid-stream disconnect, then permanent silence (JST):
10:04:23 ...: 210 tokens generated in 15.42 seconds (... Generate: 14.6 T/s, Context: 22203 tokens) # last completion ever
10:04:43 INFO: 127.0.0.1 - "POST /v1/chat/completions"
10:04:53 ERROR: Request disconnected: /v1/chat/completions # <-- trigger
10:04:53 INFO: 127.0.0.1 - "POST /v1/chat/completions" 200
10:05:03 INFO: Received chat completion streaming request 44d2d440ddfc49888593d752b66e2ba6 # never completes
10:05:04 INFO: Received chat completion streaming request 0bcfc967ba73400b8d5f1df5b8092c40 # never completes
... every later POST is accepted but NO "tokens generated" line ever appears again
After this point GPU memory stays fully allocated (model loaded) but GPU utilization sits at ~0%, i.e. the generator is idle/wedged rather than busy.
### Additional context
- exllamav3 commit `a03f0ff` ("AsyncGenerator: Ensure cancel request is forwarded but don't crash if frontend breaks contract", 2026-06-12) **is already included in 0.0.42**, yet this still occurs — so that fix does not fully cover this path.
- exllamav3 `0.0.43` contains only quantization/MTP/conversion changes; nothing in the generator/cancel path.
- Looks related to (but not fixed by) the older closed issues #81 "Error when aborting mid-generation with SSE enabled" and #98 "generate_chat_completion not set abort_event".
- The proxy in front of tabbyAPI uses no client timeout (`ClientTimeout(total=None)`), so the disconnect originates from end clients, not the proxy.
- Reproduced on Python **3.11.15** specifically (not 3.12 — cf. #416). The hang is in the generator abort path, not version-specific.
- **Workaround in use:** a watchdog that sends a tiny real chat-completion probe and restarts the service when generation stops responding for two consecutive checks (port checks are insufficient because `/v1/models` stays up).
### Acknowledgements
- [x] I have looked for similar issues before submitting this one.
- [x] I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- [x] I understand that the developers have lives and my issue will be answered when possible.
- [x] I understand the developers of this program are human, and I will ask my questions politely.
OS
Windows
GPU Library
CUDA 12.x
Python version
3.11
Describe the bug
When an SSE/streaming chat request is disconnected by the client mid-generation, the generator appears to deadlock. From that moment on no request ever completes again (no further
tokens generatedlines), even though new requests are still accepted and logged with HTTP 200.Critically,
GET /v1/modelskeeps returning200, so port/HTTP liveness checks and reverse proxies do not detect the outage — the server looks healthy but produces zero tokens, with GPU memory still fully allocated and GPU utilization at ~0%. Only a full restart recovers it.This makes the failure hard to detect and recurrent behind a proxy / Open WebUI, where clients routinely cancel or navigate away mid-generation.
Reproduction steps
max_batch_size: 6) and a slow-ish model (~14 tok/s here). Relevantconfig.yml:Expected behavior