fix: prevent WS-broadcast OOM crash under connection churn #58

Merged
tadelv merged 1 commit into decentespresso:main from skialpine:fix/ws-oom-only
May 26, 2026
Conversation

@skialpine
Contributor

Independent of PR #57 (the scale-telemetry PR). They share two files (include/websocket.h, CLAUDE.md) but the changes don't overlap semantically — both can land in either order. This PR carries the bug fix alone, against main.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds with -fno-exceptions, so the throw goes to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure, not a thermal one (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)

The existing 15 KB heap watchdog (wifi_setup.cpp) can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation bad_allocs long before it acts.
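To make the timing gap concrete, here is a rough back-of-the-envelope sketch rather than code from this PR. It counts how many broadcast allocations can be attempted while the watchdog is still waiting. The 10 Hz rate, 2 s debounce and 60 s BLE deferral come from the description above; the function names and the client count used in the usage note are hypothetical.

```cpp
#include <cstdint>

// Figures taken from the PR description.
constexpr std::uint32_t kBroadcastHz = 10;     // weight broadcast rate
constexpr std::uint32_t kDebounceMs  = 2000;   // watchdog debounce
constexpr std::uint32_t kBleDeferMs  = 60000;  // max reboot deferral while BLE is connected

// Number of broadcast ticks that fit into a window of the given length.
constexpr std::uint32_t broadcastsWithin(std::uint32_t windowMs) {
    return windowMs * kBroadcastHz / 1000;
}

// Each broadcast allocates one message per connected client, so the number of
// allocation attempts scales with the client count (a hypothetical parameter).
constexpr std::uint32_t allocationAttempts(std::uint32_t windowMs,
                                           std::uint32_t clients) {
    return broadcastsWithin(windowMs) * clients;
}

static_assert(broadcastsWithin(kDebounceMs) == 20,
              "20 broadcasts fit inside the 2 s debounce");
static_assert(broadcastsWithin(kBleDeferMs) == 600,
              "600 broadcasts fit inside the 60 s BLE deferral");
```

With, say, 4 connected clients (an assumed number), `allocationAttempts(kDebounceMs, 4)` is 80: the heap gets 80 chances to fail an allocation before the debounce alone has elapsed.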

Fix (38 lines added, 0 removed)

  • wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
  • -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
  • CLAUDE.md: documented the footgun (notes + troubleshooting table).
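The heap-floor gate described in the first bullet can be sketched roughly as follows. This is a host-compilable illustration, not the code in include/websocket.h: `freeHeapBytes()` is a hypothetical stand-in for the platform's free-heap query, `sendWeightAllSketch()` stands in for the real broadcast helper, and the 25000-byte floor is taken from the log line quoted in the verification section.

```cpp
#include <cstddef>
#include <cstdio>

// Assumption: 25000 bytes, matching "[ws] low heap 17736 < 25000" in the PR.
constexpr std::size_t WS_BROADCAST_HEAP_FLOOR = 25000;

// Hypothetical stand-in for the platform free-heap query, so the sketch
// runs on a host. Firmware would read the real heap statistics instead.
static std::size_t g_fakeFreeHeap = 60000;
static std::size_t freeHeapBytes() { return g_fakeFreeHeap; }

// Returns true when there is enough headroom to allocate broadcast frames.
static bool wsBroadcastHeapOk() {
    const std::size_t freeHeap = freeHeapBytes();
    if (freeHeap < WS_BROADCAST_HEAP_FLOOR) {
        std::printf("[ws] low heap %zu < %zu -> skip broadcast\n",
                    freeHeap, WS_BROADCAST_HEAP_FLOOR);
        return false;
    }
    return true;
}

static int g_framesSent = 0;

// Hypothetical broadcast helper: the gate runs before any per-client
// allocation, so a low heap drops the frame instead of aborting.
static bool sendWeightAllSketch(float grams) {
    if (!wsBroadcastHeapOk()) {
        return false;  // frame skipped; the next broadcast tick retries
    }
    (void)grams;       // real code would format and queue the frame here
    ++g_framesSent;
    return true;
}
```

The design choice worth noting is that the check happens before allocation rather than after a failed one: with exceptions disabled, a failed `operator new` cannot be recovered from, so the only safe point to back off is ahead of the call.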

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

  • Free heap driven to 6436 bytes (old build crashed at ~4684).
  • Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
  • No abort, no reboot, weight stream uninterrupted (uptime continuous through 3500+ churn cycles).

Separately, a 2-hour sustained soak (full multi-protocol load) on this fix: 0 stalls, 0 reboots, 0 lost frames, peak SoC 50.3 °C.

🤖 Generated with Claude Code

Root-caused from a captured panic backtrace: under sustained multi-client
WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap
collapses and AsyncWebSocket's printfAll path allocates an
AsyncWebSocketMessage per client -> operator new throws std::bad_alloc ->
(Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot.
That OOM-reboot is the "weight stops being collected under load" failure
(not thermal -- die temp was 33 C). Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)  <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button,
  power-off) with wsBroadcastHeapOk(): skip the frame when free heap is
  below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog)
  instead of allocating into an exhausted heap and crashing. Dropping a
  frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib
  default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Verified on hardware: under the exact crashing load (conn_churn --rst 8x8
+ 10 Hz WS + mDNS + BT) free heap was driven to 6436 bytes (old build
died at 4684), the gate engaged ([ws] low heap 17736 < 25000 -> skip
broadcast), and there was no abort, no reboot, weight stream continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tadelv tadelv merged commit adac24a into decentespresso:main May 26, 2026