Scale telemetry: SoC temp, weight-stall watchdog, reset reason #57

Open
skialpine wants to merge 7 commits into decentespresso:main from skialpine:feat/scale-telemetry

@skialpine
Contributor

Summary

Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load". The only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), which points at a thermal/analog failure rather than firmware state. This PR adds telemetry to confirm or rule that out, with no behavior change to the weight/WiFi/BLE paths.

New fields in the /snapshot WS status frame (and serial logs):

  • soc_temp_c / soc_temp_max_c — live + peak ESP32-S3 die temp (temperatureRead()), sampled every 2 s.
  • weight_stalled / stall_count / last_stall_ms / last_stall_temp_c — a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed >8 s (a live cell dithers every sample), recording the die temp at failure to correlate stalls with heat. Skipped during the deliberate ADC power-cycle recovery; throttled to 250 ms.
  • reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable.

Plus tools/thermal_load_test.sh: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).

Threading: new cross-task scalars are volatile per CLAUDE.md; the status frame only reads them.
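
For readers who want the watchdog logic in one place, here is a minimal host-side sketch of the behavior described above: flag a stall when the raw value has not changed for more than 8 s, check at most every 250 ms, and skip (then re-seed) while an ADC recovery is active. It is not the code in pureScale(). The struct, its member names, and the choice to clear the stalled flag when the raw value moves again are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical sketch; names and the "clear on change" behavior are assumptions.
struct StallWatchdog {
    static constexpr uint32_t kCheckIntervalMs  = 250;   // throttle interval from the PR text
    static constexpr uint32_t kStallThresholdMs = 8000;  // "frozen/railed >8 s"

    int32_t  lastRaw        = 0;
    uint32_t lastChangeMs   = 0;     // 0 is the "window needs re-seeding" sentinel
    uint32_t lastCheckMs    = 0;
    bool     stalled        = false;
    uint32_t stallCount     = 0;
    uint32_t lastStallMs    = 0;     // 0 means no stall has been recorded
    float    lastStallTempC = 0.0f;  // meaningful only when lastStallMs != 0

    // 0 is reserved as a sentinel, so never store it as a real timestamp.
    static uint32_t nonZero(uint32_t ms) { return ms == 0 ? 1u : ms; }

    void update(uint32_t nowMs, int32_t raw, bool recoveryActive, float dieTempC) {
        if (recoveryActive) {
            // The raw value is frozen by design during the ADC power-cycle:
            // skip the check and force a re-seed once sampling resumes.
            lastChangeMs = 0;
            return;
        }
        // Unsigned subtraction stays correct across a millisecond-counter rollover.
        if (nowMs - lastCheckMs < kCheckIntervalMs) return;
        lastCheckMs = nowMs;

        if (lastChangeMs == 0 || raw != lastRaw) {
            // First sample after a re-seed, or the cell dithered: restart the window.
            lastRaw      = raw;
            lastChangeMs = nonZero(nowMs);
            stalled      = false;
            return;
        }
        if (!stalled && nowMs - lastChangeMs > kStallThresholdMs) {
            stalled        = true;
            ++stallCount;
            lastStallMs    = nonZero(nowMs);
            lastStallTempC = dieTempC;  // die temperature at the moment of the stall
        }
    }
};
```

A railed ADC reads as a constant full-scale value, so the same "unchanged for 8 s" test covers both the frozen and the railed case in this sketch.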

Test plan

  • Builds for esp32s3; flashed; status frame reports the new fields.
  • 1-hour multi-protocol soak to capture peak temp + any stall and the die temp at which it occurs.

🤖 Generated with Claude Code

skialpine and others added 3 commits May 25, 2026 12:24
Diagnostics for the field "weight stops being collected under sustained load"
reports (suspected thermal). Adds to the WS status frame and serial logs:
- soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature
  (temperatureRead()), sampled every 2s on the main loop.
- weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog
  in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s
  (a live cell dithers every sample), recording the die temp at the moment of
  the stall to correlate failures with heat. (This failure is not firmware-
  recoverable, so it's surfaced, not silently retried.)
- reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT
  reset is attributable instead of looking like a clean power-on.

Telemetry-only; no behavior change to the weight/WiFi/BLE paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nitor)

Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and
polls the new temp/stall telemetry every ~60s, watching for the weight-stall
failure and the die temp at which it occurs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review follow-ups on the telemetry watchdog:
- Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle
  freezes the raw value by design); re-seed the window on resume so a genuine
  signal-timeout recovery isn't miscounted as a railed/frozen stall.
- Check every 250 ms instead of every loop iteration -- the ADC only produces
  ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt +
  dataset passes) just burns CPU/heat, which is counterproductive on the chip
  we're trying to characterize.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@skialpine
Contributor Author

Code review

Reviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit 8e345d9):

  • Stall watchdog false-trip during ADC recovery — the watchdog read getDebugInfo().rawValue, which is frozen by definition during the firmware's own powerDown()/powerUp() recovery, so a genuine signal-timeout recovery would be miscounted as a railed/frozen stall (corrupting the very metric this adds). Fixed: skip the check while b_adc_recovery_active and re-seed the window on resume.
  • Also (cost): it ran getDebugInfo() (which does a sqrt + dataset passes) every loop iteration though only rawValue is used — wasteful on a chip we're characterizing for heat. Now throttled to 250 ms (the ADC only samples ~10/s).

Verified clean: printf format/arg pairing in both status frames; cross-task reads are volatile (benign torn read only); temperatureRead()/resetReasonStr() safe; at-rest false-positive risk is low (24-bit raw at SAMPLES=1 dithers every sample, so 8 s of byte-identical raw is a genuine freeze).

🤖 Generated with Claude Code

skialpine and others added 4 commits May 25, 2026 12:39
From the toolkit review of PR decentespresso#57:
- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON
  and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the
  AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop
  keeps re-seeding the stall window so weight_stalled may never trip -- the
  climbing recovery count makes that failure mode visible. i_adc_recovery_count
  is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons
  (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording +
  recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
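
The numeric fallback for unmapped reset reasons can be sketched as below. This is a host-side illustration only: `ResetCode`, its numeric values, the label strings, and `resetReasonLabel` are hypothetical stand-ins. On the device the switch would be over the ESP-IDF `esp_reset_reason_t` constants instead.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for a handful of reset codes; the values are illustrative.
enum ResetCode : int {
    kPowerOn  = 1,
    kSoftware = 3,
    kPanic    = 4,
    kTaskWdt  = 6,
    kBrownout = 9,
};

// Maps a reset code to a label. Any code without a named case falls through
// to "unknown_<code>", so the number still reaches the log or status frame.
std::string resetReasonLabel(int code) {
    switch (code) {
        case kPowerOn:  return "poweron";
        case kSoftware: return "software";
        case kPanic:    return "panic";
        case kTaskWdt:  return "task_wdt";
        case kBrownout: return "brownout";
        default: {
            char buf[24];
            std::snprintf(buf, sizeof buf, "unknown_%d", code);
            return std::string(buf);
        }
    }
}
```

The point of the default branch is that a reset reason added in a later IDF release stays distinguishable from a genuinely unknown one, without needing a firmware update to name it.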
- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port.
- Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full
  status interval after (re)connect, tracks peak temp / stalls / recoveries /
  reboots across the whole run (so a firmware reset doesn't lose the peak), and
  prints a SUMMARY line with a PASS/FAIL verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the iteration-2 review findings on PR decentespresso#57:

- Status frame no longer reads the multi-field stopWatch object directly off
  the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The
  loop task now snapshots it into aligned volatiles (g_timerRunning/
  g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a
  perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps
  counting truthfully over a long soak instead of saturating; update the WS
  format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed
  sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status
  frames or a crashed load generator, and exit non-zero on FAIL so it works as
  a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
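
As an illustration of the SoC temperature guard, the sketch below keeps the last valid reading and peak when a sample is not finite. `SocTempTracker` and its members are hypothetical names, not the firmware's globals, and the "log once" step is reduced to a flag. The reason the guard matters for the peak: once a NaN is stored as the maximum, every later `reading > max` comparison is false, so the peak would never move again.

```cpp
#include <cmath>

// Hypothetical sketch of a finite-only temperature tracker.
struct SocTempTracker {
    float lastC      = 0.0f;
    float maxC       = 0.0f;
    bool  haveSample = false;  // false until the first valid reading arrives
    bool  warned     = false;  // stands in for "log once" in the firmware

    // Returns true if the reading was accepted, false if it was rejected.
    bool sample(float readingC) {
        if (!std::isfinite(readingC)) {  // rejects NaN and +/-inf
            warned = true;
            return false;                // keep the last valid live and peak values
        }
        lastC = readingC;
        if (!haveSample || readingC > maxC) maxC = readingC;
        haveSample = true;
        return true;
    }
};
```

Using `std::isfinite` rather than a NaN-only test also keeps infinities out of the serialized frame, since JSON has no representation for either.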
- thermal_load_test.sh: close the silent-PASS holes a review found. A
  flapping wedge (scale answers one frame per reconnect, resetting the
  consecutive-miss streak) now fails via a cumulative total_no_status counter,
  not just max_no_status_streak. Each load generator's PID is captured and
  waited on individually so a never-started/crashed driver (non-zero exit)
  fails the run instead of being missed by a Traceback-only grep. A run that
  never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since
  the thermal data the test exists to capture is absent.
- CLAUDE.md: add "Fixing bugs you find along the way" — pre-existing bugs get
  fixed in the same change, not deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
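
The verdict rules in this commit can be summarized in a short sketch. The real logic lives in the shell script. This version is written in C++ only to match the firmware, and the struct names, field names, and threshold values are assumptions. Only the four failure conditions and the -999 sentinel come from the commit message.

```cpp
#include <cstdint>

constexpr double kNoTempSeen = -999.0;  // sentinel named in the commit message

// Hypothetical summary of one soak run.
struct RunStats {
    double   peakTempC         = kNoTempSeen;  // stays at the sentinel if no temp was seen
    uint32_t maxNoStatusStreak = 0;            // longest run of consecutive missed polls
    uint32_t totalNoStatus     = 0;            // cumulative missed polls over the whole run
    int      worstDriverExit   = 0;            // first non-zero exit status of any load generator
};

// Hypothetical thresholds; the script's actual limits are not stated in the PR.
struct Limits {
    uint32_t maxStreak;
    uint32_t maxTotal;
};

bool runPassed(const RunStats& s, const Limits& lim) {
    if (s.worstDriverExit != 0)             return false;  // crashed or never-started driver
    if (s.peakTempC == kNoTempSeen)         return false;  // thermal data never captured
    if (s.maxNoStatusStreak > lim.maxStreak) return false; // sustained loss of status frames
    if (s.totalNoStatus > lim.maxTotal)     return false;  // flapping wedge: many short gaps
    return true;
}
```

The cumulative counter is what catches the flapping case: a scale that answers one frame per reconnect keeps the streak at 1 indefinitely, so only the total exposes it.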