pico_cyw43_driver: cybt: re-read transient corrupt ring index#3027
pico_cyw43_driver: cybt: re-read transient corrupt ring index#3027beriberikix wants to merge 1 commit into
Conversation
cybt_get_bt_buf_index() asserts when the BT controller's bt2host / host2bt ring indices read back out of range over the gSPI backplane. Under transient bus contention these indices can read corrupt for a single access, so the assert turns a recoverable glitch into a system abort. Re-read the index up to CYBT_BUF_INDEX_REREAD_MAX (8) times and, if it still does not validate, return CYBT_ERR_HCI_READ_FAILED so the caller can drop the cycle instead of aborting. Reproduced on a CYW43439 (Pico 2 W) under concurrent WiFi+BT load: the assert fired at boot and under sustained traffic; with the re-read the controller recovers and BT stays up. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Jonathan Beri <jmberi@gmail.com>
|
How easy is it to reproduce this? |
Reliably, under concurrent WiFi + BT load on the shared gSPI bus — it's load-proportional rather than a rare race. In my soak testing (a BLE connection streaming notifications while WiFi runs a continuous ping load) roughly a third of connection cycles hit the assert before this change, anywhere from a few seconds to ~19 min in. The key detail from the gdb backtrace: it fires even with the cyw43 bus mutex held across the whole BT read. So it isn't a host-side locking gap — the ring index reads back transiently out-of-range at/below the gSPI transport itself (looks like F1-overflow / controller-DMA contention on the BT backplane read while WLAN is active). The corruption doesn't survive a re-read, which is why fixing it at the read site works where more locking didn't. After the change the same workload ran a 2-hour WiFi+BLE soak with zero faults. One caveat: I hit this on a multi-threaded Zephyr port of the driver, where WLAN and BT bus access run as separate threads — more interleaving than the pico-sdk I'm happy to share the gdb backtrace and soak logs if useful. |
The BT data is routed via WiFi. I believe it was done this way to support SPI on Pico devices. So I suspect you need to be careful to ensure there's no concurrent use of WiFi and BT. Is there any locking to prevent this? |
|
Good question, and you're right about the architecture — BT rides the cyw43 backplane and the cyw43 bus mutex is the serialization point. Yes, I do lock it: in my port a shared recursive bus lock is held across each whole BT read-modify-write, so WiFi and BT don't interleave at the host level. Two things I found that are relevant here:
So: locking is necessary (and I'd argue 1. above is a real gap worth fixing), but in my testing it wasn't sufficient on its own — hence this belt-and-braces re-read, which is a no-op on the normal path. |
|
Yes, cyw43_bluetooth_hci_read and cyw43_bluetooth_hci_write should call CYW43_THREAD_ENTER / CYW43_THREAD_EXIT. We always run Wifi and BTStack in the same "scheduler" so presumably never hit a problem. This does suggest a issue with the firmware on the 43439. It would be nice to have a way to reproduce it (ideally just with pico-sdk). But it's unlikely we'd get that fixed quickly - if at all - so your workaround is probably a good idea if it works. |
|
That matches what I see exactly — in a single shared scheduler the mutex is effectively always held and WLAN/BT never preempt mid-transfer, so the window never opens. It's the threaded port (separate WLAN and BT threads) that exposes it, and the missing Two offers to take the load off your side:
And agreed on the firmware — since that's unlikely to change soon, the re-read is the pragmatic cover, and it's a no-op on the normal path so it costs the async_context users nothing even though they'll likely never hit the corruption. |
|
I can try and reproduce it sometime. Feel free to raise a PR on the SDK. |
|
Thanks — I'll raise the SDK PR for the And yes — full transparency: this was done with an AI assistant (Claude Code) in the loop. It helped drive the gdb sessions, characterize the fault, and draft these write-ups. But every claim here is from real hardware, not generated: the backtraces, the For fuller context, this came out of a downstream Zephyr effort to bring up WiFi+BLE coexistence on the CYW43439 over gSPI — this re-read is one of four transport-level fixes. The RFC tracks the whole thing: zephyrproject-rtos/zephyr#111811 |
|
Raised it: georgerobotics/cyw43-driver#153. One note on placement — since |
|
Hardware-validated the companion locking change (georgerobotics/cyw43-driver#153) on a Pico 2 W (RP2350): a patched georgerobotics-backend build brings up BT over the HCI path and holds a 60s BLE notification stream (560 notifications, 0 stalls, no disconnect) under concurrent bidirectional WiFi load, board alive afterward — no fault/assert. Stock pico-sdk + BTstack is unaffected (its transport already wraps these calls); this just makes the functions self-serialising for direct callers. |
What
cybt_get_bt_buf_index()reads the BT controller'shost2bt/bt2hostshared-memory ring indices over the gSPI backplane and
assert()s if anyindex is out of range. The gSPI bus is shared with WiFi traffic, so under
bus contention a single read can return a transiently corrupt (out-of-range)
value. The assert then turns a recoverable glitch into a full system abort.
Change
Re-read the index up to
CYBT_BUF_INDEX_REREAD_MAX(8) times — thecorruption does not persist — and only if it still does not validate return
CYBT_ERR_HCI_READ_FAILED, so the caller drops the cycle gracefully insteadof aborting. No new dependencies; behaviour is unchanged on a valid read.
Testing
Reproduced on a CYW43439 (Raspberry Pi Pico 2 W) under concurrent WiFi+BT
load: the assert fired both at boot and under sustained traffic. With the
re-read the controller recovers and BT stays up across a 2-hour zero-fault
WiFi+BLE coexistence soak.
This is a transport-level robustness fix that benefits all
cybtusers, notjust coexistence.