Skip to content

pico_cyw43_driver: cybt: re-read transient corrupt ring index#3027

Open
beriberikix wants to merge 1 commit into
raspberrypi:developfrom
beriberikix:cybt-reread-corrupt-index
Open

pico_cyw43_driver: cybt: re-read transient corrupt ring index#3027
beriberikix wants to merge 1 commit into
raspberrypi:developfrom
beriberikix:cybt-reread-corrupt-index

Conversation

@beriberikix

Copy link
Copy Markdown

What

cybt_get_bt_buf_index() reads the BT controller's host2bt / bt2host
shared-memory ring indices over the gSPI backplane and assert()s if any
index is out of range. The gSPI bus is shared with WiFi traffic, so under
bus contention a single read can return a transiently corrupt (out-of-range)
value. The assert then turns a recoverable glitch into a full system abort.

Change

Re-read the index up to CYBT_BUF_INDEX_REREAD_MAX (8) times — the
corruption does not persist — and only if it still does not validate return
CYBT_ERR_HCI_READ_FAILED, so the caller drops the cycle gracefully instead
of aborting. No new dependencies; behaviour is unchanged on a valid read.

Testing

Reproduced on a CYW43439 (Raspberry Pi Pico 2 W) under concurrent WiFi+BT
load: the assert fired both at boot and under sustained traffic. With the
re-read the controller recovers and BT stays up across a 2-hour zero-fault
WiFi+BLE coexistence soak.

This is a transport-level robustness fix that benefits all cybt users, not
just coexistence.

cybt_get_bt_buf_index() asserts when the BT controller's bt2host /
host2bt ring indices read back out of range over the gSPI backplane.
Under transient bus contention these indices can read corrupt for a
single access, so the assert turns a recoverable glitch into a system
abort. Re-read the index up to CYBT_BUF_INDEX_REREAD_MAX (8) times and,
if it still does not validate, return CYBT_ERR_HCI_READ_FAILED so the
caller can drop the cycle instead of aborting.

Reproduced on a CYW43439 (Pico 2 W) under concurrent WiFi+BT load: the
assert fired at boot and under sustained traffic; with the re-read the
controller recovers and BT stays up.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jonathan Beri <jmberi@gmail.com>
@peterharperuk

Copy link
Copy Markdown
Contributor

How easy is it to reproduce this?

@beriberikix

Copy link
Copy Markdown
Author

How easy is it to reproduce this?

Reliably, under concurrent WiFi + BT load on the shared gSPI bus — it's load-proportional rather than a rare race. In my soak testing (a BLE connection streaming notifications while WiFi runs a continuous ping load) roughly a third of connection cycles hit the assert before this change, anywhere from a few seconds to ~19 min in.

The key detail from the gdb backtrace: it fires even with the cyw43 bus mutex held across the whole BT read. So it isn't a host-side locking gap — the ring index reads back transiently out-of-range at/below the gSPI transport itself (looks like F1-overflow / controller-DMA contention on the BT backplane read while WLAN is active). The corruption doesn't survive a re-read, which is why fixing it at the read site works where more locking didn't. After the change the same workload ran a 2-hour WiFi+BLE soak with zero faults.

One caveat: I hit this on a multi-threaded Zephyr port of the driver, where WLAN and BT bus access run as separate threads — more interleaving than the pico-sdk async_context model. So it may be harder to trigger on stock pico-sdk + BTstack. But since the change is behaviour-neutral on a valid read (the first-read success path is unchanged), it's pure downside protection regardless of how often it triggers.

I'm happy to share the gdb backtrace and soak logs if useful.

@peterharperuk

Copy link
Copy Markdown
Contributor

where WLAN and BT bus access run as separate threads

The BT data is routed via WiFi. I believe it was done this way to support SPI on Pico devices. So I suspect you need to be careful to ensure there's no concurrent use of WiFi and BT. Is there any locking to prevent this?

@beriberikix

beriberikix commented Jun 24, 2026

Copy link
Copy Markdown
Author

Good question, and you're right about the architecture — BT rides the cyw43 backplane and the cyw43 bus mutex is the serialization point. Yes, I do lock it: in my port a shared recursive bus lock is held across each whole BT read-modify-write, so WiFi and BT don't interleave at the host level.

Two things I found that are relevant here:

  1. While tracing this I hit a genuine locking gap in the stock integration: cyw43_bluetooth_hci_write() is the one bus path that isn't wrapped in CYW43_THREAD_ENTER, so a BT HCI TX (e.g. a notification ACL) can drive SPI outside the lock, concurrent with the poll thread — and the cyw43 bus code yields/sleeps mid-transfer, so they do interleave. Closing that removed a whole class of these faults. It might be worth a look in pico-sdk independently of this PR.

  2. But the case this PR covers survived even with locking complete. gdb caught the assert on the BT TX thread with the cyw43 bus mutex held (lock_count=1) — i.e. no concurrent host-side bus access at that moment — and the ring index still read back out of range. So my read is that the glitch is below the host lock: a transient on the gSPI backplane read of the controller's shared memory (looks like F1-overflow / on-chip DMA contention), not something the mutex can prevent. That's why I went with a re-read at the read site rather than more locking.

So: locking is necessary (and I'd argue 1. above is a real gap worth fixing), but in my testing it wasn't sufficient on its own — hence this belt-and-braces re-read, which is a no-op on the normal path.

@peterharperuk

Copy link
Copy Markdown
Contributor

Yes, cyw43_bluetooth_hci_read and cyw43_bluetooth_hci_write should call CYW43_THREAD_ENTER / CYW43_THREAD_EXIT. We always run Wifi and BTStack in the same "scheduler" so presumably never hit a problem.

This does suggest a issue with the firmware on the 43439. It would be nice to have a way to reproduce it (ideally just with pico-sdk). But it's unlikely we'd get that fixed quickly - if at all - so your workaround is probably a good idea if it works.

@beriberikix

Copy link
Copy Markdown
Author

That matches what I see exactly — in a single shared scheduler the mutex is effectively always held and WLAN/BT never preempt mid-transfer, so the window never opens. It's the threaded port (separate WLAN and BT threads) that exposes it, and the missing CYW43_THREAD_ENTER is what makes the threaded case unsafe until it's wrapped.

Two offers to take the load off your side:

  1. I'm happy to send a small separate PR adding CYW43_THREAD_ENTER/CYW43_THREAD_EXIT to cyw43_bluetooth_hci_read/write. It's a no-op for the async_context model and just makes those two functions safe for non-cooperative integrations.

  2. On a pico-sdk-only repro: I can try, but I'll be upfront that all my evidence is from the threaded port, and in the single-scheduler model it may genuinely never fire (or only on a true firmware-side backplane glitch, which would be rare). If it'd help your decision, I'd point a stock picow_bt example at heavy concurrent WLAN throughput (e.g. iperf) while BT streams, run it under SWD, and report back either way.

And agreed on the firmware — since that's unlikely to change soon, the re-read is the pragmatic cover, and it's a no-op on the normal path so it costs the async_context users nothing even though they'll likely never hit the corruption.

@peterharperuk

Copy link
Copy Markdown
Contributor

I can try and reproduce it sometime. Feel free to raise a PR on the SDK.
BTW: Was this analysis by an AI? Not that it matters - just interested.

@beriberikix

Copy link
Copy Markdown
Author

Thanks — I'll raise the SDK PR for the CYW43_THREAD_ENTER/CYW43_THREAD_EXIT wrapping shortly.

And yes — full transparency: this was done with an AI assistant (Claude Code) in the loop. It helped drive the gdb sessions, characterize the fault, and draft these write-ups. But every claim here is from real hardware, not generated: the backtraces, the lock_count=1 capture at the fault, and the soak runs are all actual captures on a Pico 2 W. Happy to share any of the raw logs.

For fuller context, this came out of a downstream Zephyr effort to bring up WiFi+BLE coexistence on the CYW43439 over gSPI — this re-read is one of four transport-level fixes. The RFC tracks the whole thing: zephyrproject-rtos/zephyr#111811

@beriberikix

Copy link
Copy Markdown
Author

Raised it: georgerobotics/cyw43-driver#153.

One note on placement — since cyw43_bluetooth_hci_read/write are defined in the cyw43-driver submodule rather than the SDK tree, I put the PR there. The SDK's own btstack transport already wraps both calls in CYW43_THREAD_ENTER/EXIT, so stock pico-sdk + BTstack is unaffected; this just makes the functions self-serialising for direct API callers (non-btstack / multi-threaded ports). Happy to move or reshape it if you'd rather handle it SDK-side.

@beriberikix

Copy link
Copy Markdown
Author

Hardware-validated the companion locking change (georgerobotics/cyw43-driver#153) on a Pico 2 W (RP2350): a patched georgerobotics-backend build brings up BT over the HCI path and holds a 60s BLE notification stream (560 notifications, 0 stalls, no disconnect) under concurrent bidirectional WiFi load, board alive afterward — no fault/assert. Stock pico-sdk + BTstack is unaffected (its transport already wraps these calls); this just makes the functions self-serialising for direct callers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants