fix: 3 c6in.metal bench regressions (AF_XDP fall-back, DPDK ring size, EAL argv split) by skullcrushercmd · Pull Request #5 · AnyVM-Tech/anyscan-engine-c

skullcrushercmd · 2026-04-28T22:55:30Z

Summary

Three engine-side regressions surfaced by the anygpt-52 c6in.metal AWS bench (see AnyVM-Tech/AnyGPT#65 (comment)) — both --io-engine=af_xdp and --io-engine=dpdk were untestable on c6in.metal until these landed. Fixes are split across three focused commits, each with unit / structural test coverage.

Bug A — AF_XDP fall-back segfault (`fix(afxdp): full per-attempt teardown in bind-mode ladder`)

src/send-afxdp.c::afxdp_try_bind did not tear down libxdp state between bind-mode attempts. On AWS ENA the ZEROCOPY attempt returns -EOPNOTSUPP from bind() after libxdp has already attached its xsks_map redirect program in DRV mode; the next attempt's xsk_socket__create then segfaults inside libxdp's program-management layer.

Repro: scanner -i ens1 --io-engine=af_xdp -T 4 -R 4 198.18.0.0/24 1-1024 — segfault on every run.

Fix: new afxdp_full_teardown_after_failed_bind helper that releases xsk_socket__delete + xsk_umem__delete + bpf_xdp_detach (mode=0 to clear regardless of DRV/SKB attach) + zeros all per-thread state. UMEM allocation moved inside afxdp_try_bind so each attempt fully reconstructs UMEM + socket + program attachment.

Coverage: tests/test_afxdp_teardown.sh — 10 structural assertions on src/send-afxdp.c. Verified RED on HEAD~3:src/send-afxdp.c (9 of 10 assertions fail), GREEN on the fix.

Bug B — DPDK `nb_tx_desc=1024` exceeds ENA `nb_max=512` (`fix(dpdk): clamp TX/RX descriptor ring size against PMD nb_max`)

src/dpdk-eal.c hardcoded ANYSCAN_DPDK_TX_RING_SIZE=1024 when calling rte_eth_tx_queue_setup. AWS ENA reports tx_desc_lim.nb_max=512:

Invalid value for nb_tx_desc(=1024), should be: <= 512, >= 128, ...
rte_eth_tx_queue_setup(port=0, q=0) failed: Invalid argument

Fix: new pure helper dpdk_clamp_ring_size(requested, dev_max, dev_min, dev_align) in src/dpdk-ring-clamp.c that honors all three constraints from struct rte_eth_desc_lim. dpdk_eal_bringup calls it on dev_info.{tx,rx}_desc_lim from the existing rte_eth_dev_info_get and logs the clamp at info level. The cap is documented in include/dpdk-defs.h.

Coverage: tests/test_dpdk_clamp.c — 7 unit tests covering the bench repro (1024→512), in-range no-clamp, below-min bump-up, no-limits passthrough, alignment round-down, and alignment round-up-when-needed.

Bug C — EAL argv splitter mangles `--socket-mem 1024` (`fix(dpdk): synthesize stable EAL prog-name; never inject argv[0]`)

src/main.c previously copied argv[0] (scanner binary path) into eal_argv[0]. rte_eal_init may rewrite eal_argv slots as it consumes EAL flags, and the failure-path log printed eal_argv after that rewrite — so the trailing --socket-mem 1024 value token surfaced as --socket-mem scanner in the EAL log.

Fix: extract split_argv_on_dash_dash into its own TU (src/eal-argv-split.c + include/eal-argv-split.h) so it is unit-testable. eal_argv[0] is now the synthesized stable literal \"anyscan-dpdk\" (matching the existing default-argv fallback in dpdk_eal_bringup). Tokens between -- and end-of-argv are preserved strictly, in order.

Coverage: tests/test_eal_argv_split.c — 4 unit tests including the --socket-mem 1024 round-trip repro that asserts argv[0] does not appear anywhere in eal_argv.

What's NOT in this PR

No live c6in.metal bench rerun — that lives in the AnyGPT/anyscan side of the repo (anygpt-52). Once these merge and the worker bundle picks up the new engine commit, the bench can re-run with all three io_engines.

Test plan

make (default build) → builds clean, no new warnings.
make USE_AF_XDP=1 → builds clean.
make USE_DPDK=1 → builds clean.
make USE_AF_XDP=1 USE_DPDK=1 → builds clean.
make USE_AF_XDP=1 USE_DPDK=1 test → 11/11 io_engine_dispatch + 8/8 dpdk_dispatch + 3/3 unit tests pass.
Each fix's test verified RED on the pre-fix code path before being driven to GREEN by the patch (per-commit notes in messages).
Live c6in.metal bench (separate worker; tracked in the AnyGPT-side anygpt-52 follow-up).

Refs: AnyVM-Tech/AnyGPT#65 (comment) AnyVM-Tech/AnyGPT#65 (comment)

🤖 Generated with Claude Code

Bug C from anygpt-52 c6in.metal bench (PR 65 issuecomment-4339242358): the EAL argv splitter previously copied argv[0] (scanner binary path) into eal_argv[0]. rte_eal_init may rewrite eal_argv slots as it consumes EAL flags, and the scanner's failure-path log printed eal_argv after that rewrite — so scanner --io-engine=dpdk … -- --file-prefix=foo --socket-mem 1024 surfaced in the EAL log as EAL argv was: scanner --file-prefix=foo --socket-mem scanner — the trailing 1024 token was clobbered by the scanner program path threading through rte_eal_init's argv mutation. Fix: - Extract split_argv_on_dash_dash from src/main.c into its own TU (src/eal-argv-split.c + include/eal-argv-split.h) so the splitter can be unit-tested without dragging the rest of the scanner in. - eal_argv[0] is now a synthesized stable literal "anyscan-dpdk" — same shape the existing default-argv fallback in dpdk_eal_bringup uses, so the EAL prog-name slot is identical between the two paths and independent of argv lifetime. - Token order between separators is preserved strictly: eal_argv[1..N] are exactly the tokens between `--` and end-of-argv, in order, no replacement. Test (tests/test_eal_argv_split.c, run via make unit-tests): - test_no_dash_dash: no `--` → eal_argc=0, eal_argv=NULL. - test_socket_mem_round_trip: reproduces the bench regression and asserts eal_argv = [anyscan-dpdk, --file-prefix=foo, --socket-mem, 1024, NULL] with argv[0] not appearing anywhere in eal_argv. - test_dash_dash_with_no_eal_args: synthesized prog-name still present even with no user EAL tokens. - test_realistic_eal_argv_shape: -l 0-7 --socket-mem 1024 --file-prefix=run42 --no-shconf round-trips intact. Verified RED: temporarily reverting eal_argv[0] = argv[0] makes 4 assertions fail (`expected 'anyscan-dpdk', got 'scanner'`); restoring the fix returns all 4 to GREEN. Refs: AnyVM-Tech/AnyGPT#65 issuecomment-4339242358 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bug B from anygpt-52 c6in.metal bench (PR 65 issuecomment-4339242358): src/dpdk-eal.c hardcoded nb_tx_desc=ANYSCAN_DPDK_TX_RING_SIZE=1024 when calling rte_eth_tx_queue_setup. AWS ENA reports tx_desc_lim.nb_max=512, so the call returned Invalid value for nb_tx_desc(=1024), should be: <= 512, >= 128, ... rte_eth_tx_queue_setup(port=0, q=0) failed: Invalid argument before any TX traffic could leave the box. Same shape on the RX side. Fix: - New pure helper dpdk_clamp_ring_size in src/dpdk-ring-clamp.c that clamps a requested ring size against the three constraints in struct rte_eth_desc_lim: nb_max, nb_min, nb_align (any field reported as 0 means the PMD did not advertise that constraint and is skipped). - dpdk_eal_bringup now calls the helper using dev_info.{tx,rx}_desc_lim from the existing rte_eth_dev_info_get call before each rte_eth_*queue_setup, and logs the clamp at info level so the operator sees the active ring size. - include/dpdk-defs.h documents the per-PMD cap (ENA: nb_max=512; i40e/ixgbe/mlx5: 4096..8192) so the constant choice is self- explanatory. Test (tests/test_dpdk_clamp.c, run via make unit-tests): - test_ena_{tx,rx}_clamps_1024_to_512: the bench regression repro. - test_in_range_no_clamp: 512/256/128 against ENA limits → no change. - test_below_min_bumps_to_min: 64 → 128. - test_no_limits_passthrough: PMD reports all-zero limits → no clamp. - test_alignment_rounds_down: align=64 with max=600 → 576. - test_alignment_rounds_up_when_needed: align=64 with min=200 → rounds 200 up to 256 (the only valid value). Verified RED: pre-source the test fails to link with implicit-function- declaration and missing src/dpdk-ring-clamp.c. Adding the source + header declaration drives 7/7 to GREEN. Build verification on a host with libdpdk-dev installed: make USE_DPDK=1 test → 11/11 io_engine_dispatch + 8/8 dpdk_dispatch + 2/2 unit tests pass. Refs: AnyVM-Tech/AnyGPT#65 issuecomment-4339242358 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bug A from anygpt-52 c6in.metal bench (PR 65 issuecomment-4339242358): on AWS ENA the AF_XDP fall-back ladder (drv+zerocopy → drv+copy → skb) segfaults on the second attempt. The first xsk_socket__create returns -EOPNOTSUPP from bind() *after* libxdp has already attached its xsks_map redirect program in DRV mode; the next attempt's xsk_socket__create then segfaults inside libxdp's program-management layer trying to manage a stale program slot. This made --io-engine=af_xdp untestable on c6in.metal. Repro: scanner -i ens1 --io-engine=af_xdp -T 4 -R 4 198.18.0.0/24 1-1024 Fix: every failed attempt fully tears down what libxdp may have set up before returning, so the next attempt sees a fully clean interface. - New helper afxdp_full_teardown_after_failed_bind that releases the four primitives required for clean state: 1. xsk_socket__delete (release partially-bound socket) 2. xsk_umem__delete (release UMEM rings) 3. bpf_xdp_detach (mode=0) (clear interface XDP program slot, regardless of DRV vs SKB attach) 4. zero out s->{xsk, umem, xsk_fd, free_*, bound_mode, rings} plus free of the per-thread heap (free_stack, umem_area). - afxdp_try_bind now owns UMEM lifetime per attempt: allocates fresh via afxdp_alloc_umem at the top, calls the teardown helper on xsk_socket__create failure, returns -1. - afxdp_tx_init_per_thread no longer pre-allocates UMEM outside the bind ladder — the bind helper handles it. The redundant teardown in the "ladder exhausted" branch is gone (already torn down by the last failed attempt; only the outer struct itself remains to free). Includes pulled in: <bpf/libbpf.h> (bpf_xdp_detach), <net/if.h> (if_nametoindex). The libxdp/libbpf -dev packages already provide both; no Makefile change needed since the AF_XDP build links libbpf via pkg-config. Test (tests/test_afxdp_teardown.sh, run via make unit-tests): Structural assertions on src/send-afxdp.c — libxdp behaviour requires a NIC + root to exercise live, so this verifies that the source contains the teardown primitives in the right places. A future regression that removes any of them fails this test without needing to reproduce the segfault on hardware. 10 assertions across 4 categories: helper exists, helper calls all four primitives + state zero, afxdp_try_bind dispatches into the helper, single afxdp_alloc_umem call site (proving the UMEM allocation moved inside try_bind). Verified RED: running the structural test against the pre-fix src/send-afxdp.c (HEAD~2) fails 9 of the 10 assertions; restoring the fix returns all 10 to GREEN. Build verification on a host with libxdp+libbpf installed: make USE_AF_XDP=1 USE_DPDK=1 test → 11/11 io_engine_dispatch + 8/8 dpdk_dispatch + 3/3 unit tests pass. No new compile warnings. Refs: AnyVM-Tech/AnyGPT#65 issuecomment-4339242358 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f04c313171

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-28T22:58:13Z

+             * Best-effort: the rc is intentionally ignored — if no
+             * program is attached, bpf_xdp_detach returns -ENOENT which
+             * is the desired no-op. */
+            (void)bpf_xdp_detach(ifindex, 0, NULL);


Avoid detaching shared XDP program on bind retry failure

Calling bpf_xdp_detach(ifindex, 0, NULL) in the retry teardown removes whatever XDP program is attached to the interface, not just state from the failed attempt. In this codebase, run_scan starts sender threads as it iterates (src/engine.c), so a later sender’s failed zerocopy attempt can detach the program that earlier senders are already using, causing reply drops or broken AF_XDP receive routing mid-initialization; the same path also removes any pre-existing non-scanner XDP program on that interface.

Useful? React with 👍 / 👎.

skullcmd and others added 3 commits April 28, 2026 22:37

skullcrushercmd merged commit a22c8f2 into main Apr 28, 2026

skullcrushercmd deleted the fix/c6in-metal-bench-regressions branch April 28, 2026 22:56

chatgpt-codex-connector Bot reviewed Apr 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: 3 c6in.metal bench regressions (AF_XDP fall-back, DPDK ring size, EAL argv split)#5

fix: 3 c6in.metal bench regressions (AF_XDP fall-back, DPDK ring size, EAL argv split)#5
skullcrushercmd merged 3 commits into
mainfrom
fix/c6in-metal-bench-regressions

skullcrushercmd commented Apr 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

skullcrushercmd commented Apr 28, 2026

Summary

Bug A — AF_XDP fall-back segfault (fix(afxdp): full per-attempt teardown in bind-mode ladder)

Bug B — DPDK nb_tx_desc=1024 exceeds ENA nb_max=512 (fix(dpdk): clamp TX/RX descriptor ring size against PMD nb_max)

Bug C — EAL argv splitter mangles --socket-mem 1024 (fix(dpdk): synthesize stable EAL prog-name; never inject argv[0])

What's NOT in this PR

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug A — AF_XDP fall-back segfault (`fix(afxdp): full per-attempt teardown in bind-mode ladder`)

Bug B — DPDK `nb_tx_desc=1024` exceeds ENA `nb_max=512` (`fix(dpdk): clamp TX/RX descriptor ring size against PMD nb_max`)

Bug C — EAL argv splitter mangles `--socket-mem 1024` (`fix(dpdk): synthesize stable EAL prog-name; never inject argv[0]`)