Summary
After a routine proxy restart (host reboot, rollout), record Fetch through the kafSCALE proxy returns 0 records to every consumer, while metadata / ListOffsets keep working correctly. No client error is raised — the consumer just idle-times-out. The data is durable (segments in S3, high-watermark correct); only the record-Fetch read path is broken, and it stays broken until a manual coordinated broker + proxy restart.
This is a release-grade reliability concern: a routine ops action silently stalls the core read path for all consumers.
Observed on er1-bdr (KIND on master2): broker v1.6.0-readfix + proxy v1.6.0. Reproduced from the host, a compose container, and a remote client.
Per ADR-0002 the Kafka entrypoint is always the proxy; clients never talk to the broker directly. We do not want a Kafka wire-protocol change — the fix should land in the proxy (and/or as a version/build alignment), not in a broker wire widening.
Symptom
ListOffsets / kafka-get-offsets: correct HWM (e.g. er1a 495/521/6 = 1022). OK.
Consume (franz-go group-less, cp-kafka, remote): 0 records, no client error.
Proxy log on each fetch:
WARN fetch forward failed target=scalytics-primary-broker...:9092 error="read frame size: EOF"
WARN fetch forward failed target=scalytics-primary-broker...:9092 error="decode fetch response v13: response did not contain enough data to be valid"
The proxy forwards to the correct internal broker DNS (not a self-loop). The broker returns a partial/EOF v13 response.
Root cause analysis (from source)
Three factors compound:
1. The proxy is a full Fetch codec, not a byte forwarder.
forwardFetch (cmd/proxy/main.go) decodes the broker's FetchResponse, fans sub-requests out per broker, merges them, and re-encodes the response (protocol.EncodeResponse). So any mismatch between broker-side encode and proxy-side decode of a Fetch response breaks consume — the proxy cannot parse the broker's reply.
2. Fetch v13 + a build/version drift between broker and proxy.
The broker advertises Fetch min 11 / max 13 (cmd/broker/main.go: {key: protocol.APIKeyFetch, minVersion: 11, maxVersion: 13}), so clients negotiate v13. In the failing setup the broker ran a fork tag (v1.6.0-readfix) while the proxy was plain v1.6.0, so the v13 (de)serialization is not guaranteed to match between those two builds. The two proxy log errors map onto this as follows:
read frame size: EOF (pkg/protocol/frame.go ReadFrame) = the broker returned nothing / closed the connection.
decode fetch response v13: not enough data = the proxy read a frame, but it was too short to be a valid v13 FetchResponse.
A matched official v1.6.0 pair was previously byte-clean in a group-less franz-go round-trip (our read-path report, 2026-06-04), which points to the drift as the trigger.
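To make the two log messages easier to tell apart, here is a minimal sketch of a size-prefixed frame read. It relies only on the standard Kafka framing (a 4-byte big-endian length followed by the payload, and a response that starts with a 4-byte correlation ID). `readFrame`, `failureMode` and `minResponseLen` are hypothetical names for illustration and are not the code in pkg/protocol/frame.go.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// minResponseLen is the 4-byte correlation ID that every Kafka response starts with.
const minResponseLen = 4

// readFrame is a hypothetical stand-in for a size-prefixed frame reader, not
// the real pkg/protocol code. A production reader would also cap the size.
func readFrame(r io.Reader) ([]byte, error) {
	var sizeBuf [4]byte
	if _, err := io.ReadFull(r, sizeBuf[:]); err != nil {
		return nil, fmt.Errorf("read frame size: %w", err)
	}
	payload := make([]byte, binary.BigEndian.Uint32(sizeBuf[:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, fmt.Errorf("read frame body: %w", err)
	}
	return payload, nil
}

// failureMode classifies what a reader sees for the raw bytes a peer sent.
func failureMode(raw []byte) string {
	payload, err := readFrame(bytes.NewReader(raw))
	switch {
	case errors.Is(err, io.EOF):
		return "no-frame" // peer closed before sending any length prefix
	case err != nil:
		return "truncated" // length prefix or body cut off mid-read
	case len(payload) < minResponseLen:
		return "short-frame" // framed correctly, but too small to decode
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(failureMode(nil))                            // no-frame
	fmt.Println(failureMode([]byte{0, 0, 0, 2, 0, 1}))       // short-frame
	fmt.Println(failureMode([]byte{0, 0, 0, 4, 0, 0, 0, 7})) // ok
}
```

In this sketch "no-frame" corresponds to the "read frame size: EOF" case and "short-frame" to the "not enough data" decode case. Both leave the backend connection in a state that should not be reused.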
3. The proxy does not recover — it retries only the wrong error class.
forwardFetch's retry loop retries only NOT_LEADER_OR_FOLLOWER partitions. A connection / decode failure is mapped to REQUEST_TIMED_OUT and is not retried on a fresh connection. The bad connection is closed (fanOutFetch), but the fetch is reported as failed, so the client sees an empty read. The coordinated restart only "heals" it by forcing broker and proxy back into a freshly-matched state.
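The missing case can be sketched as a bounded retry that treats EOF and short-response errors as connection-level failures and repeats the attempt on a fresh connection. This is an illustration under assumptions, not the proxy's code: `fetchFunc`, `isConnLevel`, `fetchWithReconnect`, `flakyBackend` and `errShortResponse` are hypothetical names, and each call to the fetch function is assumed to dial a new backend connection.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errShortResponse is a hypothetical sentinel for "frame too short to decode".
var errShortResponse = errors.New("response did not contain enough data to be valid")

// fetchFunc performs one Fetch attempt; each call is assumed to use a freshly
// dialed backend connection.
type fetchFunc func() ([]byte, error)

// isConnLevel reports whether err points at a dead or desynchronised
// connection rather than a protocol-level error such as NOT_LEADER_OR_FOLLOWER.
func isConnLevel(err error) bool {
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, errShortResponse)
}

// fetchWithReconnect retries connection-level failures up to maxAttempts times
// and returns the response, the number of attempts made, and the last error.
func fetchWithReconnect(do fetchFunc, maxAttempts int) ([]byte, int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := do()
		if err == nil {
			return resp, attempt, nil
		}
		if !isConnLevel(err) {
			return nil, attempt, err // a new connection would not help
		}
		lastErr = err
	}
	return nil, maxAttempts, fmt.Errorf("fetch failed after %d attempts: %w", maxAttempts, lastErr)
}

// flakyBackend simulates a backend whose first `failures` attempts hit EOF.
func flakyBackend(failures int) fetchFunc {
	calls := 0
	return func() ([]byte, error) {
		calls++
		if calls <= failures {
			return nil, fmt.Errorf("read frame size: %w", io.EOF)
		}
		return []byte("records"), nil
	}
}

func main() {
	resp, attempts, err := fetchWithReconnect(flakyBackend(1), 3)
	fmt.Println(string(resp), attempts, err) // records 2 <nil>
}
```

The attempt limit keeps a persistently broken broker from stalling the client indefinitely. Once it is exhausted the proxy can still fall back to the current REQUEST_TIMED_OUT behaviour.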
Conclusion: not data loss, not a broken storage path. It is a Fetch-v13 compatibility / connection-state gap between a fork-broker and a release-proxy that the proxy does not absorb.
Workaround (verified)
Coordinated broker + proxy restart re-negotiates a matched pair and restores consume:
kubectl -n kafscale delete pod scalytics-primary-broker-0
kubectl -n kafscale wait --for=condition=ready pod/scalytics-primary-broker-0 --timeout=120s
kubectl -n kafscale rollout restart deploy/kafscale-proxy
# -> consume returns records again (verified: 1022 records, 0 failed)
What we want to achieve / proposed direction (discussion)
No wire-protocol change; fix in proxy / build / chart; always via the proxy.
- Matched build pair. Proxy and broker built from the same source/tag; eliminate the -readfix fork-vs-release drift so v13 Fetch (de)serialization is guaranteed compatible. Cheapest, highest-leverage, likely the actual trigger.
- Proxy resilience to a stale/half-open backend connection. On a frame / EOF / decode error, discard the connection and retry on a fresh one (extend the retry beyond NOT_LEADER_OR_FOLLOWER to connection-level failures) instead of silently returning REQUEST_TIMED_OUT.
- (Optional) Down-negotiate the forwarded Fetch version to the broker's reliably-served version rather than blindly forwarding v13.
- Regression smoke: restart the proxy, then assert a consume round-trip still returns records.
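As a rough illustration of the optional down-negotiation idea, the forwarded Fetch version could be chosen as the highest version inside both the client's and the broker's ranges, capped at a configured ceiling. `versionRange`, `pickFetchVersion` and the ceiling of 12 are hypothetical. Only the broker range 11..13 comes from the report above. The sketch only picks a version number; whether the proxy can translate a client's v13 request into a lower forwarded version is a separate question it does not answer.

```go
package main

import "fmt"

// versionRange is an inclusive min/max version range for one API key.
type versionRange struct{ min, max int16 }

// pickFetchVersion returns the highest version supported by both sides that
// does not exceed ceiling. ok is false when the ranges leave no such version.
func pickFetchVersion(client, broker versionRange, ceiling int16) (version int16, ok bool) {
	lo, hi := client.min, client.max
	if broker.min > lo {
		lo = broker.min
	}
	if broker.max < hi {
		hi = broker.max
	}
	if ceiling < hi {
		hi = ceiling
	}
	if hi < lo {
		return 0, false
	}
	return hi, true
}

func main() {
	broker := versionRange{min: 11, max: 13} // range advertised by the broker
	client := versionRange{min: 11, max: 13} // assumed client range
	v, ok := pickFetchVersion(client, broker, 12) // 12 is an arbitrary example ceiling
	fmt.Println(v, ok)                            // 12 true
}
```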
Questions for discussion
- Is the -readfix broker tag an outdated fork, and is the matched-pair drift the accepted root cause, or is there a known v13 encode/decode incompatibility in the current release pair regardless of build?
- Would maintainers prefer the proxy to auto-recover connection-level Fetch failures (second proposal above), to pin/down-negotiate the forwarded Fetch version (third proposal above), or both? Either keeps the wire protocol unchanged.
- Is there appetite for the proxy-restart consume-regression smoke as a CI gate?
Happy to provide the full read-path report and reproduction artifacts.