Fetch returns empty/EOF (v13 decode) after a proxy restart — consume silently stalls until a coordinated restart

## Summary

After a routine **proxy restart** (host reboot, rollout), record **Fetch** requests through the kafSCALE proxy return **0 records** to every consumer, while metadata and `ListOffsets` keep working correctly. No client error is raised; the consumer simply runs into its idle timeout. The data is durable (segments in S3, high-watermark correct); only the record-Fetch read path is broken, and it stays broken until a manual **coordinated broker + proxy restart**.

This is a release-grade reliability concern: a routine ops action silently stalls the core read path for all consumers.

Observed on `er1-bdr` (KIND on master2): broker `v1.6.0-readfix` + proxy `v1.6.0`. Reproduced from the host, a compose container, and a remote client.

Per ADR-0002 the Kafka entrypoint is always the **proxy**; clients never talk to the broker directly. We do **not** want a Kafka wire-protocol change — the fix should land in the proxy (and/or as a version/build alignment), not in a broker wire widening.

## Symptom

- `ListOffsets` / `kafka-get-offsets`: correct HWM (e.g. `er1a` `495/521/6` = 1022). OK.
- Consume (franz-go group-less, cp-kafka, remote): **0 records**, no client error.

Proxy log on each fetch:
```
WARN fetch forward failed target=scalytics-primary-broker...:9092 error="read frame size: EOF"
WARN fetch forward failed target=scalytics-primary-broker...:9092 error="decode fetch response v13: response did not contain enough data to be valid"
```
The proxy forwards to the correct internal broker DNS (not a self-loop). The broker returns a partial/EOF **v13** response.

## Root cause analysis (from source)

Three factors compound:

**1. The proxy is a full Fetch codec, not a byte forwarder.**
`forwardFetch` (`cmd/proxy/main.go`) **decodes** the broker's `FetchResponse`, fans sub-requests out per broker, merges them, and **re-encodes** the response (`protocol.EncodeResponse`). So any mismatch between *broker-side encode* and *proxy-side decode* of a Fetch response breaks consume — the proxy cannot parse the broker's reply.

**2. Fetch v13 + a build/version drift between broker and proxy.**
The broker advertises **Fetch min 11 / max 13** (`cmd/broker/main.go`: `{key: protocol.APIKeyFetch, minVersion: 11, maxVersion: 13}`), so clients negotiate v13. In the failing setup the **broker ran a fork tag (`v1.6.0-readfix`) while the proxy was plain `v1.6.0`**, and v13 (de)serialization is not guaranteed to match between those two builds. The two proxy errors read as follows:
- `read frame size: EOF` (`pkg/protocol/frame.go` `ReadFrame`) = the broker returned nothing / closed the connection.
- `decode fetch response v13: response did not contain enough data to be valid` = the proxy read a frame, but its payload was too short to be a valid v13 `FetchResponse`.

A matched official `v1.6.0` pair was previously **byte-clean** in a group-less franz-go round-trip (our read-path report, 2026-06-04), which points to the drift as the trigger.

**3. The proxy does not recover — it retries only the wrong error class.**
`forwardFetch`'s retry loop retries **only** `NOT_LEADER_OR_FOLLOWER` partitions. A connection / decode failure is mapped to `REQUEST_TIMED_OUT` and is **not** retried on a fresh connection. The bad connection is closed (`fanOutFetch`), but the fetch is reported as failed, so the client sees an empty read. The coordinated restart only "heals" it by forcing broker and proxy back into a freshly-matched state.

**Conclusion:** not data loss, not a broken storage path. It is a **Fetch-v13 compatibility / connection-state gap between a fork-broker and a release-proxy that the proxy does not absorb.**

## Workaround (verified)

Coordinated broker + proxy restart re-negotiates a matched pair and restores consume:
```bash
kubectl -n kafscale delete pod scalytics-primary-broker-0
kubectl -n kafscale wait --for=condition=ready pod/scalytics-primary-broker-0 --timeout=120s
kubectl -n kafscale rollout restart deploy/kafscale-proxy
# -> consume returns records again (verified: 1022 records, 0 failed)
```

## What we want to achieve / proposed direction (discussion)

No wire-protocol change; fix in proxy / build / chart; always via the proxy.

1. **Matched build pair.** Proxy and broker built from the same source/tag; eliminate the `-readfix` fork-vs-release drift so v13 Fetch (de)serialization is guaranteed compatible. Cheapest, highest-leverage, likely the actual trigger.
2. **Proxy resilience to a stale/half-open backend connection.** On a frame / EOF / decode error, discard the connection **and retry on a fresh one** (extend the retry beyond `NOT_LEADER_OR_FOLLOWER` to connection-level failures) instead of silently returning `REQUEST_TIMED_OUT`.
3. **(Optional) Down-negotiate the forwarded Fetch version** to the broker's reliably-served version rather than blindly forwarding v13.
4. **Regression smoke:** restart the proxy, then assert a consume round-trip still returns records.
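
For direction 3, the version selection could look roughly like the sketch below. `versionRange`, `forwardFetchVersion` and the `ceiling` value are hypothetical; the only figures taken from the source are the broker's advertised Fetch range of 11 to 13. Because the proxy already decodes and re-encodes Fetch responses, it would still have to answer the client in the client's negotiated version, which this sketch does not cover.

```go
package main

import "fmt"

// versionRange is the API version range a broker advertises for one API key.
type versionRange struct{ Min, Max int16 }

// forwardFetchVersion picks the Fetch version the proxy would use towards the
// broker: the client's version, lowered to an operator-chosen ceiling and to
// the broker's advertised maximum. It fails if the result falls below the
// broker's minimum.
func forwardFetchVersion(client int16, broker versionRange, ceiling int16) (int16, error) {
	v := client
	if v > ceiling {
		v = ceiling
	}
	if v > broker.Max {
		v = broker.Max
	}
	if v < broker.Min {
		return 0, fmt.Errorf("no usable Fetch version: client=%d ceiling=%d broker=[%d,%d]",
			client, ceiling, broker.Min, broker.Max)
	}
	return v, nil
}

func main() {
	broker := versionRange{Min: 11, Max: 13} // advertised range from the issue
	// ceiling=12 is an assumed "reliably served" version, not a measured one.
	v, err := forwardFetchVersion(13, broker, 12)
	fmt.Println(v, err)
}
```

Rejecting the request outright when no version fits is a deliberate choice in this sketch: a clear error is easier to diagnose than another silent empty read.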

## Questions for discussion

- Is the `-readfix` broker tag an outdated fork, and is the matched-pair drift the accepted root cause — or is there a known v13 encode/decode incompatibility in the current release pair regardless of build?
- Would maintainers prefer the proxy to (2) auto-recover connection-level Fetch failures, and/or (3) pin/down-negotiate the forwarded Fetch version? Either keeps the wire protocol unchanged.
- Is there appetite for the proxy-restart consume-regression smoke as a CI gate?

Happy to provide the full read-path report and reproduction artifacts.

