Dead-letter Kafka records individually within a batch#33

Open
mcruzdev wants to merge 4 commits into kubesmarts:main from mcruzdev:kafka-batch-per-record-dlq

mcruzdev (Contributor) commented Jun 17, 2026

Changes

The consumer now reads records in batches (Message<ConsumerRecords>). Batch ack/nack apply to the whole batch, so a single bad record cannot be nacked on its own without affecting the rest.

Instead of nacking, each failed record is produced to the data-index-events-dlq topic via a dedicated emitter, and the batch is acked exactly once after all dead-letter writes complete.

The batch offset is committed only after every dead-letter write has finished, so a failed record is never dropped: if a dead-letter write itself fails, the batch is not committed (failure-strategy=fail) and its records are reprocessed.

Successfully processed records remain committed via the same ack.

Replaces the incoming dead-letter-queue failure strategy with an outgoing data-index-events-dlq channel.

Closes #32
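
The ack-once flow described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this PR: `BatchDeadLetterSketch`, `handleBatch`, and all of its parameters are hypothetical names, and a `Function` returning a `CompletionStage` stands in for the dedicated dead-letter emitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical stand-in for the batch handler; names are illustrative only. */
final class BatchDeadLetterSketch {

    /**
     * Processes every record in the batch. A record whose processing throws is
     * handed to deadLetterSender instead of being nacked. The ack callback runs
     * exactly once, and only after all dead-letter writes have succeeded. If any
     * dead-letter write fails, ack is never called and the returned stage
     * completes exceptionally, so the caller can leave the batch uncommitted.
     */
    static <R> CompletionStage<Void> handleBatch(
            List<R> batch,
            Consumer<R> processor,
            Function<R, CompletionStage<Void>> deadLetterSender,
            Runnable ack) {
        List<CompletableFuture<Void>> dlqWrites = new ArrayList<>();
        for (R record : batch) {
            try {
                processor.accept(record);
            } catch (RuntimeException e) {
                // Route only the failing record to the dead-letter topic.
                dlqWrites.add(deadLetterSender.apply(record).toCompletableFuture());
            }
        }
        // allOf over an empty array completes immediately, so a clean batch
        // is acked right away.
        return CompletableFuture
                .allOf(dlqWrites.toArray(new CompletableFuture[0]))
                .thenRun(ack);
    }
}
```

In this sketch a batch with no failures is acked immediately, a batch with failures is acked once its dead-letter writes finish, and a failed dead-letter write surfaces as an exceptional stage with no ack.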

  The consumer now reads records in batches (Message<ConsumerRecords>).
  Batch ack/nack apply to the whole batch, so a single bad record cannot
  be nacked on its own without affecting the rest. Instead of nacking,
  each failed record is produced to the data-index-events-dlq topic via a
  dedicated emitter, and the batch is acked exactly once after all
  dead-letter writes complete.

  The batch offset is committed only after every dead-letter write has
  finished, so a failed record is never dropped: if a dead-letter write
  itself fails the batch is not committed (failure-strategy=fail) and the
  records are reprocessed. Successfully processed records remain committed
  via the same ack.

  Replaces the incoming dead-letter-queue failure strategy with an
  outgoing data-index-events-dlq channel.

Signed-off-by: Matheus Cruz <matheuscruz.dev@gmail.com>
mcruzdev force-pushed the kafka-batch-per-record-dlq branch from f069f64 to e108a5e June 17, 2026 17:55
mcruzdev marked this pull request as ready for review June 18, 2026 01:04
gmunozfe (Contributor) commented Jun 23, 2026

@mcruzdev I have added PR mcruzdev#2, which batches both workflow and task persistence.

In my benchmark, this change gave a large improvement in the "fork10" scenario at 40 requests/second:

Before persistence batch:

  • Kafka consumer ~700 events/sec
  • DB materialization ~290 rows/sec
  • lag at finish ~331k

After persistence batch:

  • Kafka consumer ~3.6k events/sec (5x)
  • DB materialization ~1.31k rows/sec (4.5x)
  • lag at finish ~9k and lag=0 after ~11s (almost real time, polling window is 10s)

gmunozfe and others added 2 commits June 23, 2026 09:33
The batched task upsert (persistBatch) cannot create the placeholder
workflow needed when a task event arrives before its workflow, so the
whole batch failed with a foreign key violation and was retried forever.
On a FK violation, roll back and fall back to per-record persistence,
which already creates placeholders idempotently.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Matheus Cruz <matheuscruz.dev@gmail.com>
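
The fallback described in the commit message above can be sketched as follows. This is a hedged illustration, not the committed code: `BatchPersistFallbackSketch`, `persistWithFallback`, `ForeignKeyViolation`, and the callback parameters are hypothetical names, and plain callbacks stand in for the batched upsert, the per-record persistence, and the transaction rollback.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of "try the batch, fall back per record on FK violation". */
final class BatchPersistFallbackSketch {

    /** Hypothetical marker for a foreign key violation raised by the batch path. */
    static final class ForeignKeyViolation extends RuntimeException {
        ForeignKeyViolation(String message) {
            super(message);
        }
    }

    /**
     * Tries the batched path first. On a foreign key violation it rolls back
     * and persists each event individually; the per-record path is assumed to
     * create any missing placeholder workflow idempotently.
     *
     * @return true if the batch path succeeded, false if the fallback ran
     */
    static <E> boolean persistWithFallback(
            List<E> events,
            Consumer<List<E>> persistBatch,
            Consumer<E> persistOne,
            Runnable rollback) {
        try {
            persistBatch.accept(events);
            return true;
        } catch (ForeignKeyViolation e) {
            // Discard whatever the failed batch attempt wrote before retrying.
            rollback.run();
            for (E event : events) {
                persistOne.accept(event);
            }
            return false;
        }
    }
}
```

Only the foreign key case triggers the fallback in this sketch; any other exception propagates to the caller unchanged.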
fjtirado self-requested a review June 25, 2026 09:58

Development

Successfully merging this pull request may close these issues.

Data Index ingestion bottleneck in Mode 3 with Kafka and postgresql: consider grouping events

3 participants