Skip to content

[schema] Bump AdminOperation v98 for prc record-count verification (stacked on #2776)#2777

Merged
ymuppala merged 3 commits intolinkedin:mainfrom
ymuppala:prc-schema-admin-op-v98
May 7, 2026
Merged

[schema] Bump AdminOperation v98 for prc record-count verification (stacked on #2776)#2777
ymuppala merged 3 commits intolinkedin:mainfrom
ymuppala:prc-schema-admin-op-v98

Conversation

@ymuppala
Copy link
Copy Markdown
Collaborator

@ymuppala ymuppala commented May 6, 2026

Problem

In March 2026, a Spark-based VPJ KIF repush silently wrote batch data to the wrong datacenter (incident-10468). VPJ reported success while two of three DCs lost batch data. The root cause was an unguarded trust boundary: Venice servers had no way to verify that they actually received the records VPJ claimed it sent.

The full fix has three parts: (1) a producer-side prc PubSub header on EOP (already shipped in #2758); (2) consumer-side counting + comparison + per-store opt-in throw path; (3) failure plumbing back to the push job. Parts (2) and (3) require schema bumps, which land schema-only first per repo policy, then in the consumer-code PR.

Solution

This PR adds AdminOperation v98. Companion to #2776 (which adds PartitionState v23 + StoreMetaValue v43); split out separately because AdminOperation.avsc alone is 1309 lines and combining all three would push #2776 over the enforce-lines-added 2000-line cap.

Stacking on #2776

Both #2776 and this PR modify the same versionOverrides block in build.gradle. To avoid a merge conflict when both PRs land, this PR is stacked on top of #2776:

Why AdminOperation needs a new version

Aspect Value
Layer Parent → child controller admin topic
Field added on SetStore batchPushRecordCountVerificationEnabled (boolean, default false)
Why a new version is required Store config is authored on the parent controller (where UpdateStore lands) and replicated to every child controller via this admin topic. Without a bump on the SetStore message, child controllers can't receive the new flag — opting a store in on the parent would be a no-op everywhere else.

The change is strictly additive with safe default false. Forward-compatibility: a v98-aware reader can still consume v97 records (missing field reads as false). Backward-compatibility: a v97-aware reader consumes v98 records by ignoring the extra field. No coordinated cutover required.

Build pinning during the schema-only window

build.gradle adds an entry to compileAvro.versionOverrides pinning the active AdminOperation schema to v97 even while v98 sits in the source tree. Pattern documented at build.gradle:328-330. This is critical because:

  • The new .avsc file needs to be physically merged so it's in main before any consumer code references it.
  • If we let the build pick up v98 immediately, it would become the "active" schema — but no controller code reads/writes the new field yet. Bumping AvroProtocolDefinition.ADMIN_OPERATION to 98 here without consumer code would lock us into a half-built state where serialized admin ops record protocol version 98 but carry no new payload.

Keeping AvroProtocolDefinition.java at the old constant and pinning the gradle build to v97 ensures zero runtime behavior change from this PR — v98 is dormant in the source tree, ready for the follow-up consumer-code PR to activate it in lockstep with VeniceParentHelixAdmin and AdminExecutionTask wiring.

Companion PRs

  • #2776 — PartitionState v23 + StoreMetaValue v43 (schema-only, also pinned). This PR depends on [schema] Bump PartitionState v23 + StoreMetaValue v43 for prc record-count verification #2776 — review/merge in order.
  • Consumer-code PR (rebased from #2774, lands after both schema PRs merge) — removes all versionOverrides pins, bumps AvroProtocolDefinition.PARTITION_STATE (22→23), METADATA_SYSTEM_SCHEMA_STORE (42→43), ADMIN_OPERATION (97→98) constants, wires the actual consumer code: server-side counting, EOP comparison, per-store throw path, controller statusDetails propagation, VPJ checkpoint detection.

Test plan

  • ./gradlew :services:venice-controller:compileAvro — log confirms (OVERRIDE) annotation on AdminOperation/v97 (the pinned version actually picked).
  • ./gradlew :internal:venice-common:compileAvro — same for PartitionState/v22 and StoreMetaValue/v42 (inherited from [schema] Bump PartitionState v23 + StoreMetaValue v43 for prc record-count verification #2776's commit).
  • CI: AvroCompatibilityTests verifies the new schemas are forward/backward compatible with all prior versions.

Backwards compatibility

  • No. The new field is additive with safe default; older readers treat it as default false.
  • This PR alone introduces zero runtime behavior change. Behavior changes arrive in the follow-up consumer-code PR.

🤖 Generated with Claude Code

…count

verification

Schema-only PR. Adds two of the three schema versions needed for upcoming
server-side batch-push record-count verification (prc) work. AdminOperation
v98 lands in a separate follow-up PR (split to stay under the
enforce-lines-added 2000-line cap).

Schemas added:
- PartitionState v22 -> v23: adds `batchPushRecordCount` (long, default 0).
  Persists the server-side count of batch-push data records, used at EOP
  to verify against the producer-side `prc` PubSub header (PR linkedin#2758).
- StoreMetaValue v42 -> v43: adds `batchPushRecordCountVerificationEnabled`
  (boolean, default false). Per-store opt-in for the throw path on
  record-count deficits.

Both additive with safe defaults — backward-compatible.

`build.gradle` pins compileAvro to v22/v42 via `versionOverrides`. The
follow-up consumer-code PR will remove the pins and bump
`AvroProtocolDefinition.PARTITION_STATE` (22->23) and
`METADATA_SYSTEM_SCHEMA_STORE` (42->43) constants in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 21:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new AdminOperation v98 Avro schema to carry a new store-level flag used by upcoming server-side batch-push record-count verification, while keeping current runtime behavior unchanged by pinning Avro generation to v97 during the schema-only window.

Changes:

  • Added services/venice-controller/.../AdminOperation/v98/AdminOperation.avsc with the new boolean field batchPushRecordCountVerificationEnabled (default false) on the store-config admin message payload.
  • Updated root build.gradle compileAvro.versionOverrides to keep the controller’s active AdminOperation schema pinned to v97 until the follow-up wiring PR lands.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated no comments.

File Description
services/venice-controller/src/main/resources/avro/AdminOperation/v98/AdminOperation.avsc Introduces AdminOperation v98 with the new store config flag (additive, default false).
build.gradle Pins controller Avro generation to AdminOperation v97 so v98 can land inertly until code is ready.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Schema-only PR. Adds AdminOperation v98 to carry the new per-store
`batchPushRecordCountVerificationEnabled` flag on the SetStore admin op,
so parent controllers can propagate it to child controllers via the
admin topic.

Stacked on top of PR linkedin#2776 (PartitionState v23 + StoreMetaValue v43)
because both PRs touch the same `versionOverrides` block in build.gradle.
Stacking ensures the override list grows monotonically with no merge
conflict when both PRs land. Once linkedin#2776 merges, this PR's branch will be
rebased on new main and the diff will collapse to just the AdminOperation
additions (the v22 / v42 pins from linkedin#2776 will already be in main).

Field added (additive, default false):
  batchPushRecordCountVerificationEnabled (boolean) on SetStore.

`build.gradle` extends the existing `versionOverrides` list
(PartitionState/v22
and StoreMetaValue/v42 from linkedin#2776) with AdminOperation/v97. The follow-up
consumer-code PR removes all three pins together when bumping
`AvroProtocolDefinition.PARTITION_STATE` / METADATA_SYSTEM_SCHEMA_STORE /
ADMIN_OPERATION constants in lockstep with the controller wiring.

Backward-compatible: v97-aware deserializers reading v98 records treat
the unknown field as default `false`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ymuppala ymuppala force-pushed the prc-schema-admin-op-v98 branch from 5b42eb0 to d332d70 Compare May 6, 2026 22:13
@ymuppala ymuppala changed the title [schema] Bump AdminOperation v98 for prc record-count verification [schema] [VALIDATION_OVERRIDE] Bump AdminOperation v98 for prc record-count verification (stacked on #2776) May 6, 2026
@ymuppala ymuppala changed the title [schema] [VALIDATION_OVERRIDE] Bump AdminOperation v98 for prc record-count verification (stacked on #2776) [schema] Bump AdminOperation v98 for prc record-count verification (stacked on #2776) May 7, 2026
@ymuppala
Copy link
Copy Markdown
Collaborator Author

ymuppala commented May 7, 2026

@copilot resolve the merge conflicts in this pull request

Copilot AI review requested due to automatic review settings May 7, 2026 19:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated no new comments.

@ymuppala ymuppala merged commit a780234 into linkedin:main May 7, 2026
110 checks passed
ymuppala added a commit to ymuppala/venice that referenced this pull request May 7, 2026
linkedin#2776 (PartitionState v23 + StoreMetaValue v43) and linkedin#2777 (AdminOperation
v98) merged with versionOverrides pinning compileAvro to the prior
versions (v22, v42, v97). This PR consumes those new schemas via the
AvroProtocolDefinition bumps (already in earlier commits on this branch),
so the pins must come down in lockstep.

After this change, compileAvro picks v23/v43/v98 (no OVERRIDE) and the
generated SpecificRecord classes contain the new fields. The unrelated
KafkaMessageEnvelope/v13 pin (from linkedin#2778) is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants