[schema] Bump PartitionState v23 + StoreMetaValue v43 for prc record-count verification#2776
Merged
ymuppala merged 2 commits intolinkedin:mainfrom May 7, 2026
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces new Avro protocol versions (schema-only, additive) needed for upcoming server-side batch-push record-count verification, while keeping the build pinned to the prior schema versions so existing generated classes/protocol constants remain unchanged until the follow-up wiring lands.
Changes:
- Add
PartitionStatev23 withbatchPushRecordCount(long, default 0). - Add
StoreMetaValuev43 andAdminOperationv98 withbatchPushRecordCountVerificationEnabled(boolean, default false). - Pin
compileAvro.versionOverridesinbuild.gradleto continue generating code from v22/v42/v97 until the follow-up PR removes the pins and updates protocol constants.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| services/venice-controller/src/main/resources/avro/AdminOperation/v98/AdminOperation.avsc | Adds the per-store verification enablement flag to the admin topic protocol (UpdateStore payload) in a new schema version. |
| internal/venice-common/src/main/resources/avro/StoreMetaValue/v43/StoreMetaValue.avsc | Adds the per-store verification enablement flag to the store metadata system store schema in a new version. |
| internal/venice-common/src/main/resources/avro/PartitionState/v23/PartitionState.avsc | Adds persisted batchPushRecordCount to support consumer-side record count tracking in a new version. |
| build.gradle | Pins Avro codegen to prior schema versions via versionOverrides to avoid runtime/protocol mismatches until follow-up code lands. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…count verification Schema-only PR. Adds two of the three schema versions needed for upcoming server-side batch-push record-count verification (prc) work. AdminOperation v98 lands in a separate follow-up PR (split to stay under the enforce-lines-added 2000-line cap). Schemas added: - PartitionState v22 -> v23: adds `batchPushRecordCount` (long, default 0). Persists the server-side count of batch-push data records, used at EOP to verify against the producer-side `prc` PubSub header (PR linkedin#2758). - StoreMetaValue v42 -> v43: adds `batchPushRecordCountVerificationEnabled` (boolean, default false). Per-store opt-in for the throw path on record-count deficits. Both additive with safe defaults — backward-compatible. `build.gradle` pins compileAvro to v22/v42 via `versionOverrides`. The follow-up consumer-code PR will remove the pins and bump `AvroProtocolDefinition.PARTITION_STATE` (22->23) and `METADATA_SYSTEM_SCHEMA_STORE` (42->43) constants in lockstep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
ymuppala
added a commit
to ymuppala/venice
that referenced
this pull request
May 6, 2026
Schema-only PR. Adds AdminOperation v98 to carry the new per-store `batchPushRecordCountVerificationEnabled` flag on the SetStore admin op, so parent controllers can propagate it to child controllers via the admin topic. Stacked on top of PR linkedin#2776 (PartitionState v23 + StoreMetaValue v43) because both PRs touch the same `versionOverrides` block in build.gradle. Stacking ensures the override list grows monotonically with no merge conflict when both PRs land. Once linkedin#2776 merges, this PR's branch will be rebased on new main and the diff will collapse to just the AdminOperation additions (the v22 / v42 pins from linkedin#2776 will already be in main). Field added (additive, default false): batchPushRecordCountVerificationEnabled (boolean) on SetStore. `build.gradle` extends the existing `versionOverrides` list (PartitionState/v22 and StoreMetaValue/v42 from linkedin#2776) with AdminOperation/v97. The follow-up consumer-code PR removes all three pins together when bumping `AvroProtocolDefinition.PARTITION_STATE` / METADATA_SYSTEM_SCHEMA_STORE / ADMIN_OPERATION constants in lockstep with the controller wiring. Backward-compatible: v97-aware deserializers reading v98 records treat the unknown field as default `false`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xunyin8
approved these changes
May 7, 2026
ymuppala
added a commit
that referenced
this pull request
May 7, 2026
ymuppala
added a commit
to ymuppala/venice
that referenced
this pull request
May 7, 2026
linkedin#2776 (PartitionState v23 + StoreMetaValue v43) and linkedin#2777 (AdminOperation v98) merged with versionOverrides pinning compileAvro to the prior versions (v22, v42, v97). This PR consumes those new schemas via the AvroProtocolDefinition bumps (already in earlier commits on this branch), so the pins must come down in lockstep. After this change, compileAvro picks v23/v43/v98 (no OVERRIDE) and the generated SpecificRecord classes contain the new fields. The unrelated KafkaMessageEnvelope/v13 pin (from linkedin#2778) is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sushantmane
added a commit
to sushantmane/venice
that referenced
this pull request
May 8, 2026
Main landed PR linkedin#2776 (PartitionState v23 with batchPushRecordCount, pinned to v22 until prc record-count consumer code lands), so my donor-clock freshness field can no longer claim v23. Move it to a new v24 that carries v23 + donorHeartbeatTimestampMs together. Bumps the PartitionState avro override from v22 → v24 so consumer code in this PR can read/write the new field. v23's batchPushRecordCount is exposed as a side effect, but no consumer reads/writes it yet — existing data sees default 0 and behaves unchanged. The prc record-count team's follow-up PR will replace this override (or remove it once v24 is the real latest) when their consumer code lands.
sushantmane
added a commit
to sushantmane/venice
that referenced
this pull request
May 8, 2026
…in-place Earlier I bumped to v24 to avoid colliding with PR linkedin#2776's v23 (batchPushRecordCount), but v23 has no production data — PR linkedin#2776 pinned the avro override at v22 because its consumer code wasn't ready, so v23 has only ever existed as a schema file. That makes it safe to extend v23 in-place with one more additive field instead of carrying an unused intermediate version. Result: v22 (active wire format today) └─ v23 = batchPushRecordCount (PR linkedin#2776) + donorHeartbeatTimestampMs (thi s PR) Avro override bumps from v22 → v23. AvroProtocolDefinition.PARTITION_STATE goes from (24, 24) to (24, 23). v24 directory deleted. The prc record-count team's follow-up PR no longer needs to coordinate a v24/v23 swap — they just remove the override (or land their consumer code on v23 alongside this PR's). batchPushRecordCount is now part of the active schema; default 0 means replicas behave unchanged until the prc consumer code reads/writes it. Verified: TestOffsetRecord (all pass), LeaderFollowerStoreIngestionTaskTest blob-transfer suite (all 7 pass), reportIfCatchUpVersionTopicOffset and testReadyToServePartitionValidateIngestionSuccessWithPriorState (pass for AA_ON / AA_OFF). Generated PartitionState class carries both fields.
sushantmane
added a commit
to sushantmane/venice
that referenced
this pull request
May 9, 2026
Main landed PR linkedin#2776 (PartitionState v23 with batchPushRecordCount, pinned to v22 until prc record-count consumer code lands), so my donor-clock freshness field can no longer claim v23. Move it to a new v24 that carries v23 + donorHeartbeatTimestampMs together. Bumps the PartitionState avro override from v22 → v24 so consumer code in this PR can read/write the new field. v23's batchPushRecordCount is exposed as a side effect, but no consumer reads/writes it yet — existing data sees default 0 and behaves unchanged. The prc record-count team's follow-up PR will replace this override (or remove it once v24 is the real latest) when their consumer code lands.
sushantmane
added a commit
to sushantmane/venice
that referenced
this pull request
May 9, 2026
…in-place Earlier I bumped to v24 to avoid colliding with PR linkedin#2776's v23 (batchPushRecordCount), but v23 has no production data — PR linkedin#2776 pinned the avro override at v22 because its consumer code wasn't ready, so v23 has only ever existed as a schema file. That makes it safe to extend v23 in-place with one more additive field instead of carrying an unused intermediate version. Result: v22 (active wire format today) └─ v23 = batchPushRecordCount (PR linkedin#2776) + donorHeartbeatTimestampMs (thi s PR) Avro override bumps from v22 → v23. AvroProtocolDefinition.PARTITION_STATE goes from (24, 24) to (24, 23). v24 directory deleted. The prc record-count team's follow-up PR no longer needs to coordinate a v24/v23 swap — they just remove the override (or land their consumer code on v23 alongside this PR's). batchPushRecordCount is now part of the active schema; default 0 means replicas behave unchanged until the prc consumer code reads/writes it. Verified: TestOffsetRecord (all pass), LeaderFollowerStoreIngestionTaskTest blob-transfer suite (all 7 pass), reportIfCatchUpVersionTopicOffset and testReadyToServePartitionValidateIngestionSuccessWithPriorState (pass for AA_ON / AA_OFF). Generated PartitionState class carries both fields.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Part 2
A Spark-based VPJ KIF repush silently wrote batch data to the wrong datacenter (incident-10468).
KAFKA_INPUT_BROKER_URLleaked into the writer config and overwroteKAFKA_BOOTSTRAP_SERVERS. Two of three DCs lost batch data while the push job reported success — the store served partial data for hours before anyone noticed.The root cause was an unguarded trust boundary: Venice servers count records as they ingest, but never compared their count against what the producer claimed it sent. Any failure mode that drops records between VPJ and the server — wrong-DC writes, partial broker connectivity loss, intermittent produce errors not surfaced to VPJ — would replay 10468.
Catching this end-to-end requires:
prc(per-partition record count) PubSub header to every EOP control message.ExecutionStatus.ERRORto the controller and as a tagged checkpoint to the VPJ runner, so the push job fails fast instead of completing silently with missing data.Solution
Schema-only PR — adds two of the three Avro schema versions needed for the upcoming consumer-side verification, without any Java consumer code yet. AdminOperation v98 is split into a separate follow-up PR to stay under the
enforce-lines-added2000-line cap (two avsc files at 328 + 455 = 783 lines fit comfortably; AdminOperation alone is 1309 lines).The two follow-up PRs:
versionOverridespins from this PR + the AdminOperation PR, bumpsAvroProtocolDefinitionconstants in lockstep, wires the actual consumer code (server-side counting, EOP comparison, per-store throw path, controllerstatusDetailspropagation, VPJ checkpoint detection).Why each schema gets a new version
The schemas live in different layers, and each has its own reason to bump:
PartitionStatebatchPushRecordCount(long, default 0)activeKeyCount, etc.) so it survives restarts and rides along with blob-transfer snapshots. Without a schema bump the count has nowhere to live persistently —OffsetRecord(the runtime wrapper) is just a façade aroundPartitionState.StoreMetaValuebatchPushRecordCountVerificationEnabled(boolean, default false)UpdateStoreQueryParamsand roll it back the same way. The flag has to be visible alongside other store-level configs (batchPushEnabled,chunkingEnabled, etc.), and clients reading store metadata for any reason (admin tools, dashboards, push-job validation) need to see it.Both changes here are strictly additive with safe defaults (
0/false). Forward-compatibility: a v23-aware reader can still consume v22 records (missing field reads as default). Backward-compatibility: a v22-aware reader consumes v23 records by ignoring the extra field. No coordinated cutover required.Test plan
./gradlew :internal:venice-common:compileAvro— log confirms(OVERRIDE)annotations on v22 / v42 (the pinned versions actually picked)../gradlew :internal:venice-common:compileJava— Java still compiles cleanly with the old generated schema classes (because protocol constants are unchanged).Backwards compatibility
0/false).🤖 Generated with Claude Code