Backups: Add support for cluster and keyspace level backup schedules#758
Backups: Add support for cluster and keyspace level backup schedules#758
Conversation
And stagger them across shards to avoid resource usage issues. Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR enhances the VitessBackupSchedule feature to reduce operational overhead and bandwidth spikes in large Vitess deployments by introducing scope-based strategy expansion (Shard/Keyspace/Cluster) and frequency-based scheduling that deterministically staggers per-shard cron schedules.
Changes:
- Add
scopesupport to expand strategies at reconcile-time for Keyspace- and Cluster-wide backups, including auto-exclusion of keyspaces overridden by Keyspace-scope schedules. - Add
frequency(Go duration) as an alternative to cronschedule, generating deterministic staggered per-shard cron expressions and surfacing them via status. - Update CRDs/docs and add unit + e2e coverage for the new scheduling/expansion behavior.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go |
Implements strategy expansion, frequency-derived schedules, new status fields, and shard bootstrap gating logic. |
pkg/controller/vitessbackupschedule/schedule_generator.go |
Adds deterministic cron generation from a validated frequency. |
pkg/apis/planetscale/v2/vitessbackupschedule_types.go |
Extends the API with BackupScope, frequency, and new status maps. |
pkg/apis/planetscale/v2/vitessbackupschedule_methods.go |
Adds validation for schedule/frequency mutual exclusivity, supported frequencies, and scope/keyspace/shard consistency. |
deploy/crds/planetscale.com_vitessbackupschedules.yaml |
Updates CRD schema for new fields and relaxed requirements. |
deploy/crds/planetscale.com_vitessclusters.yaml |
Updates embedded schedule schema for new fields and relaxed requirements. |
pkg/operator/vitessbackup/schedule.go |
Allows schedule creation when either schedule or frequency is set. |
pkg/apis/planetscale/v2/zz_generated.deepcopy.go |
Deepcopy support for new status maps. |
pkg/controller/vitessbackupschedule/schedule_generator_test.go |
Unit tests for frequency→cron generation. |
pkg/controller/vitessbackupschedule/expand_strategy_test.go |
Unit tests for strategy expansion and exclusion semantics. |
pkg/apis/planetscale/v2/vitessbackupschedule_methods_test.go |
Unit tests for new validation and status initialization. |
test/endtoend/backup_schedule_keyspace_test.sh |
New e2e test for keyspace/cluster scope schedules and status observability. |
test/endtoend/operator/102_initial_cluster_keyspace_backup_schedule.yaml |
New e2e fixture cluster with cluster+keyspace schedules using frequency. |
test/endtoend/operator/401_scheduled_backups.yaml |
Extends existing e2e fixture with frequency + scope examples. |
test/endtoend/operator/operator-latest.yaml |
Updates embedded CRDs used by e2e tests. |
docs/api.md / docs/api/index.html |
Regenerated API docs for new fields/types. |
Makefile |
Adds backup-schedule-keyspace-test target. |
.buildkite/pipeline.yml |
Adds CI step to run the new e2e test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
6ed1a3d to
4fbc25a
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
|
/cc @bluecrabs007 please let me know if you have any feedback. Thanks! ❤️ |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Matt Lord <mattalord@gmail.com>
This looks pretty good, thank you! |
nickvanw
left a comment
There was a problem hiding this comment.
Left three inline comments for the correctness issues I found in the new exclusion/status paths.
pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Matt Lord <mattalord@gmail.com>
nickvanw
left a comment
There was a problem hiding this comment.
Left two follow-up comments for the remaining correctness issues.
Signed-off-by: Matt Lord <mattalord@gmail.com>
nickvanw
left a comment
There was a problem hiding this comment.
Approving to unblock.
Practical take: the happy path here looks good, and the focused tests plus package-level Go tests passed on my side. The remaining comments are edge-case semantics rather than happy-path regressions. If those edge cases matter for the intended support model, they’re worth a follow-up; otherwise I don’t think they should block merge.
| if sched.Spec.Cluster != vbsc.Spec.Cluster || scheduleSuspended(sched) { | ||
| continue | ||
| } | ||
| if err := sched.Spec.VitessBackupScheduleTemplate.ValidateScheduleConfig(); err != nil { |
There was a problem hiding this comment.
This now skips peers that fail ValidateScheduleConfig() / ValidateStrategies(), which fixes the obvious invalid-override case. One edge case still remains: a peer schedule that passes those checks but would fail this controller's duplicate-effective-target validation can still contribute exclusions here. In that case the peer can never successfully reconcile, but it can still suppress cluster-scope coverage for the keyspace. If exclusion is meant to mean "this peer can actually take over coverage", it probably needs the same effective-target validity gate as the main reconcile path.
There was a problem hiding this comment.
Fixed. Peer schedules now have to pass the same duplicate-effective-target preflight before they can contribute exclusions, so a peer that can never reconcile no longer suppresses cluster-scope coverage. Covered by TestExpandStrategy_ClusterScopeDuplicateInvalidPeerDoesNotExclude.
| } | ||
|
|
||
| func (r *ReconcileVitessBackupsSchedule) validateNoDuplicateEffectiveShardTargets(ctx context.Context, vbsc planetscalev2.VitessBackupSchedule) error { | ||
| expansionCtx, err := r.buildLocalStrategyExpansionContext(ctx, vbsc) |
There was a problem hiding this comment.
This duplicate-target preflight only expands against local exclusions. That means a config like:
- this schedule:
Cluster+ explicitShard(customer/-80) - peer schedule:
Keyspace(customer)
gets rejected here even though the real runtime expansion for this schedule would exclude customer because of the peer override. So this blocks a valid "cluster default + peer keyspace override + extra hot-shard backup" shape. If that additive model is intended, this validator probably needs to reason over the full exclusion context rather than only local exclusions.
There was a problem hiding this comment.
Fixed. Duplicate-target validation now uses the full exclusion context instead of only local exclusions, so peer keyspace overrides are taken into account before we reject additive shapes. Covered by TestReconcileStrategies_AllowsExplicitShardWhenPeerKeyspaceOverrideExcludesClusterExpansion.
|
Link a feature request #767 |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
frouioui
left a comment
There was a problem hiding this comment.
let's port the changes (to the api, and to the tests) to the vitess repository, otherwise lgtm! good work
Fixes: #752
Problem
The VitessBackupSchedule feature requires users to specify each shard explicitly in
strategies[]. For large deployments with many shards across multiple keyspaces, this creates two problems:Solution
This PR adds three capabilities to VitessBackupSchedule:
1. Scope-based strategies
A new
scopefield onVitessBackupScheduleStrategywith three values:Shard(default) -- existing behavior, targets a single explicit shardKeyspace-- dynamically discovers all shards in the specified keyspaceCluster-- dynamically discovers all shards across all keyspaces in the clusterScope expansion happens in the controller at reconcile time, so it automatically picks up new shards added during resharding without any config changes.
2. Frequency-based scheduling
A new
frequencyfield (Go duration string like"24h","6h","30m") as an alternative to the existingschedule(cron) field. Whenfrequencyis set, the controller generates deterministic per-shard cron schedules that are staggered across the interval using a SHA-256 hash of the shard identity. This means:.status.generatedSchedulesfor observability3. Auto-exclusion
When a Keyspace-scope strategy exists for a keyspace (in any VitessBackupSchedule for the same cluster), that keyspace is automatically excluded from Cluster-scope expansion. This enables clean override semantics:
Shard-scope strategies do NOT trigger exclusion -- they are additive (e.g., an extra backup for a hot shard).
Implementation Details
API Changes (
pkg/apis/planetscale/v2/)vitessbackupschedule_types.go:BackupScopeenum type (Shard,Keyspace,Cluster)Scopefield toVitessBackupScheduleStrategy(optional, defaults toShard)Frequencyfield toVitessBackupScheduleTemplate(optional, mutually exclusive withSchedule)GeneratedSchedules map[string]stringtoVitessBackupScheduleStatusNextScheduledTimes map[string]*metav1.TimetoVitessBackupScheduleStatusNewVitessBackupScheduleStatus()to initialize the new mapsvitessbackupschedule_methods.go:ValidateScheduleConfig()-- ensures exactly one of Schedule/Frequency is set; validates frequency is cron-representableValidateBackupFrequency()-- rejects sub-minute, non-whole-minute, non-cron-divisible frequencies with actionable error messagesValidateStrategies()-- validates Scope/Keyspace/Shard field consistencySchedule Generator (
pkg/controller/vitessbackupschedule/schedule_generator.go) -- NEWgenerateCronFromFrequency(frequency, cluster, keyspace, shard, scheduleName):sha256(cluster|keyspace|shard|scheduleName)to produce a deterministic uint64 offsetoffset = hash % totalMinutescronFromInterval():"MM HH * * *""MM HH/step * * *""MM * * * *""MM/step * * * *"cron.ParseStandard()Controller Changes (
pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go)reconcileStrategies()-- modified to:strategyExpansionContext(caches all keyspace/shard data and excluded keyspaces once per reconcile)expandStrategy()for each strategy to get per-shard strategiesLastScheduledTimes,GeneratedSchedules, andNextScheduledTimesentries for strategies that no longer existbuildStrategyExpansionContext()-- new method:expandStrategy()-- new method:Shardscope: returns strategy as-isKeyspacescope: uses cached shards from expansion context, returns one strategy per shard with unique names ("{base}-{keyspace}-{shardSafe}")Clusterscope: iterates cached keyspaces, skips excluded ones, expands remaining keyspacesgetExcludedKeyspaces()-- new method:map[string]boolsetreconcileStrategy()-- modified to:FrequencyviagenerateCronFromFrequency()when setstatus.generatedSchedulesstatus.nextScheduledTimesfor all code pathsshardReadyForScheduledBackup()-- new function:getNextSchedule()-- refactored to accept cron schedule string as parameter instead of reading fromvbsc.Spec.ScheduleOther Changes
pkg/operator/vitessbackup/schedule.go:NewVitessBackupSchedule()nil check to acceptFrequencyas alternative toSchedule.buildkite/pipeline.yml:backup-schedule-keyspace-testCI step to run the new e2e test in Buildkitetest/endtoend/operator/operator-latest.yaml:frequency,scope,generatedSchedules, andnextScheduledTimesfieldstest/endtoend/operator/102_initial_cluster_keyspace_backup_schedule.yamlandtest/endtoend/operator/401_scheduled_backups. +yaml:mysql80tomysql84Generated files (via
make generate):deploy/crds/planetscale.com_vitessbackupschedules.yaml-- CRD updateddeploy/crds/planetscale.com_vitessclusters.yaml-- CRD updated (schedules are embedded in VitessCluster)pkg/apis/planetscale/v2/zz_generated.deepcopy.go-- deepcopy for new status mapsdocs/api.md,docs/api/index.html-- API docs regeneratedBackward Compatibility
All changes are backward compatible:
Scopedefaults toShardwhen omitted, preserving existing per-shard behaviorSchedulefield remains supported;Frequencyis an alternativekeyspace/shardstrategies work unchangedbackup-schedule-teste2e test passes without modificationTesting
Unit Tests (26 new)
schedule_generator_test.go(11 tests):expand_strategy_test.go(12 tests):vitessbackupschedule_methods_test.go(3 tests):E2E Tests
backup_schedule_keyspace_test.sh(new):frequency: "1m")frequency: "1m")generatedSchedulesappear in status (frequency-based scheduling works)