Backups: Add support for cluster and keyspace level backup schedules #758

Merged
mattlord merged 16 commits into main from keyspace_level_backups
Mar 27, 2026

Conversation

@mattlord (Collaborator) commented Mar 8, 2026

Fixes: #752

Problem

The VitessBackupSchedule feature requires users to specify each shard explicitly in strategies[]. For large deployments with many shards across multiple keyspaces, this creates two problems:

  1. Operational burden -- users must maintain per-shard strategy entries as shards change during resharding. Every reshard requires a manual config update.
  2. Bandwidth spikes -- all shards back up at the same cron time, creating resource contention on storage and network.

Solution

This PR adds three capabilities to VitessBackupSchedule:

1. Scope-based strategies

A new scope field on VitessBackupScheduleStrategy with three values:

  • Shard (default) -- existing behavior, targets a single explicit shard
  • Keyspace -- dynamically discovers all shards in the specified keyspace
  • Cluster -- dynamically discovers all shards across all keyspaces in the cluster

Scope expansion happens in the controller at reconcile time, so it automatically picks up new shards added during resharding without any config changes.

2. Frequency-based scheduling

A new frequency field (Go duration string like "24h", "6h", "30m") as an alternative to the existing schedule (cron) field. When frequency is set, the controller generates deterministic per-shard cron schedules that are staggered across the interval using a SHA-256 hash of the shard identity. This means:

  • Same inputs always produce the same cron schedule (deterministic, no persistence needed)
  • Different shards get different offsets within the interval (staggered, no bandwidth spikes)
  • Generated schedules are surfaced in .status.generatedSchedules for observability
  • Only cron-representable frequencies are accepted (validated up front with clear error messages listing supported examples)

3. Auto-exclusion

When a Keyspace-scope strategy exists for a keyspace (in any VitessBackupSchedule for the same cluster), that keyspace is automatically excluded from Cluster-scope expansion. This enables clean override semantics:

# Cluster-wide default: daily
- name: "cluster-daily"
  frequency: "24h"
  strategies:
    - name: all
      scope: Cluster

# Override for 'customer': every 6 hours
- name: "customer-frequent"
  frequency: "6h"
  strategies:
    - name: customer-all
      scope: Keyspace
      keyspace: "customer"

# Result: 'customer' backed up every 6h. All OTHER keyspaces every 24h. No overlap.

Shard-scope strategies do NOT trigger exclusion -- they are additive (e.g., an extra backup for a hot shard).

Implementation Details

API Changes (pkg/apis/planetscale/v2/)

vitessbackupschedule_types.go:

  • Added BackupScope enum type (Shard, Keyspace, Cluster)
  • Added Scope field to VitessBackupScheduleStrategy (optional, defaults to Shard)
  • Added Frequency field to VitessBackupScheduleTemplate (optional, mutually exclusive with Schedule)
  • Added GeneratedSchedules map[string]string to VitessBackupScheduleStatus
  • Added NextScheduledTimes map[string]*metav1.Time to VitessBackupScheduleStatus
  • Updated NewVitessBackupScheduleStatus() to initialize the new maps

vitessbackupschedule_methods.go:

  • Added ValidateScheduleConfig() -- ensures exactly one of Schedule/Frequency is set; validates frequency is cron-representable
  • Added ValidateBackupFrequency() -- rejects sub-minute, non-whole-minute, non-cron-divisible frequencies with actionable error messages
  • Added ValidateStrategies() -- validates Scope/Keyspace/Shard field consistency
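
For illustration only, the validation rules above might look roughly like the sketch below. The function names are lower-cased stand-ins, not the methods added in this PR, and the divisibility rule is an assumption inferred from the accepted (30m, 1h, 6h, 24h) and rejected (45m, 90m, 48h) examples listed in this description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateBackupFrequency is a hypothetical stand-in for ValidateBackupFrequency.
// Assumed rule: whole minutes only; below one hour the interval must divide 60,
// otherwise it must be a whole number of hours that divides 24.
func validateBackupFrequency(freq string) error {
	d, err := time.ParseDuration(freq)
	if err != nil {
		return fmt.Errorf("invalid frequency %q: %w", freq, err)
	}
	if d < time.Minute {
		return errors.New("frequency must be at least 1m")
	}
	if d%time.Minute != 0 {
		return errors.New("frequency must be a whole number of minutes")
	}
	minutes := int(d / time.Minute)
	switch {
	case minutes < 60 && 60%minutes == 0:
		return nil
	case minutes%60 == 0 && minutes <= 1440 && 24%(minutes/60) == 0:
		return nil
	}
	return fmt.Errorf("frequency %q is not cron-representable; try 30m, 1h, 6h, or 24h", freq)
}

// validateScheduleConfig requires exactly one of schedule or frequency.
func validateScheduleConfig(schedule, frequency string) error {
	if (schedule == "") == (frequency == "") {
		return errors.New("exactly one of schedule or frequency must be set")
	}
	if frequency != "" {
		return validateBackupFrequency(frequency)
	}
	return nil
}

func main() {
	for _, f := range []string{"30m", "6h", "24h", "45m", "90m", "48h", "30s"} {
		fmt.Printf("%-4s -> %v\n", f, validateBackupFrequency(f))
	}
}
```

The cron string itself is not validated in this sketch; only the mutual exclusivity and the frequency shape are checked.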

Schedule Generator (pkg/controller/vitessbackupschedule/schedule_generator.go) -- NEW

generateCronFromFrequency(frequency, cluster, keyspace, shard, scheduleName):

  • Hashes sha256(cluster|keyspace|shard|scheduleName) to produce a deterministic uint64 offset
  • Converts frequency to total minutes, computes offset = hash % totalMinutes
  • Generates appropriate cron expression via cronFromInterval():
    • Daily (1440m): "MM HH * * *"
    • Multi-hour (e.g. 360m/6h): "MM HH/step * * *"
    • Hourly (60m): "MM * * * *"
    • Sub-hourly (e.g. 30m): "MM/step * * * *"
  • Validates all output is parseable by cron.ParseStandard()
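
A minimal sketch of the hashing and cron-shaping steps above, under stated assumptions: the hash is taken from the first 8 bytes of the SHA-256 digest in big-endian order (the description does not say which bytes are used), the interval is passed in minutes and has already been validated, and `generateCron` is a simplified stand-in for generateCronFromFrequency.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// cronFromInterval maps an interval and an offset (both in minutes, with
// 0 <= offset < interval) to a cron expression. The interval is assumed to
// be cron-representable and non-zero.
func cronFromInterval(interval, offset uint64) string {
	minute, hour := offset%60, offset/60
	switch {
	case interval == 1440: // daily: "MM HH * * *"
		return fmt.Sprintf("%d %d * * *", minute, hour)
	case interval > 60: // multi-hour: "MM HH/step * * *"
		return fmt.Sprintf("%d %d/%d * * *", minute, hour, interval/60)
	case interval == 60: // hourly: "MM * * * *"
		return fmt.Sprintf("%d * * * *", minute)
	default: // sub-hourly: "MM/step * * * *"
		return fmt.Sprintf("%d/%d * * * *", minute, interval)
	}
}

// generateCron derives a deterministic offset from the shard identity and
// turns it into a cron expression for the given interval.
func generateCron(intervalMinutes uint64, cluster, keyspace, shard, scheduleName string) string {
	key := strings.Join([]string{cluster, keyspace, shard, scheduleName}, "|")
	sum := sha256.Sum256([]byte(key))
	hash := binary.BigEndian.Uint64(sum[:8]) // assumption: first 8 bytes, big-endian
	return cronFromInterval(intervalMinutes, hash%intervalMinutes)
}

func main() {
	// The cluster, keyspace, and schedule names here are made up.
	for _, shard := range []string{"-80", "80-"} {
		fmt.Println(shard, "=>", generateCron(360, "example", "customer", shard, "customer-frequent"))
	}
}
```

Because the offset depends only on the inputs, re-running the generator after a controller restart yields the same schedule without storing anything.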

Controller Changes (pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go)

reconcileStrategies() -- modified to:

  1. Validate schedule config and strategy consistency up front
  2. Build a strategyExpansionContext (caches all keyspace/shard data and excluded keyspaces once per reconcile)
  3. Call expandStrategy() for each strategy to get per-shard strategies
  4. Reconcile each expanded strategy
  5. Clean up stale LastScheduledTimes, GeneratedSchedules, and NextScheduledTimes entries for strategies that no longer exist
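
Step 5 can be pictured with the small sketch below. `pruneStale` is a hypothetical helper, not a function from this PR; it assumes each status map is keyed by expanded strategy name, as the generated names elsewhere in this description suggest.

```go
package main

import "fmt"

// pruneStale deletes every entry whose key is not a live strategy name.
// Deleting from a map while ranging over it is safe in Go.
func pruneStale[V any](m map[string]V, live map[string]struct{}) {
	for name := range m {
		if _, ok := live[name]; !ok {
			delete(m, name)
		}
	}
}

func main() {
	// Strategy names after the current expansion (illustrative values).
	live := map[string]struct{}{"all-commerce-x-x": {}}

	generated := map[string]string{
		"all-commerce-x-x":  "5 2 * * *",
		"all-customer-x-80": "17 4 * * *", // shard no longer exists
	}
	pruneStale(generated, live)
	fmt.Println(generated)
}
```

The same helper would apply to each of the three status maps, since it is generic over the value type.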

buildStrategyExpansionContext() -- new method:

  • Pre-fetches all keyspaces/shards and excluded keyspaces once, avoiding redundant API calls when multiple strategies need the same data

expandStrategy() -- new method:

  • Shard scope: returns strategy as-is
  • Keyspace scope: uses cached shards from expansion context, returns one strategy per shard with unique names ("{base}-{keyspace}-{shardSafe}")
  • Cluster scope: iterates cached keyspaces, skips excluded ones, expands remaining keyspaces
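
The three cases above can be sketched as follows. This is a simplified model rather than the controller code: the types are trimmed down, scopes are plain strings, and `shardSafe` is a hypothetical helper that replaces empty key range bounds with "x" (an assumption about how the `{shardSafe}` part of the name is derived).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Strategy struct {
	Name, Scope, Keyspace, Shard string
}

// expansionContext caches shard names per keyspace plus the excluded set.
type expansionContext struct {
	shardsByKeyspace map[string][]string
	excluded         map[string]bool
}

// shardSafe is a hypothetical naming helper: "-80" becomes "x-80".
func shardSafe(shard string) string {
	start, end, ok := strings.Cut(shard, "-")
	if !ok {
		return shard
	}
	if start == "" {
		start = "x"
	}
	if end == "" {
		end = "x"
	}
	return start + "-" + end
}

// expandKeyspace returns one Shard-scope strategy per cached shard.
func expandKeyspace(s Strategy, keyspace string, ec expansionContext) []Strategy {
	var out []Strategy
	for _, shard := range ec.shardsByKeyspace[keyspace] {
		out = append(out, Strategy{
			Name:     fmt.Sprintf("%s-%s-%s", s.Name, keyspace, shardSafe(shard)),
			Scope:    "Shard",
			Keyspace: keyspace,
			Shard:    shard,
		})
	}
	return out
}

func expandStrategy(s Strategy, ec expansionContext) []Strategy {
	switch s.Scope {
	case "Keyspace":
		return expandKeyspace(s, s.Keyspace, ec)
	case "Cluster":
		keyspaces := make([]string, 0, len(ec.shardsByKeyspace))
		for ks := range ec.shardsByKeyspace {
			if !ec.excluded[ks] {
				keyspaces = append(keyspaces, ks)
			}
		}
		sort.Strings(keyspaces) // stable output regardless of map order
		var out []Strategy
		for _, ks := range keyspaces {
			out = append(out, expandKeyspace(s, ks, ec)...)
		}
		return out
	default: // "Shard" or empty: returned as-is
		return []Strategy{s}
	}
}

func main() {
	ec := expansionContext{
		shardsByKeyspace: map[string][]string{"commerce": {"-"}, "customer": {"-80", "80-"}},
		excluded:         map[string]bool{"customer": true},
	}
	for _, s := range expandStrategy(Strategy{Name: "all", Scope: "Cluster"}, ec) {
		fmt.Println(s.Name, s.Keyspace, s.Shard)
	}
}
```

Because the expansion reads from the cached shard list each reconcile, a reshard changes the output without any change to the strategy itself.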

getExcludedKeyspaces() -- new method:

  • Scans all strategies in the current VitessBackupSchedule
  • Lists all other VitessBackupSchedule objects in the same namespace with the same cluster label
  • Collects keyspace names from any Keyspace-scope strategies
  • Returns as map[string]bool set
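
As a rough model of the exclusion scan, the sketch below collects keyspaces from Keyspace-scope strategies across a list of schedules for one cluster. The `Schedule` and `Strategy` types are simplified stand-ins, the caller is assumed to pass the current schedule together with its peers, and the suspended check mirrors the `scheduleSuspended(sched)` condition visible in the reviewed diff.

```go
package main

import "fmt"

type Strategy struct{ Name, Scope, Keyspace string }

type Schedule struct {
	Name, Cluster string
	Suspended     bool
	Strategies    []Strategy
}

// excludedKeyspaces returns the set of keyspaces that some Keyspace-scope
// strategy already covers for the given cluster. Shard-scope strategies are
// additive and never contribute an exclusion.
func excludedKeyspaces(cluster string, schedules []Schedule) map[string]bool {
	excluded := map[string]bool{}
	for _, sched := range schedules {
		if sched.Cluster != cluster || sched.Suspended {
			continue
		}
		for _, st := range sched.Strategies {
			if st.Scope == "Keyspace" {
				excluded[st.Keyspace] = true
			}
		}
	}
	return excluded
}

func main() {
	// Mirrors the YAML example above; the cluster name is made up.
	schedules := []Schedule{
		{Name: "cluster-daily", Cluster: "example",
			Strategies: []Strategy{{Name: "all", Scope: "Cluster"}}},
		{Name: "customer-frequent", Cluster: "example",
			Strategies: []Strategy{{Name: "customer-all", Scope: "Keyspace", Keyspace: "customer"}}},
	}
	fmt.Println(excludedKeyspaces("example", schedules))
}
```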

reconcileStrategy() -- modified to:

  • Compute effective cron schedule from Frequency via generateCronFromFrequency() when set
  • Store generated schedule in status.generatedSchedules
  • Populate status.nextScheduledTimes for all code paths

shardReadyForScheduledBackup() -- new function:

  • Checks that the shard has a primary, has an initial backup, has tablets, and all tablets are ready
  • Used to skip scheduled backups for shards still bootstrapping (returns a retryable error instead of a terminal one)
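
One way to picture the readiness gate is sketched below. The `shardState` struct, the sentinel error, and `isRetryable` are all hypothetical; the PR only states that the four conditions are checked and that the failure is retryable rather than terminal.

```go
package main

import (
	"errors"
	"fmt"
)

// errShardNotReady marks a condition that should be retried on a later
// reconcile instead of being treated as a terminal failure.
var errShardNotReady = errors.New("shard not ready for scheduled backup")

// shardState is a simplified view of the shard status fields involved.
type shardState struct {
	HasPrimary       bool
	HasInitialBackup bool
	Tablets          int
	ReadyTablets     int
}

func shardReadyForScheduledBackup(s shardState) error {
	switch {
	case !s.HasPrimary:
		return fmt.Errorf("%w: no primary", errShardNotReady)
	case !s.HasInitialBackup:
		return fmt.Errorf("%w: no initial backup", errShardNotReady)
	case s.Tablets == 0:
		return fmt.Errorf("%w: no tablets", errShardNotReady)
	case s.ReadyTablets < s.Tablets:
		return fmt.Errorf("%w: %d of %d tablets ready", errShardNotReady, s.ReadyTablets, s.Tablets)
	}
	return nil
}

// isRetryable reports whether err wraps the not-ready sentinel.
func isRetryable(err error) bool {
	return errors.Is(err, errShardNotReady)
}

func main() {
	bootstrapping := shardState{HasPrimary: true, Tablets: 3, ReadyTablets: 3}
	err := shardReadyForScheduledBackup(bootstrapping)
	fmt.Println(err, "retryable:", isRetryable(err))
}
```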

getNextSchedule() -- refactored to accept cron schedule string as parameter instead of reading from vbsc.Spec.Schedule

Other Changes

pkg/operator/vitessbackup/schedule.go:

  • Updated NewVitessBackupSchedule() nil check to accept Frequency as alternative to Schedule

.buildkite/pipeline.yml:

  • Added backup-schedule-keyspace-test CI step to run the new e2e test in Buildkite

test/endtoend/operator/operator-latest.yaml:

  • Updated embedded CRDs to include new frequency, scope, generatedSchedules, and nextScheduledTimes fields

test/endtoend/operator/102_initial_cluster_keyspace_backup_schedule.yaml and test/endtoend/operator/401_scheduled_backups.yaml:

  • Updated Vitess images from mysql80 to mysql84

Generated files (via make generate):

  • deploy/crds/planetscale.com_vitessbackupschedules.yaml -- CRD updated
  • deploy/crds/planetscale.com_vitessclusters.yaml -- CRD updated (schedules are embedded in VitessCluster)
  • pkg/apis/planetscale/v2/zz_generated.deepcopy.go -- deepcopy for new status maps
  • docs/api.md, docs/api/index.html -- API docs regenerated

Backward Compatibility

All changes are backward compatible:

  • Scope defaults to Shard when omitted, preserving existing per-shard behavior
  • Schedule field remains supported; Frequency is an alternative
  • Existing VitessBackupSchedule objects with explicit keyspace/shard strategies work unchanged
  • The existing backup-schedule-test e2e test passes without modification

Testing

Unit Tests (26 new)

schedule_generator_test.go (11 tests):

  • Determinism: same inputs always produce same cron
  • Distribution: different shards produce different offsets
  • Common intervals: 30m, 1h, 2h, 4h, 6h, 8h, 12h, 24h all produce valid cron
  • Sub-minute rejection: frequencies < 1m return error
  • Unsupported interval rejection: non-cron-representable frequencies return error
  • All outputs parseable: exhaustive test across many intervals and shard names
  • Specific interval tests: daily, 6-hourly, hourly, sub-hourly cron expressions verified
  • Invalid offset handling

expand_strategy_test.go (12 tests):

  • Shard scope returns unchanged
  • Empty scope defaults to Shard
  • Keyspace scope expands to N strategies with correct names/fields
  • Cluster scope expands across all keyspaces
  • Auto-exclusion: Keyspace-scope in other schedules excludes from Cluster expansion
  • Self-exclusion: Keyspace-scope in same schedule excludes from Cluster expansion
  • Shard-scope does NOT trigger exclusion (additive)
  • Empty keyspace: gracefully returns 0 strategies
  • Naming uniqueness: all expanded strategy names are unique
  • BuildStrategyExpansionContext caches shards by keyspace
  • ShardReadyForScheduledBackup logic
  • CreateJobPod waits for shard bootstrap

vitessbackupschedule_methods_test.go (3 tests):

  • ValidateBackupFrequency: accepts valid frequencies, rejects invalid ones (sub-minute, 45m, 90m, 48h)
  • ValidateScheduleConfig rejects unsupported frequency strings
  • NewVitessBackupScheduleStatus initializes all maps

E2E Tests

backup_schedule_keyspace_test.sh (new):

  • Deploys a 2-keyspace cluster (commerce + customer) with:
    • Cluster-scope schedule (frequency: "1m")
    • Keyspace-scope schedule for customer (frequency: "1m")
  • Verifies VitessBackupSchedule resources are created
  • Verifies generatedSchedules appear in status (frequency-based scheduling works)
  • Verifies auto-exclusion: cluster-scope includes commerce, excludes customer
  • Verifies keyspace-scope includes customer
  • Verifies backup jobs complete for both scopes

And stagger them across shards to avoid resource usage issues.

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added the enhancement and feature labels Mar 8, 2026
mattlord added 4 commits March 8, 2026 22:00
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>

Copilot AI left a comment

Pull request overview

This PR enhances the VitessBackupSchedule feature to reduce operational overhead and bandwidth spikes in large Vitess deployments by introducing scope-based strategy expansion (Shard/Keyspace/Cluster) and frequency-based scheduling that deterministically staggers per-shard cron schedules.

Changes:

  • Add scope support to expand strategies at reconcile-time for Keyspace- and Cluster-wide backups, including auto-exclusion of keyspaces overridden by Keyspace-scope schedules.
  • Add frequency (Go duration) as an alternative to cron schedule, generating deterministic staggered per-shard cron expressions and surfacing them via status.
  • Update CRDs/docs and add unit + e2e coverage for the new scheduling/expansion behavior.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go | Implements strategy expansion, frequency-derived schedules, new status fields, and shard bootstrap gating logic. |
| pkg/controller/vitessbackupschedule/schedule_generator.go | Adds deterministic cron generation from a validated frequency. |
| pkg/apis/planetscale/v2/vitessbackupschedule_types.go | Extends the API with BackupScope, frequency, and new status maps. |
| pkg/apis/planetscale/v2/vitessbackupschedule_methods.go | Adds validation for schedule/frequency mutual exclusivity, supported frequencies, and scope/keyspace/shard consistency. |
| deploy/crds/planetscale.com_vitessbackupschedules.yaml | Updates CRD schema for new fields and relaxed requirements. |
| deploy/crds/planetscale.com_vitessclusters.yaml | Updates embedded schedule schema for new fields and relaxed requirements. |
| pkg/operator/vitessbackup/schedule.go | Allows schedule creation when either schedule or frequency is set. |
| pkg/apis/planetscale/v2/zz_generated.deepcopy.go | Deepcopy support for new status maps. |
| pkg/controller/vitessbackupschedule/schedule_generator_test.go | Unit tests for frequency→cron generation. |
| pkg/controller/vitessbackupschedule/expand_strategy_test.go | Unit tests for strategy expansion and exclusion semantics. |
| pkg/apis/planetscale/v2/vitessbackupschedule_methods_test.go | Unit tests for new validation and status initialization. |
| test/endtoend/backup_schedule_keyspace_test.sh | New e2e test for keyspace/cluster scope schedules and status observability. |
| test/endtoend/operator/102_initial_cluster_keyspace_backup_schedule.yaml | New e2e fixture cluster with cluster+keyspace schedules using frequency. |
| test/endtoend/operator/401_scheduled_backups.yaml | Extends existing e2e fixture with frequency + scope examples. |
| test/endtoend/operator/operator-latest.yaml | Updates embedded CRDs used by e2e tests. |
| docs/api.md / docs/api/index.html | Regenerated API docs for new fields/types. |
| Makefile | Adds backup-schedule-keyspace-test target. |
| .buildkite/pipeline.yml | Adds CI step to run the new e2e test. |

mattlord added 2 commits March 9, 2026 15:37
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord force-pushed the keyspace_level_backups branch from 6ed1a3d to 4fbc25a Mar 9, 2026 19:38
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord marked this pull request as ready for review March 9, 2026 23:57
@mattlord (Collaborator, Author) commented Mar 9, 2026

/cc @bluecrabs007 please let me know if you have any feedback. Thanks! ❤️

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.


Signed-off-by: Matt Lord <mattalord@gmail.com>
@bluecrabs007

/cc @bluecrabs007 please let me know if you have any feedback. Thanks! ❤️

This looks pretty good, thank you!

mattlord requested a review from mhamza15 March 10, 2026 23:55

@nickvanw (Contributor) left a comment

Left three inline comments for the correctness issues I found in the new exclusion/status paths.

Signed-off-by: Matt Lord <mattalord@gmail.com>

@nickvanw (Contributor) left a comment

Left two follow-up comments for the remaining correctness issues.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>

@nickvanw (Contributor) left a comment

Approving to unblock.

Practical take: the happy path here looks good, and the focused tests plus package-level Go tests passed on my side. The remaining comments are edge-case semantics rather than happy-path regressions. If those edge cases matter for the intended support model, they’re worth a follow-up; otherwise I don’t think they should block merge.

if sched.Spec.Cluster != vbsc.Spec.Cluster || scheduleSuspended(sched) {
continue
}
if err := sched.Spec.VitessBackupScheduleTemplate.ValidateScheduleConfig(); err != nil {

Contributor

This now skips peers that fail ValidateScheduleConfig() / ValidateStrategies(), which fixes the obvious invalid-override case. One edge case still remains: a peer schedule that passes those checks but would fail this controller's duplicate-effective-target validation can still contribute exclusions here. In that case the peer can never successfully reconcile, but it can still suppress cluster-scope coverage for the keyspace. If exclusion is meant to mean "this peer can actually take over coverage", it probably needs the same effective-target validity gate as the main reconcile path.

Collaborator Author

Fixed. Peer schedules now have to pass the same duplicate-effective-target preflight before they can contribute exclusions, so a peer that can never reconcile no longer suppresses cluster-scope coverage. Covered by TestExpandStrategy_ClusterScopeDuplicateInvalidPeerDoesNotExclude.

}

func (r *ReconcileVitessBackupsSchedule) validateNoDuplicateEffectiveShardTargets(ctx context.Context, vbsc planetscalev2.VitessBackupSchedule) error {
expansionCtx, err := r.buildLocalStrategyExpansionContext(ctx, vbsc)

Contributor

This duplicate-target preflight only expands against local exclusions. That means a config like:

  • this schedule: Cluster + explicit Shard(customer/-80)
  • peer schedule: Keyspace(customer)

gets rejected here even though the real runtime expansion for this schedule would exclude customer because of the peer override. So this blocks a valid "cluster default + peer keyspace override + extra hot-shard backup" shape. If that additive model is intended, this validator probably needs to reason over the full exclusion context rather than only local exclusions.

Collaborator Author

Fixed. Duplicate-target validation now uses the full exclusion context instead of only local exclusions, so peer keyspace overrides are taken into account before we reject additive shapes. Covered by TestReconcileStrategies_AllowsExplicitShardWhenPeerKeyspaceOverrideExcludesClusterExpansion.
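
To make the difference between local and full exclusions concrete, here is a small sketch of a duplicate-target check. It is an illustration under simplified assumptions, not the validator in this PR: scopes are plain strings, shard data is passed as a map, and `validateNoDuplicateTargets` is a hypothetical name.

```go
package main

import "fmt"

type Strategy struct{ Name, Scope, Keyspace, Shard string }

// validateNoDuplicateTargets expands every strategy against the given
// exclusion set and fails if two strategies end up targeting the same shard.
func validateNoDuplicateTargets(strategies []Strategy, shardsByKeyspace map[string][]string, excluded map[string]bool) error {
	seen := map[string]string{} // "keyspace/shard" -> strategy that claimed it
	claim := func(strategy, keyspace, shard string) error {
		target := keyspace + "/" + shard
		if prev, dup := seen[target]; dup {
			return fmt.Errorf("shard %s is targeted by both %q and %q", target, prev, strategy)
		}
		seen[target] = strategy
		return nil
	}
	for _, s := range strategies {
		switch s.Scope {
		case "Cluster":
			for ks, shards := range shardsByKeyspace {
				if excluded[ks] {
					continue // covered by a Keyspace-scope override
				}
				for _, shard := range shards {
					if err := claim(s.Name, ks, shard); err != nil {
						return err
					}
				}
			}
		case "Keyspace":
			for _, shard := range shardsByKeyspace[s.Keyspace] {
				if err := claim(s.Name, s.Keyspace, shard); err != nil {
					return err
				}
			}
		default: // explicit shard
			if err := claim(s.Name, s.Keyspace, s.Shard); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	shards := map[string][]string{"commerce": {"-"}, "customer": {"-80", "80-"}}
	local := []Strategy{
		{Name: "all", Scope: "Cluster"},
		{Name: "hot", Scope: "Shard", Keyspace: "customer", Shard: "-80"},
	}
	// Local exclusions only: the Cluster expansion still covers customer/-80.
	fmt.Println(validateNoDuplicateTargets(local, shards, nil))
	// Full context: a peer Keyspace(customer) override removes the overlap.
	fmt.Println(validateNoDuplicateTargets(local, shards, map[string]bool{"customer": true}))
}
```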

yydoow commented Mar 20, 2026

Link a feature request #767

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>

@frouioui (Member) left a comment

let's port the changes (to the api, and to the tests) to the vitess repository, otherwise lgtm! good work

mattlord merged commit 987ee32 into main Mar 27, 2026
14 checks passed

Labels

enhancement, feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Keyspace-wide Backups & Flexible Scheduling in vitess-operator

6 participants