shreyas-goenka (Contributor) commented Nov 24, 2025

Changes

This PR implements granular updates for model serving endpoints. Instead of always updating the entire endpoint configuration, the CLI now sends updates only for the fields that have changed. These are:

  • AI Gateway configuration
  • Route configuration
  • Email notifications
  • Tags

Why

Model serving endpoint updates are expensive operations. By sending only the changed fields, we reduce the scope of updates and improve deployment performance.

We also don't have guarantees that these API calls are safe to run in parallel, so the updates are applied one at a time. This matches the TF implementation: https://github.com/databricks/terraform-provider-databricks/blob/b0a2a1c6a1688498fd6a00c64003ef4948da21e8/serving/resource_model_serving.go#L366

Tests

Added comprehensive acceptance tests for various update scenarios:

  • AI Gateway updates
  • Route config updates
  • Email notification updates
  • Tag updates
  • Combined updates (multiple fields at once)

Note: This PR depends on #3995 and should be merged after that PR is merged and this branch is rebased.

Note: These tests are local only because model serving endpoints take a long time (~30 minutes) to spin up and can be flaky. We can confirm though that the TF and DABs behavior matches.

eng-dev-ecosystem-bot (Collaborator) commented Nov 26, 2025

Commit: dbdbdf5

Run: 19760744157

| Env | 🟨 KNOWN | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | | 2 | 371 | 633 | 15:09 |
| 🟨 aws windows | 7 | | 2 | 373 | 631 | 14:41 |
| 💚 aws-ucws linux | | 7 | 2 | 514 | 518 | 18:30 |
| 💚 aws-ucws windows | | 7 | 2 | 516 | 516 | 21:06 |
| 💚 azure linux | | 1 | 4 | 371 | 632 | 16:14 |
| 💚 azure windows | | 1 | 4 | 373 | 630 | 14:26 |
| 💚 azure-ucws linux | | 1 | 4 | 510 | 517 | 17:54 |
| 💚 azure-ucws windows | | 1 | 4 | 512 | 515 | 18:45 |
| 💚 gcp linux | | 1 | 4 | 364 | 636 | 14:46 |
| 💚 gcp windows | | 1 | 4 | 366 | 634 | 14:22 |

9 failing tests:
| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/run/app-with-job | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |

Top 20 slowest tests (at least 2 minutes):
duration env testname
5:46 aws-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:40 gcp windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:31 aws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
5:31 aws-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
5:25 gcp linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
5:21 aws-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
5:17 aws-ucws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:09 gcp linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
5:08 aws linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
4:06 azure linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
3:57 azure linux TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
3:54 azure-ucws linux TestAccept/bundle/resources/synced_database_tables/basic
3:50 azure-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
3:44 azure windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
3:43 azure-ucws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
3:38 azure windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:52 aws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform
2:37 aws windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=direct
2:16 azure-ucws windows TestAccept/bundle/resources/synced_database_tables/basic
2:13 gcp windows TestAccept/bundle/resources/clusters/deploy/update-after-create/DATABRICKS_BUNDLE_ENGINE=terraform

- Merged latest main branch which renamed DoUpdateWithChanges to DoUpdate
- DoUpdate now has same signature with changes parameter
- Updated all resource implementations to use DoUpdate
- All model serving endpoint tests still passing
- Added 'type Changes = deployplan.Changes' alias in alert.go (same as main)
- Updated all DoUpdate signatures to use *Changes instead of *deployplan.Changes
- Removed unused deployplan imports where only the alias is used
- Reduces diff with main branch
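
For readers unfamiliar with the alias mentioned in the commit notes above, here is a small sketch of why `type Changes = deployplan.Changes` lets signatures use `*Changes` without any conversion. The `planChanges` type below is a hypothetical stand-in for `deployplan.Changes` so the example stays self-contained, and `countChanges` is invented for illustration.

```go
package main

import "fmt"

// planChanges is a hypothetical stand-in for deployplan.Changes.
type planChanges map[string]string

// Changes is an alias (note the "="), not a new defined type, so Changes and
// planChanges denote the identical type.
type Changes = planChanges

// countChanges accepts *Changes; callers holding a *planChanges can pass it
// directly because both pointer types are the same type.
func countChanges(c *Changes) int {
	if c == nil {
		return 0
	}
	return len(*c)
}

func main() {
	orig := planChanges{"tags": "update"}
	fmt.Println(countChanges(&orig)) // prints 1
}
```

Had `Changes` been declared without the `=`, it would be a distinct type and the call in `main` would need an explicit conversion.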
shreyas-goenka marked this pull request as ready for review November 27, 2025 13:38
trace update_file.py databricks.yml 'catalog_name: "first-inference-catalog"' 'catalog_name: "second-inference-catalog"'

trace $CLI bundle plan -o json > out.plan.$DATABRICKS_BUNDLE_ENGINE.json
trace $CLI bundle deploy

Contributor

Please add 'trace $CLI bundle plan' after deploys to check that there is no drift.

Contributor Author

These tests are local server only, because of the slow speed to spin up an endpoint. Would you prefer if we make one of these cloud or is there value in still having bundle plan anyways?

Contributor

It's valuable locally as well, but more so on cloud.

Perhaps we can enable one of them as CloudSlow?

Contributor Author

Yeah sounds good, can do for one.

var err error

if r.hasFieldChange(changes, "tags") {
	err = r.updateTags(ctx, id, config.Tags)

Contributor

It looks like these updates no longer run in parallel?

Contributor Author

Yes. It's not clear that these methods are safe to run in parallel so I'd rather be more conservative and do what terraform does.
