[Core feature] Add OAuth M2M and OIDC Workload Identity Federation Authentication Support to Databricks Connector #7319

@rohitrsh

Description

Motivation: Why do you think this is important?

Extend the Databricks connector (flytekitplugins-spark) with OAuth Machine-to-Machine (M2M) and OIDC Workload Identity Federation authentication, the two non-legacy authentication paths Databricks now recommends for service-to-service traffic. The connector currently only supports Personal Access Tokens (PATs).

Motivation

Databricks has marked PAT a legacy auth method

Databricks now classifies workspace-level PATs as legacy in their official documentation:

"Where possible, Databricks recommends using OAuth instead of PATs for user account authentication because OAuth provides stronger security." Source: Authenticate with Databricks personal access tokens (legacy)

Operational implications of staying on PAT:

  • 90-day auto-revocation: "Databricks automatically revokes PATs that haven't been used for 90 days."
  • Per-workspace cap: "A user can create up to 600 PATs per workspace."
  • No central rotation: PATs can't be rotated through an identity provider; every secret is a separate operational chore.
  • Weaker auditability: PATs are tied to a single workspace user/SP, with no IdP-level lineage on each call.

Goal of this issue

Bring the Databricks connector in line with Databricks' current auth recommendations by adding OAuth M2M (client-credentials flow) and OIDC Workload Identity Federation (token exchange), while preserving everything that the prior multi-tenant PAT work in flyteorg/flytekit#3394 delivered:

  • Per-namespace identity isolation (each Flyte workflow project federates as a different Databricks Service Principal, enabling per-project Unity Catalog access controls).
  • Backwards-compatible fallback to existing PAT setups.
  • Zero workflow-code changes for adoption: operators flip auth modes in connector config, workflow authors don't change a line.

Related prior work

  • flyteorg/flyte#6911: Databricks Serverless Compute support (separate track, also in flight).
  • flyteorg/flytekit#3394: Multi-tenant Databricks PAT via cross-namespace K8s secrets (merged Mar 10, 2026). This PR is the multi-tenancy baseline that the new auth modes preserve.
  • flyteorg/flytekit#3392: Databricks Serverless support (merged earlier, related task-config surface).

Proposed Changes

1. Auth strategy abstraction

Introduce a small strategy module (databricks_auth.py) inside the spark plugin that owns:

  • Resolution of the active auth type from connector env vars or per-task config (with auto-detection when unset).
  • Token acquisition for each strategy (PAT, OAuth M2M, OIDC Model 1, OIDC Model 2).
  • Async-safe in-memory token cache with TTL and a pre-expiry refresh buffer.
  • Exponential backoff with jitter on token-endpoint calls.

The connector continues to call a single get_header(...) boundary; the strategy underneath is replaceable.
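
As a rough illustration of the cache bullet above, the following is a minimal sketch of an async-safe in-memory token cache with a TTL and a pre-expiry refresh buffer. It is not the code from databricks_auth.py; the names `TokenCache`, `CachedToken`, `refresh_buffer_s`, and the `fetch` callback shape are assumptions made for this example.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Tuple


@dataclass
class CachedToken:
    value: str
    expires_at: float  # absolute time on the cache's clock, in seconds


class TokenCache:
    """Illustrative in-memory token cache (hypothetical, not the plugin's class)."""

    def __init__(self, refresh_buffer_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._refresh_buffer_s = refresh_buffer_s
        self._clock = clock
        self._tokens: Dict[str, CachedToken] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    async def get(self, key: str,
                  fetch: Callable[[], Awaitable[Tuple[str, float]]]) -> str:
        """Return a cached token; call fetch() -> (token, ttl_seconds) when stale."""
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:  # concurrent callers for one key share a single fetch
            cached = self._tokens.get(key)
            # Treat the token as stale once it is within the refresh buffer.
            if cached and self._clock() < cached.expires_at - self._refresh_buffer_s:
                return cached.value
            value, ttl_s = await fetch()
            self._tokens[key] = CachedToken(value, self._clock() + ttl_s)
            return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, e.g. after a 401 response for a cached token."""
        self._tokens.pop(key, None)
```

Keying the cache per namespace (or per namespace plus auth mode) would keep tenants from sharing tokens; the per-key lock is one way to avoid a burst of concurrent submissions each hitting the token endpoint.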

2. Auth modes

| Auth mode | Credentials | Multi-tenancy granularity | Fallback target |
| --- | --- | --- | --- |
| PAT (existing) | K8s Secret databricks-token in workflow namespace | Per-namespace (per #3394) | FLYTE_DATABRICKS_ACCESS_TOKEN env var |
| OAuth M2M | K8s Secret databricks-oauth (client_id + client_secret) in workflow namespace | Per-namespace | Connector env vars FLYTE_DATABRICKS_CLIENT_ID / FLYTE_DATABRICKS_CLIENT_SECRET |
| OIDC Model 1 (Connector-pod identity) | Connector pod's IRSA-projected JWT + connector-level DATABRICKS_CLIENT_ID | Single shared Databricks SP across all workflows | n/a |
| OIDC Model 2 (Per-namespace ServiceAccount) | K8s TokenRequest minted for an annotated SA in the workflow namespace; SA carries flyte.org/databricks-client-id annotation | Per-namespace (each namespace can federate to a distinct Databricks SP) | OIDC Model 1 if no annotated SA found and Model 1 is configured; otherwise fail loudly |

3. Auto-detection order

If FLYTE_DATABRICKS_AUTH_TYPE is unset, the connector picks the strongest reachable mode at submit time, in this order:

  1. OIDC, if a Model 2 SA is discoverable in the workflow namespace or the Model 1 prerequisites are met.
  2. M2M, if client_id/client_secret are reachable (namespace secret or connector env).
  3. PAT, as the final fallback for backwards compatibility.

If FLYTE_DATABRICKS_AUTH_TYPE is set explicitly, the connector uses that mode and errors loudly when its prerequisites are missing; no silent identity downgrade.
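
The resolution rules in this section could be sketched roughly as below. This is an illustration only: the `AuthType` enum, the `resolve_auth_type` function, and the `"pat"` value are assumptions for the example (the issue only shows `oauth_m2m` and `oidc_federation` as values), and the caller is assumed to have already worked out which modes have their prerequisites met.

```python
from enum import Enum
from typing import Optional, Set


class AuthType(str, Enum):
    PAT = "pat"  # assumed spelling; not stated in the issue
    OAUTH_M2M = "oauth_m2m"
    OIDC_FEDERATION = "oidc_federation"


def resolve_auth_type(explicit: Optional[str], available: Set[AuthType]) -> AuthType:
    """Pick the auth mode for one submission.

    explicit:  value of FLYTE_DATABRICKS_AUTH_TYPE (None or empty when unset).
    available: modes whose prerequisites are currently met.
    """
    if explicit:
        requested = AuthType(explicit.strip().lower())  # ValueError on unknown values
        if requested not in available:
            # Explicit mode with missing prerequisites: fail, never downgrade.
            raise ValueError(f"auth type {requested.value!r} requested but not configured")
        return requested

    # Auto-detection: strongest reachable mode first.
    for candidate in (AuthType.OIDC_FEDERATION, AuthType.OAUTH_M2M):
        if candidate in available:
            return candidate
    return AuthType.PAT  # final fallback for backwards compatibility
```

For example, `resolve_auth_type(None, {AuthType.OAUTH_M2M, AuthType.PAT})` selects M2M, while `resolve_auth_type("oidc_federation", {AuthType.PAT})` raises instead of quietly falling back to PAT.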

4. OIDC Model 2 discovery (per-namespace tenancy without per-task config)

Operators label/annotate ServiceAccounts in workflow namespaces:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-project-databricks
  namespace: my-project-namespace
  labels:
    flyte.org/databricks-enabled: "true"
  annotations:
    flyte.org/databricks-client-id: "00000000-0000-0000-0000-000000000000"  # Databricks SP application ID
    flyte.org/databricks-audience: "https://my-account.cloud.databricks.com/oidc/v1/token"  # optional

At submit time, the connector lists SAs in the workflow's namespace by label, picks the one carrying the flyte.org/databricks-client-id annotation, and mints a JWT for it via the Kubernetes TokenRequest API. The result is exchanged at the Databricks token endpoint for a workspace access token. Workflow authors write no extra config. Different namespaces can federate to different Databricks SPs in the same connector deployment.

Connector RBAC for Model 2 (get/list on serviceaccounts, plus create on serviceaccounts/token) is documented in the README.
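
The selection step of Model 2 discovery might look roughly like the sketch below. It operates on plain dicts shaped like ServiceAccount manifests so it can run without a cluster; the function name `select_databricks_sa` and the `AmbiguousServiceAccountError` exception are hypothetical, and listing the SAs and minting the token via TokenRequest are deliberately left out.

```python
from typing import List, Optional, Tuple

ENABLED_LABEL = "flyte.org/databricks-enabled"
CLIENT_ID_ANNOTATION = "flyte.org/databricks-client-id"
AUDIENCE_ANNOTATION = "flyte.org/databricks-audience"


class AmbiguousServiceAccountError(RuntimeError):
    """More than one annotated ServiceAccount found in a namespace."""


def select_databricks_sa(
    service_accounts: List[dict],
) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (sa_name, client_id, audience) for the annotated SA, or None.

    service_accounts: ServiceAccount objects of one namespace, as dicts.
    """
    matches = []
    for sa in service_accounts:
        meta = sa.get("metadata") or {}
        labels = meta.get("labels") or {}
        annotations = meta.get("annotations") or {}
        if labels.get(ENABLED_LABEL) != "true":
            continue  # not opted in via label
        client_id = annotations.get(CLIENT_ID_ANNOTATION)
        if client_id:
            # The audience annotation is optional, so it may be None here.
            matches.append((meta["name"], client_id, annotations.get(AUDIENCE_ANNOTATION)))

    if len(matches) > 1:
        names = ", ".join(name for name, _, _ in matches)
        raise AmbiguousServiceAccountError(f"multiple annotated ServiceAccounts: {names}")
    return matches[0] if matches else None
```

Returning `None` on zero matches leaves the caller free to apply the Model 1 fallback or fail loudly, while raising on multiple matches avoids picking an identity arbitrarily.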

5. Backwards compatibility

  • DatabricksJobMetadata and DatabricksV2 task config get additive fields only.
  • Existing PAT deployments continue to work unchanged.
  • Older DatabricksJobMetadata payloads (without the new fields) are still consumed correctly by upgraded connectors.
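
One common way to keep metadata changes additive is to give every new field a default and ignore keys the reader does not know. The sketch below illustrates that pattern with a stand-in dataclass; `JobMetadataSketch`, `auth_type`, and `oauth_client_id` are hypothetical names for this example and are not the actual fields added to DatabricksJobMetadata.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class JobMetadataSketch:
    # Fields assumed to exist before the change.
    databricks_instance: str
    run_id: str
    # Hypothetical additive fields: optional, with defaults, so that
    # payloads written by older connectors still deserialize.
    auth_type: Optional[str] = None
    oauth_client_id: Optional[str] = None

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "JobMetadataSketch":
        # Ignore unknown keys so payloads from newer writers also load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```

With this shape, a payload that carries only `databricks_instance` and `run_id` loads with `auth_type=None`, which the reader can treat as "use the pre-existing PAT behaviour".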

API Examples

Operators flip auth modes via connector env vars (workflow code unchanged)

# PAT (legacy, default; unchanged behaviour from #3394)
# (no extra env vars needed; works as today)

# OAuth M2M
FLYTE_DATABRICKS_AUTH_TYPE=oauth_m2m
FLYTE_DATABRICKS_CLIENT_ID=<sp-application-id>          # connector-level fallback
FLYTE_DATABRICKS_CLIENT_SECRET=<sp-secret>              # connector-level fallback
# Plus per-namespace K8s secret `databricks-oauth` for tenant overrides.

# OIDC Model 1 (connector-pod identity, shared)
FLYTE_DATABRICKS_AUTH_TYPE=oidc_federation
FLYTE_DATABRICKS_CLIENT_ID=<connector-pod-sp-id>
FLYTE_DATABRICKS_OIDC_AUDIENCE=https://...

# OIDC Model 2 (per-namespace, annotation-driven)
FLYTE_DATABRICKS_AUTH_TYPE=oidc_federation
# No connector-level client_id needed; discovered from each namespace's annotated SA.
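
For orientation, the two token requests behind these modes can be sketched as plain request builders. This is an assumption-laden illustration, not the connector's code: the helper names are hypothetical, the endpoint path follows the /oidc/v1/token form that appears in the audience annotation example, and the form fields follow the standard OAuth client-credentials and token-exchange grants with the `all-apis` scope, which should be checked against the Databricks documentation for your deployment.

```python
import base64
from typing import Dict, Tuple


def token_url(databricks_instance: str) -> str:
    """Workspace token endpoint (path assumed from the audience example)."""
    return f"https://{databricks_instance}/oidc/v1/token"


def m2m_request(client_id: str, client_secret: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Headers and form body for an OAuth M2M (client-credentials) request."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {"Authorization": f"Basic {basic}"}
    form = {"grant_type": "client_credentials", "scope": "all-apis"}
    return headers, form


def token_exchange_form(client_id: str, subject_jwt: str) -> Dict[str, str]:
    """Form body for exchanging a Kubernetes-issued JWT (OIDC federation)."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "subject_token": subject_jwt,
        "client_id": client_id,
        "scope": "all-apis",
    }
```

Either form would be sent as an `application/x-www-form-urlencoded` POST to `token_url(...)`; the response's access token and expiry are what a token cache would store.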

Workflow code stays identical across all four modes

from flytekitplugins.spark import DatabricksV2
from flytekit import task

@task(
    task_config=DatabricksV2(
        databricks_conf={...},
        databricks_instance="my-workspace.cloud.databricks.com",
    ),
    container_image="my-image:tag",
)
def my_databricks_task(n: int) -> int:
    ...

Files Changed

| File | Change |
| --- | --- |
| plugins/flytekit-spark/flytekitplugins/spark/databricks_auth.py | NEW. Strategy module: PAT / M2M / OIDC Model 1 / OIDC Model 2, async token cache, retry/backoff, auto-detection, validation |
| plugins/flytekit-spark/flytekitplugins/spark/connector.py | M2M/OIDC integration; list_serviceaccounts_in_k8s helper; persist discovered config in DatabricksJobMetadata |
| plugins/flytekit-spark/flytekitplugins/spark/task.py | New optional task-config fields surfaced for testability; auto-detection means workflow authors don't need them |
| plugins/flytekit-spark/tests/test_databricks_auth.py | NEW. 100+ tests across all four auth modes including discovery, cache, error paths |
| plugins/flytekit-spark/tests/test_databricks_token.py | Adjusted for new auth-resolution boundary; PAT regression coverage retained |
| plugins/flytekit-spark/tests/test_connector.py | Updated for the additive DatabricksJobMetadata fields |
| plugins/flytekit-spark/README.md | New "Databricks Connector Authentication" section: env var table, four-mode walkthrough, RBAC manifests, migration guide |

Testing

Unit tests

100+ test cases covering:

  • Auth resolution / auto-detection: each mode selected correctly when its prerequisites are met; explicit errors when the requested mode is misconfigured (no silent downgrade).
  • PAT path: regression suite from flyteorg/flytekit#3394 retained as-is.
  • M2M: client-credentials flow, per-namespace secret read, fallback to env var, 401-driven cache invalidation on long-running jobs.
  • OIDC Model 1: IRSA token-file read, token exchange, error paths when projected token absent.
  • OIDC Model 2: SA discovery (single match, zero matches, multiple matches with ambiguity error), label filtering, annotation parsing, TTL discovery cache, RBAC failure surfaces.
  • Token cache: TTL behaviour, async-safety under concurrent gets, pre-expiry refresh buffer, 401-driven refresh.
  • API resilience: exponential backoff with jitter on the Databricks token endpoint.
  • Local dev: lazy imports for kubernetes and aiohttp so pyflyte run works without K8s present.
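
The "exponential backoff with jitter" behaviour listed above can be sketched as follows. This uses the full-jitter variant (a uniform delay between zero and a capped exponential ceiling), which is an assumption; the issue does not say which variant the plugin uses, and the names `backoff_delay` and `call_with_retry` as well as the retried exception types are illustrative.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng() * ceiling


async def call_with_retry(op: Callable[[], Awaitable[T]], *, max_attempts: int = 4,
                          sleep: Callable[[float], Awaitable[None]] = asyncio.sleep) -> T:
    """Retry op() on transient errors, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return await op()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Injecting `rng` and `sleep` keeps the behaviour deterministic in unit tests, so the delay ceilings can be asserted without real waiting.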

End-to-end (internal dev EKS cluster)

  • PAT regression: passing.
  • OAuth M2M: passing.
  • OIDC Model 2 (per-namespace, annotation-driven): passing.
  • OIDC Model 1: to be exercised before PR merge.

Migration Path

| Today | Tomorrow |
| --- | --- |
| Workspace-scoped PAT in FLYTE_DATABRICKS_ACCESS_TOKEN | OAuth M2M with client_id / client_secret, OR OIDC federation via per-namespace SA annotations |
| Workspace-scoped Unity Catalog grants | Per-namespace SP, per-namespace UC grants |
| 90-day re-issuance chore | IdP-managed credentials |

Migration is opt-in and additive: existing deployments keep working until operators flip FLYTE_DATABRICKS_AUTH_TYPE.

References

FYI: @kumare3 @pingsutw @machichima

Goal: What should the final outcome look like, ideally?

The full feature has been implemented and tested end-to-end on an internal dev EKS cluster (PAT, M2M, OIDC Model 2 confirmed; Model 1 pending). I'm preparing a PR against flyteorg/flytekit:master that delivers the feature in a single coherent change. This issue is the tracking issue for that PR.

Describe alternatives you've considered

Propose: Link/Inline OR Additional context

Inline above. PR will land shortly with the full diff, RBAC manifests, and migration guide in the spark plugin README.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

  • Yes
