[Core feature] Add OAuth M2M and OIDC Workload Identity Federation Authentication Support to Databricks Connector #7319

### Motivation: Why do you think this is important?

Extend the Databricks connector (`flytekitplugins-spark`) with **OAuth Machine-to-Machine (M2M)** and **OIDC Workload Identity Federation** authentication, the two non-legacy authentication paths Databricks now recommends for service-to-service traffic. The connector currently only supports Personal Access Tokens (PATs).

## Motivation

### Databricks has marked PATs as a legacy auth method

Databricks now classifies workspace-level PATs as **legacy** in their official documentation:

> "Where possible, Databricks recommends using OAuth instead of PATs for user account authentication because OAuth provides stronger security." Source: [Authenticate with Databricks personal access tokens (legacy)](https://docs.databricks.com/aws/en/dev-tools/auth/pat)

Operational implications of staying on PAT:

* **90-day auto-revocation**: "Databricks automatically revokes PATs that haven't been used for 90 days."
* **Per-workspace cap**: "A user can create up to 600 PATs per workspace."
* **No central rotation**: PATs can't be rotated through an identity provider; every secret is a separate operational chore.
* **Weaker auditability**: PATs are tied to a single workspace user/SP, with no IdP-level lineage on each call.

### Goal of this issue

Bring the Databricks connector into the modern Databricks auth posture (**OAuth M2M (client-credentials)** and **OIDC Workload Identity Federation (token exchange)**) while preserving everything that the prior multi-tenant PAT work in [flyteorg/flytekit#3394](https://github.com/flyteorg/flytekit/pull/3394) delivered:

* Per-namespace identity isolation (each Flyte workflow project federates as a different Databricks Service Principal, enabling per-project Unity Catalog access controls).
* Backwards-compatible fallback to existing PAT setups.
* **Zero workflow-code changes** for adoption: operators flip auth modes in connector config, workflow authors don't change a line.

### Related prior work

* [flyteorg/flyte#6911](https://github.com/flyteorg/flyte/issues/6911): Databricks Serverless Compute support (separate track, also in flight).
* [flyteorg/flytekit#3394](https://github.com/flyteorg/flytekit/pull/3394): Multi-tenant Databricks PAT via cross-namespace K8s secrets (merged Mar 10, 2026). This PR is the multi-tenancy baseline that the new auth modes preserve.
* [flyteorg/flytekit#3392](https://github.com/flyteorg/flytekit/pull/3392): Databricks Serverless support (merged earlier, related task-config surface).

## Proposed Changes

### 1. Auth strategy abstraction

Introduce a small strategy module (`databricks_auth.py`) inside the spark plugin that owns:

* Resolution of the active auth type from connector env vars or per-task config (with auto-detection when unset).
* Token acquisition for each strategy (PAT, OAuth M2M, OIDC Model 1, OIDC Model 2).
* Async-safe in-memory token cache with TTL and a pre-expiry refresh buffer.
* Exponential backoff with jitter on token-endpoint calls.

The connector continues to call a single `get_header(...)` boundary; the strategy underneath is replaceable.
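A minimal sketch of the token cache described above, assuming tokens are keyed by a string such as the workflow namespace. `TokenCache`, `CachedToken`, and `refresh_buffer_s` are hypothetical names for illustration, not the contents of `databricks_auth.py`.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Tuple


@dataclass
class CachedToken:
    value: str
    expires_at: float  # deadline on the injected clock, in seconds


class TokenCache:
    """In-memory token cache with a TTL and a pre-expiry refresh buffer."""

    def __init__(self, refresh_buffer_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._refresh_buffer_s = refresh_buffer_s
        self._clock = clock
        self._tokens: Dict[str, CachedToken] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    async def get(self, key: str,
                  fetch: Callable[[], Awaitable[Tuple[str, float]]]) -> str:
        """Return a cached token; call fetch() only if it is missing or near expiry.

        fetch() must return (token, ttl_seconds).
        """
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:  # at most one refresh per key under concurrent callers
            cached = self._tokens.get(key)
            if cached and self._clock() < cached.expires_at - self._refresh_buffer_s:
                return cached.value
            value, ttl_s = await fetch()
            self._tokens[key] = CachedToken(value, self._clock() + ttl_s)
            return value

    def invalidate(self, key: str) -> None:
        """Drop a token, for example after the API answers 401."""
        self._tokens.pop(key, None)
```

Holding a per-key lock across the fetch means concurrent submissions for the same namespace share one token-endpoint call, while different namespaces refresh independently.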

### 2. Auth modes

| Auth mode | Credentials | Multi-tenancy granularity | Fallback target |
| --- | --- | --- | --- |
| **PAT** (existing) | K8s `Secret` `databricks-token` in workflow namespace | Per-namespace (per #3394) | `FLYTE_DATABRICKS_ACCESS_TOKEN` env var |
| **OAuth M2M** | K8s `Secret` `databricks-oauth` (`client_id` + `client_secret`) in workflow namespace | Per-namespace | Connector env vars `FLYTE_DATABRICKS_CLIENT_ID` / `FLYTE_DATABRICKS_CLIENT_SECRET` |
| **OIDC Model 1** (Connector-pod identity) | Connector pod's IRSA-projected JWT + connector-level `FLYTE_DATABRICKS_CLIENT_ID` | Single shared Databricks SP across all workflows | n/a |
| **OIDC Model 2** (Per-namespace ServiceAccount) | K8s `TokenRequest` minted for an annotated SA in the workflow namespace; SA carries `flyte.org/databricks-client-id` annotation | Per-namespace (each namespace can federate to a distinct Databricks SP) | OIDC Model 1 if no annotated SA found and Model 1 is configured; otherwise fail loudly |
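To make the difference between the M2M and OIDC rows concrete, here is a sketch of the two token requests. It only builds the request and sends nothing. `build_token_request` is a hypothetical helper. The form fields follow my reading of the Databricks OAuth docs and RFC 8693 (the `all-apis` scope in particular is an assumption), so verify them against the references before relying on them.

```python
from typing import Dict, Optional, Tuple

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"


def build_token_request(
    databricks_instance: str,
    client_id: str,
    client_secret: Optional[str] = None,
    subject_jwt: Optional[str] = None,
) -> Tuple[str, Dict[str, str], Optional[Tuple[str, str]]]:
    """Return (url, form_fields, basic_auth) for the workspace token endpoint."""
    url = f"https://{databricks_instance}/oidc/v1/token"
    if subject_jwt is not None:
        # OIDC federation (Model 1 or 2): exchange a Kubernetes-issued JWT.
        # No client secret is involved.
        form = {
            "grant_type": TOKEN_EXCHANGE_GRANT,
            "subject_token": subject_jwt,
            "subject_token_type": JWT_TOKEN_TYPE,
            "client_id": client_id,
            "scope": "all-apis",  # assumed scope value
        }
        return url, form, None
    if client_secret is not None:
        # OAuth M2M: client-credentials grant, client authenticated via HTTP Basic.
        form = {"grant_type": "client_credentials", "scope": "all-apis"}
        return url, form, (client_id, client_secret)
    raise ValueError("either client_secret (M2M) or subject_jwt (OIDC) is required")
```

The two OIDC models would differ only in where `subject_jwt` and `client_id` come from: the connector pod's projected token and env var for Model 1, or the `TokenRequest` result and SA annotation for Model 2.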

### 3. Auto-detection order

If `FLYTE_DATABRICKS_AUTH_TYPE` is unset, the connector picks the strongest reachable mode at submit time, in this order:

1. **OIDC** if a Model 2 SA is discoverable in the workflow namespace, or the Model 1 prerequisites are met.
2. **M2M** if `client_id`/`client_secret` are reachable (namespace secret or connector env vars).
3. **PAT** as the final fallback for backwards compatibility.

If `FLYTE_DATABRICKS_AUTH_TYPE` is set explicitly, the connector uses that mode and errors loudly when its prerequisites are missing; no silent identity downgrade.
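The resolution rules above can be sketched as a pure function. `AuthType`, `Reachable`, and `resolve_auth_type` are hypothetical names. The `oauth_m2m` and `oidc_federation` values come from the env var examples in this issue, while `"pat"` as an explicit value is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuthType(str, Enum):
    PAT = "pat"  # assumed spelling; not shown in the env var examples
    OAUTH_M2M = "oauth_m2m"
    OIDC_FEDERATION = "oidc_federation"


@dataclass(frozen=True)
class Reachable:
    """What the connector found at submit time (hypothetical container)."""
    oidc_model2_sa: bool = False
    oidc_model1_ready: bool = False
    m2m_credentials: bool = False
    pat: bool = False


def resolve_auth_type(explicit: Optional[str], found: Reachable) -> AuthType:
    available = {
        AuthType.OIDC_FEDERATION: found.oidc_model2_sa or found.oidc_model1_ready,
        AuthType.OAUTH_M2M: found.m2m_credentials,
        AuthType.PAT: found.pat,
    }
    if explicit:
        # Explicit mode: never downgrade silently; fail if prerequisites are missing.
        requested = AuthType(explicit.strip().lower())  # ValueError on unknown values
        if not available[requested]:
            raise RuntimeError(
                f"auth type {requested.value!r} requested but its prerequisites are missing"
            )
        return requested
    # Auto-detection: strongest reachable mode first.
    for candidate in (AuthType.OIDC_FEDERATION, AuthType.OAUTH_M2M):
        if available[candidate]:
            return candidate
    return AuthType.PAT  # final fallback for backwards compatibility
```

Keeping the decision separate from the Kubernetes and env var lookups makes every branch testable without a cluster.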

### 4. OIDC Model 2 discovery (per-namespace tenancy without per-task config)

Operators label and annotate ServiceAccounts in workflow namespaces:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-project-databricks
  namespace: my-project-namespace
  labels:
    flyte.org/databricks-enabled: "true"
  annotations:
    flyte.org/databricks-client-id: "00000000-0000-0000-0000-000000000000"  # Databricks SP application ID
    flyte.org/databricks-audience: "https://my-account.cloud.databricks.com/oidc/v1/token"  # optional
```

At submit time, the connector lists SAs in the workflow's namespace by label, picks the one carrying the `flyte.org/databricks-client-id` annotation, and mints a JWT for it via the Kubernetes `TokenRequest` API. The result is exchanged at the Databricks token endpoint for a workspace access token. **Workflow authors write no extra config.** Different namespaces can federate to different Databricks SPs in the same connector deployment.

Connector RBAC for Model 2 (`get`/`list` on `serviceaccounts`, plus `create` on `serviceaccounts/token`) is documented in the README.
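The selection step of the discovery flow could look like the sketch below. It works on plain data rather than Kubernetes client objects, so the listing and `TokenRequest` calls are out of scope here. `ServiceAccountInfo`, `FederationTarget`, and `select_federation_target` are hypothetical names. The label and annotation keys are the ones from the manifest above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ENABLED_LABEL = "flyte.org/databricks-enabled"
CLIENT_ID_ANNOTATION = "flyte.org/databricks-client-id"
AUDIENCE_ANNOTATION = "flyte.org/databricks-audience"


@dataclass
class ServiceAccountInfo:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class FederationTarget:
    service_account: str
    client_id: str
    audience: Optional[str]  # None when the optional annotation is absent


def select_federation_target(
    accounts: List[ServiceAccountInfo],
) -> Optional[FederationTarget]:
    """Pick the single labelled SA that carries a client-id annotation.

    Returns None when nothing matches, so the caller can fall back to
    Model 1 or fail. Raises ValueError when the match is ambiguous.
    """
    candidates = [
        sa for sa in accounts
        if sa.labels.get(ENABLED_LABEL) == "true"
        and sa.annotations.get(CLIENT_ID_ANNOTATION)
    ]
    if not candidates:
        return None
    if len(candidates) > 1:
        names = ", ".join(sorted(sa.name for sa in candidates))
        raise ValueError(f"multiple Databricks-enabled ServiceAccounts found: {names}")
    sa = candidates[0]
    return FederationTarget(
        service_account=sa.name,
        client_id=sa.annotations[CLIENT_ID_ANNOTATION],
        audience=sa.annotations.get(AUDIENCE_ANNOTATION),
    )
```

Raising on ambiguity rather than picking the first match keeps the identity a workflow runs as deterministic.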

### 5. Backwards compatibility

* `DatabricksJobMetadata` and `DatabricksV2` task config get additive fields only.
* Existing PAT deployments continue to work unchanged.
* Older `DatabricksJobMetadata` payloads (without the new fields) are still consumed correctly by upgraded connectors.
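A sketch of what "additive fields only" means for payload compatibility, using a stand-in dataclass rather than the real `DatabricksJobMetadata`. The class name and all field names here are assumptions for illustration, and the real serialization path may differ.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class JobMetadataSketch:
    # Fields assumed to exist before this change.
    databricks_instance: str
    run_id: str
    # Additive fields: optional with defaults, so older payloads still load.
    auth_type: Optional[str] = None
    oidc_service_account: Optional[str] = None

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "JobMetadataSketch":
        """Build from a payload, ignoring keys this version does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```

An older payload such as `{"databricks_instance": "...", "run_id": "42"}` loads with the new fields set to `None`, which a connector can treat as "resolve auth as before".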

## API Examples

### Operators flip auth modes via connector env vars (workflow code unchanged)

```bash
# PAT (legacy, default; unchanged behaviour from #3394)
# (no extra env vars needed; works as today)

# OAuth M2M
FLYTE_DATABRICKS_AUTH_TYPE=oauth_m2m
FLYTE_DATABRICKS_CLIENT_ID=<sp-application-id>          # connector-level fallback
FLYTE_DATABRICKS_CLIENT_SECRET=<sp-secret>              # connector-level fallback
# Plus per-namespace K8s secret `databricks-oauth` for tenant overrides.

# OIDC Model 1 (connector-pod identity, shared)
FLYTE_DATABRICKS_AUTH_TYPE=oidc_federation
FLYTE_DATABRICKS_CLIENT_ID=<connector-pod-sp-id>
FLYTE_DATABRICKS_OIDC_AUDIENCE=https://...

# OIDC Model 2 (per-namespace, annotation-driven)
FLYTE_DATABRICKS_AUTH_TYPE=oidc_federation
# No connector-level client_id needed; discovered from each namespace's annotated SA.
```

### Workflow code stays identical across all four modes

```python
from flytekitplugins.spark import DatabricksV2
from flytekit import task

@task(
    task_config=DatabricksV2(
        databricks_conf={...},
        databricks_instance="my-workspace.cloud.databricks.com",
    ),
    container_image="my-image:tag",
)
def my_databricks_task(n: int) -> int:
    ...
```

## Files Changed

| File | Change |
| --- | --- |
| `plugins/flytekit-spark/flytekitplugins/spark/databricks_auth.py` | **NEW**. Strategy module: PAT / M2M / OIDC Model 1 / OIDC Model 2, async token cache, retry/backoff, auto-detection, validation |
| `plugins/flytekit-spark/flytekitplugins/spark/connector.py` | M2M/OIDC integration; `list_serviceaccounts_in_k8s` helper; persist discovered config in `DatabricksJobMetadata` |
| `plugins/flytekit-spark/flytekitplugins/spark/task.py` | New optional task-config fields surfaced for testability; auto-detection means workflow authors don't need them |
| `plugins/flytekit-spark/tests/test_databricks_auth.py` | **NEW**. 100+ tests across all four auth modes including discovery, cache, error paths |
| `plugins/flytekit-spark/tests/test_databricks_token.py` | Adjusted for new auth-resolution boundary; PAT regression coverage retained |
| `plugins/flytekit-spark/tests/test_connector.py` | Updated for the additive `DatabricksJobMetadata` fields |
| `plugins/flytekit-spark/README.md` | New "Databricks Connector Authentication" section: env var table, four-mode walkthrough, RBAC manifests, migration guide |

## Testing

### Unit tests

100+ test cases covering:

* **Auth resolution / auto-detection**: each mode selected correctly when its prerequisites are met; explicit errors when the requested mode is misconfigured (no silent downgrade).
* **PAT path**: regression suite from #3394 retained as-is.
* **M2M**: client-credentials flow, per-namespace secret read, fallback to env var, 401-driven cache invalidation on long-running jobs.
* **OIDC Model 1**: IRSA token-file read, token exchange, error paths when projected token absent.
* **OIDC Model 2**: SA discovery (single match, zero matches, multiple matches with ambiguity error), label filtering, annotation parsing, TTL discovery cache, RBAC failure surfaces.
* **Token cache**: TTL behaviour, async-safety under concurrent gets, pre-expiry refresh buffer, 401-driven refresh.
* **API resilience**: exponential backoff with jitter on the Databricks token endpoint.
* **Local dev**: lazy imports for `kubernetes` and `aiohttp` so `pyflyte run` works without K8s present.
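Of the behaviours listed above, the backoff with jitter on the token endpoint could be sketched as follows. This uses the "full jitter" variant, which is one common choice and not necessarily the one in the PR. `RetryableTokenError`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the default base, cap, and attempt count are placeholders.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RetryableTokenError(Exception):
    """Raised by the caller's operation for transient failures (e.g. 429, 5xx)."""


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


async def call_with_backoff(
    op: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await op(), retrying on RetryableTokenError with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await op()
        except RetryableTokenError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` lets tests assert on the delays without waiting in real time.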

### End-to-end (internal dev EKS cluster)

* PAT regression: passing.
* OAuth M2M: passing.
* OIDC Model 2 (per-namespace, annotation-driven): passing.
* OIDC Model 1: to be exercised before PR merge.

## Migration Path

| Today | Tomorrow |
| --- | --- |
| Workspace-scoped PAT in `FLYTE_DATABRICKS_ACCESS_TOKEN` | OAuth M2M with `client_id` / `client_secret`, OR OIDC federation via per-namespace SA annotations |
| Workspace-scoped Unity Catalog grants | Per-namespace SP, per-namespace UC grants |
| 90-day re-issuance chore | IdP-managed credentials |

Migration is opt-in and additive: existing deployments keep working until operators flip `FLYTE_DATABRICKS_AUTH_TYPE`.

## References

* Databricks PAT (legacy): <https://docs.databricks.com/aws/en/dev-tools/auth/pat>
* Databricks OAuth M2M (client credentials): <https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m>
* Databricks OIDC Workload Identity Federation: <https://docs.databricks.com/aws/en/dev-tools/auth/oauth-federation>
* Kubernetes ServiceAccount projected tokens / `TokenRequest` API: <https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/>
* Prior work, multi-tenant PAT: [flyteorg/flytekit#3394](https://github.com/flyteorg/flytekit/pull/3394)
* Related tracking issue, Databricks Serverless: [flyteorg/flyte#6911](https://github.com/flyteorg/flyte/issues/6911)

FYI: @kumare3 @pingsutw @machichima

### Goal: What should the final outcome look like, ideally?

The full feature has been implemented and tested end-to-end on an internal dev EKS cluster (PAT, M2M, OIDC Model 2 confirmed; Model 1 pending). I'm preparing a PR against `flyteorg/flytekit:master` that delivers the feature in a single coherent change. This issue is the tracking issue for that PR.

### Describe alternatives you've considered

* **Adding only OAuth M2M**: would address the legacy-PAT problem but not the auditability gap (still long-lived secrets).
* **Adding only OIDC Model 1**: secret-less, but collapses to a single shared Databricks SP, breaking the per-namespace UC tenancy goal from #6911 / #3394.
* **A per-task `databricks_oidc_service_account` field on `DatabricksV2`**: was prototyped, but it required workflow authors to know operator-level identity details and violated the zero-workflow-code-changes constraint. Replaced with the annotation-driven discovery design above.

### Propose: Link/Inline OR Additional context

Inline above. PR will land shortly with the full diff, RBAC manifests, and migration guide in the spark plugin README.

### Are you sure this issue hasn't been raised already?

* Yes. The only related issue is the multi-tenancy ask in #6911 (which #3394 partially addressed for PAT) and the serverless track (separate concern).

### Have you read the Code of Conduct?

* Yes




