**File:** `databricks-skills/databricks-mlflow-ml/SKILL.md`

---
name: databricks-mlflow-ml
description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that."
---

# MLflow + Unity Catalog — Classic ML

## Before Writing Any Code

1. **Read `GOTCHAS.md`** — 12 common mistakes that cause silent failures or wasted time
2. **Read `CRITICAL-interfaces.md`** — exact API signatures and the `models:/` URI format

## End-to-End Workflows

Follow the workflow that matches your goal. Each step indicates which reference files to read.

### Workflow 1: Train → Register → Batch Score (most common)

For building a production-shape classic ML model with UC-native lineage. Covers the full path from raw features to predictions in a downstream table.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Create experiment with UC volume artifact_location | `patterns-experiment-setup.md` (Pattern 1) |
| 2 | Train model with signature + input_example | `patterns-training.md` (Patterns 1–3) |
| 3 | Register to Unity Catalog with three-level name | `patterns-uc-registration.md` (Patterns 1–2) |
| 4 | Set `@champion` alias | `patterns-uc-registration.md` (Pattern 3) |
| 5 | Verify registration (Catalog Explorer check) | `patterns-uc-registration.md` (Pattern 4) + `GOTCHAS.md` #5 |
| 6 | Load + score in notebook (Tier 1) | `patterns-batch-inference.md` (Patterns 1–2) |
| 7 | Optional: Lakeflow SDP batch via `spark_udf` | `patterns-batch-inference.md` (Patterns 3–4) |

### Workflow 2: Retrain + Promote (A/B pattern)

For adding a new version of an already-registered model and promoting it without touching downstream loader code.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Train new version, log to same UC model name | `patterns-training.md` (Pattern 4) |
| 2 | Register as new version | `patterns-uc-registration.md` (Pattern 2) |
| 3 | Set `@challenger` alias | `patterns-uc-registration.md` (Pattern 3) |
| 4 | Validate `@challenger` predictions vs `@champion` | `patterns-batch-inference.md` (Pattern 5) |
| 5 | Swap aliases (`@challenger` → `@champion`) | `patterns-uc-registration.md` (Pattern 5) |

Downstream loader code that uses `models:/catalog.schema.model@champion` picks up the new version on next load — no code change needed.
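
The validation in step 4 can be reduced to two small checks before any alias swap. A minimal sketch in plain Python; `agreement_rate`, `should_promote`, and the thresholds are hypothetical names, not MLflow APIs:

```python
# Hypothetical helpers for step 4: compare @challenger output against
# @champion before re-pointing the alias. Thresholds are illustrative.

def agreement_rate(champion_preds, challenger_preds, tol=1e-6):
    """Fraction of rows where the two models agree within tol."""
    if len(champion_preds) != len(challenger_preds):
        raise ValueError("prediction lists must be the same length")
    hits = sum(abs(a - b) <= tol for a, b in zip(champion_preds, challenger_preds))
    return hits / len(champion_preds)


def should_promote(champion_metric, challenger_metric, min_lift=0.0):
    """Swap aliases only when the challenger beats the champion by min_lift."""
    return challenger_metric >= champion_metric + min_lift
```

In practice the two prediction lists come from scoring the same validation frame once with `models:/...@champion` and once with `models:/...@challenger`.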

### Workflow 3: Debugging a Failed Registration or Load

For the two most common support questions: "why did my model go to workspace registry?" and "why does pyfunc.load_model fail?"

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Verify registry URI is set to `databricks-uc` | `GOTCHAS.md` #1 |
| 2 | Verify three-level name | `GOTCHAS.md` #2 |
| 3 | Confirm model appears in Catalog Explorer | `patterns-uc-registration.md` (Pattern 4) |
| 4 | Check `CREATE MODEL` permissions | `GOTCHAS.md` #7 |
| 5 | Diagnose load failures | `GOTCHAS.md` #3, #8, #11 |
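
Steps 1–2 above can be automated as a pre-flight check before `register_model` is ever called. A sketch under stated assumptions: `registration_preflight` is a hypothetical helper, and you would pass it the value of `mlflow.get_registry_uri()` plus your target model name:

```python
def registration_preflight(registry_uri, model_name):
    """Return a list of problems that would send the model to the wrong
    registry (GOTCHAS #1) or fail registration outright (GOTCHAS #2)."""
    problems = []
    if registry_uri != "databricks-uc":
        problems.append(
            f"registry URI is {registry_uri!r}; "
            "call mlflow.set_registry_uri('databricks-uc') first"
        )
    parts = model_name.split(".")
    if len(parts) != 3 or not all(parts):
        problems.append(f"{model_name!r} is not a three-level catalog.schema.name")
    return problems
```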

## Quick Start

The minimum viable path from untrained model to UC-registered, notebook-scored:

```python
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient

# 1. Configure: UC registry + UC volume for artifacts (both required)
mlflow.set_registry_uri("databricks-uc")

# artifact_location can only be set at experiment-creation time;
# mlflow.set_experiment does not accept it, so create-if-missing first
EXPERIMENT = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT) is None:
    mlflow.create_experiment(
        name=EXPERIMENT,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT)

# 2. Train + log
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + alias
MODEL_NAME = "my_catalog.my_schema.my_model"
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load + predict (in any notebook, anywhere)
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = model.predict(X_test)
```

## Why This Skill Exists

Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. The distinction matters because the APIs diverged:

| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

If you're training a forecasting / classification / regression model, registering it to UC, and scoring it in a notebook or Lakeflow pipeline — this skill. If you're evaluating an LLM agent's output quality — evaluation skill. If you're exposing a model behind an HTTP endpoint — model-serving skill.

## Common Issues

| Issue | Solution |
|-------|----------|
| **Model registered but not visible in Catalog Explorer** | Missing `mlflow.set_registry_uri("databricks-uc")`. See `GOTCHAS.md` #1. |
| **`RestException: INVALID_PARAMETER_VALUE` on `register_model`** | Two-level name used. UC requires `catalog.schema.name`. See `GOTCHAS.md` #2. |
| **Experiment creation fails with storage errors** | Missing `artifact_location` pointing at a UC volume. See `GOTCHAS.md` #4. |
| **`PERMISSION_DENIED: CREATE MODEL`** | The user (or service principal) needs `CREATE MODEL ON SCHEMA <schema>`. See `GOTCHAS.md` #7. |
| **`pyfunc.load_model` returns but `predict()` fails** | Signature wasn't logged; inputs don't coerce. See `GOTCHAS.md` #8. |
| **Agent proposes `ai_query` for batch inference** | Wrong primitive — that requires a serving endpoint. Use `pyfunc.load_model` or `spark_udf`. See `GOTCHAS.md` #9. |

## Reference Files

- [`GOTCHAS.md`](references/GOTCHAS.md) — 12 common mistakes + fixes
- [`CRITICAL-interfaces.md`](references/CRITICAL-interfaces.md) — API signatures + `models:/` URI format
- [`patterns-experiment-setup.md`](references/patterns-experiment-setup.md) — experiment creation with UC volume artifact_location
- [`patterns-training.md`](references/patterns-training.md) — logging models with signature + input_example + autologging
- [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion
- [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger
- [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points

## Runtime Compatibility

Patterns verified against **MLflow 3.11** on **Lakeflow SDP serverless compute version 5** (default at time of writing). All APIs used (`set_registry_uri`, `log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf`) are compatible with MLflow 2.16+ as well, so the patterns work on older classic Databricks Runtimes that still ship 2.x. Where 3.x behaviour diverges (e.g., `artifact_path` deprecation → use `name=`), GOTCHAS.md calls it out.
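
The 2.x/3.x divergence called out above can be handled with a small version gate. A sketch only; `log_model_path_kwarg` is an illustrative helper name, not an MLflow API:

```python
def log_model_path_kwarg(mlflow_version: str) -> str:
    """Pick the log_model keyword for the model path by MLflow major version.

    MLflow 3.x deprecates artifact_path= in favour of name= (see GOTCHAS.md);
    2.x only understands artifact_path=.
    """
    major = int(mlflow_version.split(".")[0])
    return "name" if major >= 3 else "artifact_path"

# e.g.: kwargs = {log_model_path_kwarg(mlflow.__version__): "model"}
```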
# CRITICAL-interfaces — Exact API signatures

The minimum set of APIs that every classic-ML + UC workflow touches. Copy-pasteable, with the exact arguments that matter.

---

## Registry URI configuration

```python
mlflow.set_registry_uri("databricks-uc") # Call at the start of every session
mlflow.get_registry_uri() # Returns "databricks-uc" if set correctly
```

**Must be called BEFORE** any `register_model` or `load_model` call. Calling it repeatedly is safe (idempotent).

---

## Experiment creation with UC volume artifact_location

```python
experiment_name = "/Users/<email>/<experiment_name>"
# artifact_location is only honoured at creation time;
# mlflow.set_experiment does not accept it
if mlflow.get_experiment_by_name(experiment_name) is None:
    mlflow.create_experiment(
        name=experiment_name,
        artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>",
    )
mlflow.set_experiment(experiment_name)
```

**`artifact_location` is required** for UC-enforced workspaces. The volume must exist:

```sql
CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.<volume>;
```

---

## `models:/` URI format

All load / deploy / spark_udf calls use this URI. **One format to memorize:**

```
models:/<catalog>.<schema>.<model_name>@<alias>
```

Examples:
```
models:/my_catalog.my_schema.grocery_forecaster@champion
models:/my_catalog.my_schema.grocery_forecaster@challenger
```

**Avoid** these forms (legacy or not UC-native):
```
models:/grocery_forecaster/3 # workspace registry, version number
models:/my_schema.grocery_forecaster/3 # invalid in UC
```
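
One way to avoid the legacy forms entirely is to build the URI from its parts. A minimal sketch; `uc_model_uri` is a hypothetical helper, not an MLflow function:

```python
def uc_model_uri(catalog: str, schema: str, model: str, alias: str = "champion") -> str:
    """Build a UC-native models:/ URI; raises if any part is empty."""
    if not all((catalog, schema, model, alias)):
        raise ValueError("catalog, schema, model, and alias must all be non-empty")
    return f"models:/{catalog}.{schema}.{model}@{alias}"
```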

---

## Model logging (sklearn-flavored)

```python
mlflow.sklearn.log_model(
    sk_model=<fitted_estimator_or_pipeline>,
    artifact_path="model",                  # convention — keep as "model"
    signature=<Signature>,                  # REQUIRED — use infer_signature()
    input_example=<pandas_DataFrame>,       # REQUIRED — 5 real rows
    registered_model_name=None,             # leave None; register separately (cleaner)
    code_paths=<optional_list_of_dependency_files>,
    extra_pip_requirements=<optional_list>, # only if custom deps beyond environment
)
```

**Signature inference:**
```python
from mlflow.models import infer_signature
signature = infer_signature(X_train, model.predict(X_train[:5]))
```

**Other flavors with identical signature:**
- `mlflow.xgboost.log_model(xgb_model=..., ...)`
- `mlflow.pytorch.log_model(pytorch_model=..., ...)`
- `mlflow.tensorflow.log_model(model=..., ...)`
- `mlflow.pyfunc.log_model(python_model=..., artifact_path=..., ...)` — for custom PythonModel wrappers

---

## Explicit registration

```python
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",      # "runs:/<run_id>/<artifact_path>"
    name="<catalog>.<schema>.<model_name>", # three-level, not optional
    tags=<optional_dict>,
)
# result.name: str — fully qualified name
# result.version: str — newly-created version (e.g., "1", "2")
```

---

## Alias management

```python
from mlflow import MlflowClient
client = MlflowClient()

# Set (creates if missing, moves if exists)
client.set_registered_model_alias(
    name="<catalog>.<schema>.<model_name>",
    alias="champion",           # or "challenger", or custom
    version="<version_number>", # accepts str or int
)

# Get current alias mapping
model = client.get_registered_model("<catalog>.<schema>.<model_name>")
print(model.aliases) # {"champion": "3", "challenger": "4"}

# Delete
client.delete_registered_model_alias(
    name="<catalog>.<schema>.<model_name>",
    alias="challenger",
)
```
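
The set/get/delete calls above compose into a single promotion step. A sketch, assuming `client` is an `MlflowClient` with the UC registry URI already set; the helper name `promote_challenger` is illustrative:

```python
def promote_challenger(client, name):
    """Re-point @champion at the version currently tagged @challenger,
    then retire the @challenger alias. `client` is an MlflowClient."""
    challenger = client.get_model_version_by_alias(name, "challenger")
    client.set_registered_model_alias(name, "champion", challenger.version)
    client.delete_registered_model_alias(name, "challenger")
    return challenger.version
```

Downstream loaders using `models:/<name>@champion` pick up the promoted version on their next load.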

---

## Loading — notebook / single-node

```python
model = mlflow.pyfunc.load_model(
    model_uri="models:/<catalog>.<schema>.<model_name>@champion",
)

# Predict on a pandas DataFrame matching the signature
predictions = model.predict(features_df)
```

**Returns:** `mlflow.pyfunc.PyFuncModel`, regardless of the original flavor. Use `.metadata.signature` to inspect the expected input/output schema.

---

## Loading — distributed / Lakeflow SDP

```python
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/<catalog>.<schema>.<model_name>@champion",
    result_type="double", # or "array<double>" for multi-output
    env_manager="local",  # "local" | "virtualenv" | "conda"
)

# Apply to a Spark DataFrame
df_with_predictions = df.withColumn(
    "prediction",
    predict_udf("feature_a", "feature_b", "feature_c"),
)
```

**Construct ONCE at module scope** in Lakeflow pipelines. See `GOTCHAS.md` #11.

---

## Model introspection

```python
from mlflow.models import get_model_info

info = get_model_info("models:/<catalog>.<schema>.<model_name>@champion")
info.signature # ModelSignature with inputs/outputs
info.flavors # {"sklearn": {...}, "python_function": {...}}
info.utc_time_created
info.model_uuid
```

Useful when debugging load-vs-predict mismatches.

---

## Run + experiment queries (introspection)

```python
runs = mlflow.search_runs(
    experiment_names=["/Users/me@company.com/forecasting"],
    filter_string="metrics.r2 > 0.8",
    order_by=["metrics.r2 DESC"],
    max_results=5,
)
# Returns a pandas DataFrame with run_id, metrics, params, etc.

best_run_id = runs.iloc[0]["run_id"]
```

---

## SQL introspection (UC-native)

```sql
-- Does the model exist and which aliases are set?
DESCRIBE MODEL <catalog>.<schema>.<model_name>;

-- List all model versions
SHOW MODEL VERSIONS ON MODEL <catalog>.<schema>.<model_name>;

-- Check grants
SHOW GRANTS ON MODEL <catalog>.<schema>.<model_name>;
SHOW GRANTS ON SCHEMA <catalog>.<schema>;
```

---

## What's NOT in this skill

If you see these in code, you're likely in the wrong skill:

| API | Belongs in |
|-----|------------|
| `mlflow.genai.evaluate(...)` | `databricks-mlflow-evaluation` |
| `@scorer` decorator, `GuidelinesJudge`, etc. | `databricks-mlflow-evaluation` |
| `databricks.sdk.service.serving.EndpointCoreConfigInput` | `databricks-model-serving` |
| `ai_query('<custom-uc-model>', ...)` | Wrong pattern — use `pyfunc.load_model` or `spark_udf` instead (see `GOTCHAS.md` #9) |
| `transition_model_version_stage(...)` | Deprecated — use aliases (see `GOTCHAS.md` #6) |