**File:** `databricks-skills/databricks-mlflow-ml/SKILL.md`

---
name: databricks-mlflow-ml
description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that."
---

# MLflow + Unity Catalog — Classic ML

## Before Writing Any Code

1. **Read `GOTCHAS.md`** — 12 common mistakes that cause silent failures or wasted time
2. **Read `CRITICAL-interfaces.md`** — exact API signatures and the `models:/` URI format

## End-to-End Workflows

Follow the workflow that matches your goal. Each step indicates which reference files to read.

### Workflow 1: Train → Register → Batch Score (most common)

For building a production-shape classic ML model with UC-native lineage. Covers the full path from raw features to predictions in a downstream table.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Create experiment with UC volume artifact_location | `patterns-experiment-setup.md` (Pattern 1) |
| 2 | Train model with signature + input_example | `patterns-training.md` (Patterns 1–3) |
| 3 | Register to Unity Catalog with three-level name | `patterns-uc-registration.md` (Patterns 1–2) |
| 4 | Set `@champion` alias | `patterns-uc-registration.md` (Pattern 3) |
| 5 | Verify registration (Catalog Explorer check) | `patterns-uc-registration.md` (Pattern 4) + `GOTCHAS.md` #5 |
| 6 | Load + score in notebook (Tier 1) | `patterns-batch-inference.md` (Patterns 1–2) |
| 7 | Optional: Lakeflow SDP batch via `spark_udf` | `patterns-batch-inference.md` (Patterns 3–4) |

### Workflow 2: Retrain + Promote (A/B pattern)

For adding a new version of an already-registered model and promoting it without touching downstream loader code.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Train new version, log to same UC model name | `patterns-training.md` (Pattern 4) |
| 2 | Register as new version | `patterns-uc-registration.md` (Pattern 2) |
| 3 | Set `@challenger` alias | `patterns-uc-registration.md` (Pattern 3) |
| 4 | Validate `@challenger` predictions vs `@champion` | `patterns-batch-inference.md` (Pattern 5) |
| 5 | Swap aliases (`@challenger` → `@champion`) | `patterns-uc-registration.md` (Pattern 5) |

Downstream loader code that uses `models:/catalog.schema.model@champion` picks up the new version on next load — no code change needed.
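
The validation in step 4 can be reduced to two small checks before any alias swap. A minimal sketch in plain Python; `agreement_rate`, `should_promote`, and the thresholds are hypothetical names, not MLflow APIs:

```python
# Hypothetical helpers for step 4: compare @challenger output against
# @champion before re-pointing the alias. Thresholds are illustrative.

def agreement_rate(champion_preds, challenger_preds, tol=1e-6):
    """Fraction of rows where the two models agree within tol."""
    if len(champion_preds) != len(challenger_preds):
        raise ValueError("prediction lists must be the same length")
    hits = sum(abs(a - b) <= tol for a, b in zip(champion_preds, challenger_preds))
    return hits / len(champion_preds)


def should_promote(champion_metric, challenger_metric, min_lift=0.0):
    """Swap aliases only when the challenger beats the champion by min_lift."""
    return challenger_metric >= champion_metric + min_lift
```

In practice the two prediction lists come from scoring the same validation frame once with `models:/...@champion` and once with `models:/...@challenger`.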

### Workflow 3: Debugging a Failed Registration or Load

For the two most common support questions: "why did my model go to workspace registry?" and "why does pyfunc.load_model fail?"

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Verify registry URI is set to `databricks-uc` | `GOTCHAS.md` #1 |
| 2 | Verify three-level name | `GOTCHAS.md` #2 |
| 3 | Confirm model appears in Catalog Explorer | `patterns-uc-registration.md` (Pattern 4) |
| 4 | Check `CREATE MODEL` permissions | `GOTCHAS.md` #7 |
| 5 | Diagnose load failures | `GOTCHAS.md` #3, #8, #11 |
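
Steps 1–2 above can be automated as a pre-flight check before `register_model` is ever called. A sketch under stated assumptions: `registration_preflight` is a hypothetical helper, and you would pass it the value of `mlflow.get_registry_uri()` plus your target model name:

```python
def registration_preflight(registry_uri, model_name):
    """Return a list of problems that would send the model to the wrong
    registry (GOTCHAS #1) or fail registration outright (GOTCHAS #2)."""
    problems = []
    if registry_uri != "databricks-uc":
        problems.append(
            f"registry URI is {registry_uri!r}; "
            "call mlflow.set_registry_uri('databricks-uc') first"
        )
    parts = model_name.split(".")
    if len(parts) != 3 or not all(parts):
        problems.append(f"{model_name!r} is not a three-level catalog.schema.name")
    return problems
```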

## Quick Start

The minimum viable path from untrained model to UC-registered, notebook-scored:

```python
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient

# 1. Configure: UC registry + UC volume for artifacts (both required)
mlflow.set_registry_uri("databricks-uc")

# artifact_location can only be set at experiment-creation time;
# mlflow.set_experiment does not accept it, so create-if-missing first
EXPERIMENT = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT) is None:
    mlflow.create_experiment(
        name=EXPERIMENT,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT)

# 2. Train + log
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + alias
MODEL_NAME = "my_catalog.my_schema.my_model"
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load + predict (in any notebook, anywhere)
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = model.predict(X_test)
```

## Why This Skill Exists

Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. The distinction matters because the APIs diverged:

| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

If you're training a forecasting / classification / regression model, registering it to UC, and scoring it in a notebook or Lakeflow pipeline — this skill. If you're evaluating an LLM agent's output quality — evaluation skill. If you're exposing a model behind an HTTP endpoint — model-serving skill.

## Common Issues

| Issue | Solution |
|-------|----------|
| **Model registered but not visible in Catalog Explorer** | Missing `mlflow.set_registry_uri("databricks-uc")`. See `GOTCHAS.md` #1. |
| **`RestException: INVALID_PARAMETER_VALUE` on `register_model`** | Two-level name used. UC requires `catalog.schema.name`. See `GOTCHAS.md` #2. |
| **Experiment creation fails with storage errors** | Missing `artifact_location` pointing at a UC volume. See `GOTCHAS.md` #4. |
| **`PERMISSION_DENIED: CREATE MODEL`** | The user (or service principal) needs `CREATE MODEL ON SCHEMA <schema>`. See `GOTCHAS.md` #7. |
| **`pyfunc.load_model` returns but `predict()` fails** | Signature wasn't logged; inputs don't coerce. See `GOTCHAS.md` #8. |
| **Agent proposes `ai_query` for batch inference** | Wrong primitive — that requires a serving endpoint. Use `pyfunc.load_model` or `spark_udf`. See `GOTCHAS.md` #9. |

## Reference Files

- [`GOTCHAS.md`](references/GOTCHAS.md) — 12 common mistakes + fixes
- [`CRITICAL-interfaces.md`](references/CRITICAL-interfaces.md) — API signatures + `models:/` URI format
- [`patterns-experiment-setup.md`](references/patterns-experiment-setup.md) — experiment creation with UC volume artifact_location
- [`patterns-training.md`](references/patterns-training.md) — logging models with signature + input_example + autologging
- [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion
- [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger
- [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points

## Runtime Compatibility

Patterns verified against **MLflow 3.11** on **Lakeflow SDP serverless compute version 5** (default at time of writing). All APIs used (`set_registry_uri`, `log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf`) are compatible with MLflow 2.16+ as well, so the patterns work on older classic Databricks Runtimes that still ship 2.x. Where 3.x behaviour diverges (e.g., `artifact_path` deprecation → use `name=`), GOTCHAS.md calls it out.
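
The 2.x/3.x divergence called out above can be handled with a small version gate. A sketch only; `log_model_path_kwarg` is an illustrative helper name, not an MLflow API:

```python
def log_model_path_kwarg(mlflow_version: str) -> str:
    """Pick the log_model keyword for the model path by MLflow major version.

    MLflow 3.x deprecates artifact_path= in favour of name= (see GOTCHAS.md);
    2.x only understands artifact_path=.
    """
    major = int(mlflow_version.split(".")[0])
    return "name" if major >= 3 else "artifact_path"

# e.g.: kwargs = {log_model_path_kwarg(mlflow.__version__): "model"}
```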
# CRITICAL-interfaces — Exact API signatures

The minimum set of APIs that every classic-ML + UC workflow touches. Copy-pasteable, with the exact arguments that matter.

---

## Registry URI configuration

```python
mlflow.set_registry_uri("databricks-uc") # Call at the start of every session
mlflow.get_registry_uri() # Returns "databricks-uc" if set correctly
```

**Must be called BEFORE** any `register_model` or `load_model` call. Calling it repeatedly is safe (idempotent).

---

## Experiment creation with UC volume artifact_location

```python
experiment_name = "/Users/<email>/<experiment_name>"
# artifact_location is only honoured at creation time;
# mlflow.set_experiment does not accept it
if mlflow.get_experiment_by_name(experiment_name) is None:
    mlflow.create_experiment(
        name=experiment_name,
        artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>",
    )
mlflow.set_experiment(experiment_name)
```

**`artifact_location` is required** for UC-enforced workspaces. The volume must exist:

```sql
CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.<volume>;
```

---

## `models:/` URI format

All load / deploy / spark_udf calls use this URI. **One format to memorize:**

```
models:/<catalog>.<schema>.<model_name>@<alias>
```

Examples:
```
models:/my_catalog.my_schema.grocery_forecaster@champion
models:/my_catalog.my_schema.grocery_forecaster@challenger
```

**Avoid** these forms (legacy or not UC-native):
```
models:/grocery_forecaster/3 # workspace registry, version number
models:/my_schema.grocery_forecaster/3 # invalid in UC
```
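
One way to avoid the legacy forms entirely is to build the URI from its parts. A minimal sketch; `uc_model_uri` is a hypothetical helper, not an MLflow function:

```python
def uc_model_uri(catalog: str, schema: str, model: str, alias: str = "champion") -> str:
    """Build a UC-native models:/ URI; raises if any part is empty."""
    if not all((catalog, schema, model, alias)):
        raise ValueError("catalog, schema, model, and alias must all be non-empty")
    return f"models:/{catalog}.{schema}.{model}@{alias}"
```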

---

## Model logging (sklearn-flavored)

```python
mlflow.sklearn.log_model(
    sk_model=<fitted_estimator_or_pipeline>,
    artifact_path="model",                  # convention — keep as "model"
    signature=<Signature>,                  # REQUIRED — use infer_signature()
    input_example=<pandas_DataFrame>,       # REQUIRED — 5 real rows
    registered_model_name=None,             # leave None; register separately (cleaner)
    code_paths=<optional_list_of_dependency_files>,
    extra_pip_requirements=<optional_list>, # only if custom deps beyond environment
)
```

**Signature inference:**
```python
from mlflow.models import infer_signature
signature = infer_signature(X_train, model.predict(X_train[:5]))
```

**Other flavors with identical signature:**
- `mlflow.xgboost.log_model(xgb_model=..., ...)`
- `mlflow.pytorch.log_model(pytorch_model=..., ...)`
- `mlflow.tensorflow.log_model(model=..., ...)`
- `mlflow.pyfunc.log_model(python_model=..., artifact_path=..., ...)` — for custom PythonModel wrappers

---

## Explicit registration

```python
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",      # "runs:/<run_id>/<artifact_path>"
    name="<catalog>.<schema>.<model_name>", # three-level, not optional
    tags=<optional_dict>,
)
# result.name: str — fully qualified name
# result.version: str — newly-created version (e.g., "1", "2")
```

---

## Alias management

```python
from mlflow import MlflowClient
client = MlflowClient()

# Set (creates if missing, moves if exists)
client.set_registered_model_alias(
    name="<catalog>.<schema>.<model_name>",
    alias="champion",           # or "challenger", or custom
    version="<version_number>", # accepts str or int
)

# Get current alias mapping
model = client.get_registered_model("<catalog>.<schema>.<model_name>")
print(model.aliases) # {"champion": "3", "challenger": "4"}

# Delete
client.delete_registered_model_alias(
    name="<catalog>.<schema>.<model_name>",
    alias="challenger",
)
```
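
The set/get/delete calls above compose into a single promotion step. A sketch, assuming `client` is an `MlflowClient` with the UC registry URI already set; the helper name `promote_challenger` is illustrative:

```python
def promote_challenger(client, name):
    """Re-point @champion at the version currently tagged @challenger,
    then retire the @challenger alias. `client` is an MlflowClient."""
    challenger = client.get_model_version_by_alias(name, "challenger")
    client.set_registered_model_alias(name, "champion", challenger.version)
    client.delete_registered_model_alias(name, "challenger")
    return challenger.version
```

Downstream loaders using `models:/<name>@champion` pick up the promoted version on their next load.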

---

## Loading — notebook / single-node

```python
model = mlflow.pyfunc.load_model(
    model_uri="models:/<catalog>.<schema>.<model_name>@champion",
)

# Predict on a pandas DataFrame matching the signature
predictions = model.predict(features_df)
```

**Returns:** `mlflow.pyfunc.PyFuncModel`, regardless of the original flavor. Use `.metadata.signature` to inspect the expected input/output schema.

---

## Loading — distributed / Lakeflow SDP

```python
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/<catalog>.<schema>.<model_name>@champion",
    result_type="double", # or "array<double>" for multi-output
    env_manager="local",  # "local" | "virtualenv" | "conda"
)

# Apply to a Spark DataFrame
df_with_predictions = df.withColumn(
    "prediction",
    predict_udf("feature_a", "feature_b", "feature_c"),
)
```

**Construct ONCE at module scope** in Lakeflow pipelines. See `GOTCHAS.md` #11.

---

## Model introspection

```python
from mlflow.models import get_model_info

info = get_model_info("models:/<catalog>.<schema>.<model_name>@champion")
info.signature # ModelSignature with inputs/outputs
info.flavors # {"sklearn": {...}, "python_function": {...}}
info.utc_time_created
info.model_uuid
```

Useful when debugging load-vs-predict mismatches.

---

## Run + experiment queries (introspection)

```python
runs = mlflow.search_runs(
    experiment_names=["/Users/me@company.com/forecasting"],
    filter_string="metrics.r2 > 0.8",
    order_by=["metrics.r2 DESC"],
    max_results=5,
)
# Returns a pandas DataFrame with run_id, metrics, params, etc.

best_run_id = runs.iloc[0]["run_id"]
```

---

## SQL introspection (UC-native)

```sql
-- Does the model exist and which aliases are set?
DESCRIBE MODEL <catalog>.<schema>.<model_name>;

-- List all model versions
SHOW MODEL VERSIONS ON MODEL <catalog>.<schema>.<model_name>;

-- Check grants
SHOW GRANTS ON MODEL <catalog>.<schema>.<model_name>;
SHOW GRANTS ON SCHEMA <catalog>.<schema>;
```

---

## What's NOT in this skill

If you see these in code, you're likely in the wrong skill:

| API | Belongs in |
|-----|------------|
| `mlflow.genai.evaluate(...)` | `databricks-mlflow-evaluation` |
| `@scorer` decorator, `GuidelinesJudge`, etc. | `databricks-mlflow-evaluation` |
| `databricks.sdk.service.serving.EndpointCoreConfigInput` | `databricks-model-serving` |
| `ai_query('<custom-uc-model>', ...)` | Wrong pattern — use `pyfunc.load_model` or `spark_udf` instead (see `GOTCHAS.md` #9) |
| `transition_model_version_stage(...)` | Deprecated — use aliases (see `GOTCHAS.md` #6) |