diff --git a/databricks-skills/databricks-mlflow-ml/SKILL.md b/databricks-skills/databricks-mlflow-ml/SKILL.md
new file mode 100644
index 00000000..cb3f7d0b
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/SKILL.md
@@ -0,0 +1,129 @@
---
name: databricks-mlflow-ml
description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that."
---

# MLflow + Unity Catalog — Classic ML

## Before Writing Any Code

1. **Read `GOTCHAS.md`** — 14 common mistakes that cause silent failures or wasted time
2. **Read `CRITICAL-interfaces.md`** — exact API signatures and the `models:/` URI format

## End-to-End Workflows

Follow the workflow that matches your goal. Each step indicates which reference files to read.

### Workflow 1: Train → Register → Batch Score (most common)

For building a production-shape classic ML model with UC-native lineage. Covers the full path from raw features to predictions in a downstream table.

+ +| Step | Action | Reference Files | +|------|--------|-----------------| +| 1 | Create experiment with UC volume artifact_location | `patterns-experiment-setup.md` (Pattern 1) | +| 2 | Train model with signature + input_example | `patterns-training.md` (Patterns 1–3) | +| 3 | Register to Unity Catalog with three-level name | `patterns-uc-registration.md` (Patterns 1–2) | +| 4 | Set `@champion` alias | `patterns-uc-registration.md` (Pattern 3) | +| 5 | Verify registration (Navigator check) | `patterns-uc-registration.md` (Pattern 4) + `GOTCHAS.md` #5 | +| 6 | Load + score in notebook (Tier 1) | `patterns-batch-inference.md` (Patterns 1–2) | +| 7 | Optional: Lakeflow SDP batch via `spark_udf` | `patterns-batch-inference.md` (Patterns 3–4) | + +### Workflow 2: Retrain + Promote (A/B pattern) + +For adding a new version of an already-registered model and promoting it without touching downstream loader code. + +| Step | Action | Reference Files | +|------|--------|-----------------| +| 1 | Train new version, log to same UC model name | `patterns-training.md` (Pattern 4) | +| 2 | Register as new version | `patterns-uc-registration.md` (Pattern 2) | +| 3 | Set `@challenger` alias | `patterns-uc-registration.md` (Pattern 3) | +| 4 | Validate `@challenger` predictions vs `@champion` | `patterns-batch-inference.md` (Pattern 5) | +| 5 | Swap aliases (`@challenger` → `@champion`) | `patterns-uc-registration.md` (Pattern 5) | + +Downstream loader code that uses `models:/catalog.schema.model@champion` picks up the new version on next load — no code change needed. + +### Workflow 3: Debugging a Failed Registration or Load + +For the two most common support questions: "why did my model go to workspace registry?" and "why does pyfunc.load_model fail?" 
| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Verify registry URI is set to `databricks-uc` | `GOTCHAS.md` #1 |
| 2 | Verify three-level name | `GOTCHAS.md` #2 |
| 3 | Confirm model appears in Catalog Explorer | `patterns-uc-registration.md` (Pattern 4) |
| 4 | Check `CREATE MODEL` permissions | `GOTCHAS.md` #7 |
| 5 | Diagnose load failures | `GOTCHAS.md` #3, #8, #11 |

## Quick Start

The minimum viable path from untrained model to UC-registered, notebook-scored:

```python
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient

# 1. Configure: UC registry + UC volume for artifacts (both required)
mlflow.set_registry_uri("databricks-uc")

# set_experiment() alone cannot pin artifact_location — create the experiment
# with the UC volume first (skipped if it already exists), then activate it
EXPERIMENT = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT) is None:
    mlflow.create_experiment(
        name=EXPERIMENT,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT)

# 2. Train + log
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + alias
MODEL_NAME = "my_catalog.my_schema.my_model"
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load + predict (in any notebook, anywhere)
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = model.predict(X_test)
```

## Why This Skill Exists

Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**.
The distinction matters because the APIs diverged:

| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

If you're training a forecasting / classification / regression model, registering it to UC, and scoring it in a notebook or Lakeflow pipeline — this skill. If you're evaluating an LLM agent's output quality — evaluation skill. If you're exposing a model behind an HTTP endpoint — model-serving skill.

## Common Issues

| Issue | Solution |
|-------|----------|
| **Model registered but not visible in Catalog Explorer** | Missing `mlflow.set_registry_uri("databricks-uc")`. See `GOTCHAS.md` #1. |
| **`RestException: INVALID_PARAMETER_VALUE` on `register_model`** | Two-level name used. UC requires `catalog.schema.name`. See `GOTCHAS.md` #2. |
| **Experiment creation fails with storage errors** | Missing `artifact_location` pointing at a UC volume. See `GOTCHAS.md` #4. |
| **`PERMISSION_DENIED: CREATE MODEL`** | Pair/user needs `CREATE MODEL ON SCHEMA <schema>`. See `GOTCHAS.md` #7. |
| **`pyfunc.load_model` returns but `predict()` fails** | Signature wasn't logged; inputs don't coerce. See `GOTCHAS.md` #8. |
| **Agent proposes `ai_query` for batch inference** | Wrong primitive — that requires a serving endpoint. Use `pyfunc.load_model` or `spark_udf`. See `GOTCHAS.md` #9. |

## Reference Files

- [`GOTCHAS.md`](references/GOTCHAS.md) — 14 common mistakes + fixes
- [`CRITICAL-interfaces.md`](references/CRITICAL-interfaces.md) — API signatures + `models:/` URI format
- [`patterns-experiment-setup.md`](references/patterns-experiment-setup.md) — experiment creation with UC volume artifact_location
- [`patterns-training.md`](references/patterns-training.md) — logging models with signature + input_example + autologging
- [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion
- [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger
- [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points

## Runtime compatibility

Patterns verified against **MLflow 3.11** on **Lakeflow SDP serverless compute version 5** (default at time of writing). All APIs used (`set_registry_uri`, `log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf`) are compatible with MLflow 2.16+ as well, so the patterns work on older classic Databricks Runtimes that still ship 2.x. Where 3.x behaviour diverges (e.g., `artifact_path` deprecation → use `name=`), GOTCHAS.md calls it out.

diff --git a/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md b/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md
new file mode 100644
index 00000000..a40483c5
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md
@@ -0,0 +1,219 @@
# CRITICAL-interfaces — Exact API signatures

The minimum set of APIs that every classic-ML + UC workflow touches. Copy-pasteable, with the exact arguments that matter.

+ +--- + +## Registry URI configuration + +```python +mlflow.set_registry_uri("databricks-uc") # Call at the start of every session +mlflow.get_registry_uri() # Returns "databricks-uc" if set correctly +``` + +**Must be called BEFORE** any `register_model` or `load_model` call. Idempotent to repeat. + +--- + +## Experiment creation with UC volume artifact_location + +```python +mlflow.set_experiment( + experiment_name="/Users//", + artifact_location="dbfs:/Volumes////", +) +``` + +**`artifact_location` is required** for UC-enforced workspaces. The volume must exist: + +```sql +CREATE VOLUME IF NOT EXISTS ..; +``` + +--- + +## `models:/` URI format + +All load / deploy / spark_udf calls use this URI. **One format to memorize:** + +``` +models:/..@ +``` + +Examples: +``` +models:/my_catalog.my_schema.grocery_forecaster@champion +models:/my_catalog.my_schema.grocery_forecaster@challenger +``` + +**Avoid** these forms (either legacy, or not-UC-native): +``` +models:/grocery_forecaster/3 # workspace registry, version number +models:/my_schema.grocery_forecaster/3 # invalid in UC +``` + +--- + +## Model logging (sklearn-flavored) + +```python +mlflow.sklearn.log_model( + sk_model=, + artifact_path="model", # convention — keep as "model" + signature=, # REQUIRED — use infer_signature() + input_example=, # REQUIRED — 5 real rows + registered_model_name=None, # leave None; register separately (cleaner) + code_paths=, + extra_pip_requirements=, # only if custom deps beyond environment +) +``` + +**Signature inference:** +```python +from mlflow.models import infer_signature +signature = infer_signature(X_train, model.predict(X_train[:5])) +``` + +**Other flavors with identical signature:** +- `mlflow.xgboost.log_model(xgb_model=..., ...)` +- `mlflow.pytorch.log_model(pytorch_model=..., ...)` +- `mlflow.tensorflow.log_model(model=..., ...)` +- `mlflow.pyfunc.log_model(python_model=..., artifact_path=..., ...)` — for custom PythonModel wrappers + +--- + +## Explicit 
registration + +```python +result = mlflow.register_model( + model_uri=f"runs:/{run_id}/model", # "runs://" + name="..", # three-level, not optional + tags=, +) +# result.name: str — fully qualified name +# result.version: str — newly-created version (e.g., "1", "2") +``` + +--- + +## Alias management + +```python +from mlflow import MlflowClient +client = MlflowClient() + +# Set (creates if missing, moves if exists) +client.set_registered_model_alias( + name="..", + alias="champion", # or "challenger", or custom + version="", # accepts str or int +) + +# Get current alias mapping +model = client.get_registered_model("..") +print(model.aliases) # {"champion": "3", "challenger": "4"} + +# Delete +client.delete_registered_model_alias( + name="..", + alias="challenger", +) +``` + +--- + +## Loading — notebook / single-node + +```python +model = mlflow.pyfunc.load_model( + model_uri="models:/..@champion", +) + +# Predict on a pandas DataFrame matching the signature +predictions = model.predict(features_df) +``` + +**Returns:** `mlflow.pyfunc.PyFuncModel`, regardless of the original flavor. Expose `.metadata.signature` for schema. + +--- + +## Loading — distributed / Lakeflow SDP + +```python +predict_udf = mlflow.pyfunc.spark_udf( + spark, + model_uri="models:/..@champion", + result_type="double", # or "array" for multi-output + env_manager="local", # "local" | "virtualenv" | "conda" +) + +# Apply to a Spark DataFrame +df_with_predictions = df.withColumn( + "prediction", + predict_udf("feature_a", "feature_b", "feature_c"), +) +``` + +**Construct ONCE at module scope** in Lakeflow pipelines. See `GOTCHAS.md` #11. + +--- + +## Model introspection + +```python +from mlflow.models import get_model_info + +info = get_model_info("models:/..@champion") +info.signature # ModelSignature with inputs/outputs +info.flavors # {"sklearn": {...}, "python_function": {...}} +info.utc_time_created +info.model_uuid +``` + +Useful when debugging load-vs-predict mismatches. 
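As a concrete debugging aid, the signature's declared inputs can be diffed against the columns you are about to score. A minimal sketch — `diff_input_schema` is an illustrative helper, not an MLflow API, and the column names are made up:

```python
def diff_input_schema(expected_cols, actual_cols):
    """Compare a signature's declared inputs against the scoring columns.

    expected_cols would come from model.metadata.get_input_schema().input_names();
    actual_cols from list(features_df.columns). Returns (missing, extra) so a
    mismatch is obvious before calling predict().
    """
    missing = [c for c in expected_cols if c not in actual_cols]
    extra = [c for c in actual_cols if c not in expected_cols]
    return missing, extra

# A typo'd feature column shows up on both sides of the diff:
missing, extra = diff_input_schema(
    ["turnover_lag_1", "turnover_lag_12", "rolling_3m_avg"],
    ["turnover_lag_1", "turnover_lag12", "rolling_3m_avg"],
)
# missing == ["turnover_lag_12"], extra == ["turnover_lag12"]
```

Empty lists on both sides mean the frame at least has the right column names; dtype coercion issues are a separate check via `info.signature`.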
+ +--- + +## Run + experiment queries (introspection) + +```python +runs = mlflow.search_runs( + experiment_names=["/Users/me@company.com/forecasting"], + filter_string="metrics.r2 > 0.8", + order_by=["metrics.r2 DESC"], + max_results=5, +) +# Returns a pandas DataFrame with run_id, metrics, params, etc. + +best_run_id = runs.iloc[0]["run_id"] +``` + +--- + +## SQL introspection (UC-native) + +```sql +-- Does the model exist and which aliases are set? +DESCRIBE MODEL ..; + +-- List all model versions +SHOW MODEL VERSIONS ON MODEL ..; + +-- Check grants +SHOW GRANTS ON MODEL ..; +SHOW GRANTS ON SCHEMA .; +``` + +--- + +## What's NOT in this skill + +If you see these in code, you're likely in the wrong skill: + +| API | Belongs in | +|-----|------------| +| `mlflow.genai.evaluate(...)` | `databricks-mlflow-evaluation` | +| `@scorer` decorator, `GuidelinesJudge`, etc. | `databricks-mlflow-evaluation` | +| `databricks.sdk.service.serving.EndpointCoreConfigInput` | `databricks-model-serving` | +| `ai_query('', ...)` | Wrong pattern — use `pyfunc.load_model` or `spark_udf` instead (see `GOTCHAS.md` #9) | +| `transition_model_version_stage(...)` | Deprecated — use aliases (see `GOTCHAS.md` #6) | diff --git a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md new file mode 100644 index 00000000..a2ab11d4 --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md @@ -0,0 +1,301 @@ +# GOTCHAS — Classic ML on MLflow + Unity Catalog + +Fourteen mistakes that silently waste hours. Read before writing any code. + +--- + +## 1. Missing `mlflow.set_registry_uri("databricks-uc")` → workspace registry + +**Symptom:** `register_model` succeeds, but the model doesn't appear in Catalog Explorer. It's in the legacy **workspace registry** (visible under the MLflow icon in the left nav), not Unity Catalog. 
**Fix:**
```python
import mlflow
mlflow.set_registry_uri("databricks-uc")  # MUST come before register_model / load_model
```

**Verification:**
```python
assert mlflow.get_registry_uri() == "databricks-uc"
```

**Why it bites:** defaults still route to the workspace registry for backward compatibility. The only indicator you missed it is a URL that shows `/ml/models/` instead of `/explore/data/models/<catalog>/<schema>/<model>`.

---

## 2. Two-level model names → rejected or wrong registry

**Symptom:** `RestException: INVALID_PARAMETER_VALUE: Invalid model name`, or the model registers to the workspace registry silently.

**Fix:** always use three-level names: `catalog.schema.model_name`.

```python
# WRONG
mlflow.register_model(model_uri, "my_model")
mlflow.register_model(model_uri, "my_schema.my_model")

# CORRECT
mlflow.register_model(model_uri, "my_catalog.my_schema.my_model")
```

**Why it bites:** the error message depends on the registry URI. With UC URI + two-level name → parameter error. With workspace URI + two-level name → registers successfully to workspace (the silently-wrong case).

---

## 3. Loading with version number instead of alias

**Symptom:** works today, breaks tomorrow when someone registers a new version. You've hard-coded a version number into every downstream consumer.

**Fix:** load via alias, never version.

```python
# FRAGILE — every retrain requires updating every loader
model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model/3")

# STABLE — promote a new version by moving @champion; no loader changes
model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model@champion")
```

**Why it bites:** aliases are the UC-native way to decouple loader code from model lifecycle. Version numbers are legacy. New infrastructure (Lakeflow, Genie) assumes alias-based loading.

---

## 4. Experiment creation without UC volume `artifact_location`

**Symptom:** experiment creates, but any `log_model` call fails with storage / permission errors. Or artifacts land in DBFS root (deprecated) and can't be loaded downstream.

**Fix:** when you create the experiment, pin it to a UC volume.

```python
# Prerequisite: the UC volume must exist
# CREATE VOLUME my_catalog.my_schema.mlflow_artifacts;

# set_experiment() cannot set artifact_location — create first, then activate
EXPERIMENT = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT) is None:
    mlflow.create_experiment(
        name=EXPERIMENT,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT)
```

**Why it bites:** the default `artifact_location` used to be DBFS root. Unity-Catalog-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors. Pointing at a UC volume makes artifact storage first-class-governed and keeps lineage intact.

**When the experiment already exists without a UC volume:** you can't retroactively change `artifact_location`. Either (a) delete + recreate, or (b) create a new experiment. Don't try to relocate artifacts manually.

---

## 5. Trusting `register_model` success without verifying in UC

**Symptom:** `register_model` returns a `ModelVersion` object. Feels successful. But the model is in workspace registry, or the version number is stale, or an alias wasn't set.

**Fix:** always verify explicitly.

```sql
-- In a SQL cell or notebook:
DESCRIBE MODEL my_catalog.my_schema.my_model;
```

Or via Python:
```python
from mlflow import MlflowClient
model = MlflowClient().get_registered_model("my_catalog.my_schema.my_model")
assert "champion" in model.aliases, "Missing @champion alias"
```

Or visually: open Catalog Explorer → `my_catalog` → `my_schema` → **Models** tab. If the model is under MLflow's workspace UI instead, you registered to the wrong place (see #1).

**Why it bites:** `register_model`'s return value only tells you a version was created. It doesn't tell you *where* or *with what aliases*.
The Navigator's V-step in pair programming: verify before trusting. + +--- + +## 6. Setting the alias to `"production"` or `"staging"` (legacy MLflow stages) + +**Symptom:** you remember MLflow had `stage="Production"` / `"Staging"` transitions. You try the same with aliases and nothing recognizes them. + +**Fix:** UC model aliases are free-form labels. The conventions are `@champion` (current winner) and `@challenger` (under evaluation). MLflow stages are deprecated in the UC registry. + +```python +# WRONG (legacy stage concept) +MlflowClient().set_registered_model_alias(name, "Production", version) + +# CORRECT +MlflowClient().set_registered_model_alias(name, "champion", version) +``` + +**Why it bites:** the old `transition_model_version_stage()` API still exists but is a no-op on UC-registered models. No error, no effect. + +--- + +## 7. Missing `CREATE MODEL ON SCHEMA` permission + +**Symptom:** `RestException: PERMISSION_DENIED: User ... does not have CREATE MODEL permission`. + +**Fix:** grant the permission at the schema level. + +```sql +GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`; +-- Or for a group: +GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `data-science-team`; +``` + +**Why it bites:** workspace admins often assume `USE SCHEMA` covers model registration. It doesn't — `CREATE MODEL` is a separate UC privilege that must be granted explicitly. + +**Verification:** +```sql +SHOW GRANTS ON SCHEMA my_catalog.my_schema; +``` + +--- + +## 8. Logging a model without `signature` or `input_example` + +**Symptom:** `mlflow.pyfunc.load_model(...)` returns an object, but `.predict(spark_df)` raises cryptic coercion errors. Or predictions silently cast (int → float, string → category) and produce wrong numbers. + +**Fix:** always log both. 
+ +```python +from mlflow.models import infer_signature + +signature = infer_signature(X_train, model.predict(X_train[:5])) +mlflow.sklearn.log_model( + sk_model=model, + artifact_path="model", + signature=signature, + input_example=X_train.iloc[:5], # 5 real rows for the pyfunc wrapper to introspect +) +``` + +**Why it bites:** without a signature, the pyfunc wrapper can't coerce inputs — it accepts whatever you pass, then downstream operations (especially `spark_udf`) fail or produce wrong results. `input_example` is what `pyfunc.load_model` reads to build the wrapper's input coercer. + +--- + +## 9. `ai_query` used for batch inference on a custom UC model + +**Symptom:** you want batch inference on your custom-registered model. You see `ai_query()` in Genie docs and assume it works. It doesn't (for custom models) — `ai_query` only invokes **serving endpoints**, and your UC-registered model isn't behind one unless you deployed a serving endpoint for it. + +**Fix:** for batch inference, use `pyfunc.load_model` (notebook) or `pyfunc.spark_udf` (Lakeflow SDP pipeline). + +```python +# WRONG for custom UC models — requires a serving endpoint +spark.sql(f"SELECT ai_query('{MODEL_NAME}', features) FROM silver_features") + +# CORRECT — notebook batch (single node) +model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") +predictions = model.predict(features_pandas_df) + +# CORRECT — Lakeflow SDP batch (distributed) +predict_udf = mlflow.pyfunc.spark_udf(spark, f"models:/{MODEL_NAME}@champion", result_type="double") +silver_features.withColumn("prediction", predict_udf(*feature_cols)) +``` + +**Why it bites:** `ai_query` *is* the right call for Foundation Model API endpoints (`ai_query('databricks-dbrx-instruct', prompt)`). The naming overlap leads to wrong assumptions for custom models. + +--- + +## 10. Trying to delete / re-register a model at the same version number + +**Symptom:** `RestException: ALREADY_EXISTS` when re-registering. 
You can't reuse version numbers. + +**Fix:** UC versions are monotonically-increasing and immutable. To supersede a bad version, register a new version and move `@champion` to it. The old version stays in history for lineage. + +```python +new_result = mlflow.register_model(new_run_uri, MODEL_NAME) +MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", new_result.version) +# Old version is still there; that's correct. Lineage preserved. +``` + +**Why it bites:** habits from the workspace registry (where deletion was forgiving) don't transfer. UC treats model versions as first-class auditable artifacts. + +--- + +## 11. `pyfunc.spark_udf` constructed inside a function call + +**Symptom:** in a Lakeflow SDP `@dp.materialized_view`, the UDF is constructed every time the view evaluates — slow and sometimes fails with serialization errors. + +**Fix:** construct the UDF at module scope, reuse it inside the view. + +```python +import mlflow +import databricks.declarative_pipelines as dp + +# Construct ONCE, at module scope +mlflow.set_registry_uri("databricks-uc") +predict_udf = mlflow.pyfunc.spark_udf( + spark, + f"models:/{MODEL_NAME}@champion", + result_type="double", +) + +@dp.materialized_view +def gold_forecast(): + return spark.read.table("silver_features").withColumn( + "prediction", + predict_udf("feat_a", "feat_b", "feat_c"), + ) +``` + +**Why it bites:** Lakeflow SDP may evaluate the function definition multiple times. Model deserialization is expensive — don't repeat it. + +--- + +## 12. `mlflow[databricks]` extras missing when running outside Databricks + +**Symptom:** training + logging works; `register_model` fails with `MlflowException: Unable to import necessary dependencies to access model version files in Unity Catalog` — root cause `ModuleNotFoundError: No module named 'azure'` (for Azure-hosted workspaces) or `'boto3'` (AWS) / `'google.cloud'` (GCP). 
**Fix:** install the `databricks` extras, which pull in the cloud-storage SDKs MLflow needs to stage artifacts into the UC-managed location.

```bash
pip install 'mlflow[databricks]'
# or, for a lighter install:
pip install 'mlflow-skinny[databricks]'
```

**Why it bites:** plain `pip install mlflow` leaves out the cloud-provider SDKs because they're large and most local workflows don't need them. UC registration REQUIRES them because the registry stages artifacts into cloud-managed storage (Azure ADLS / S3 / GCS), and MLflow uses the provider's SDK for the upload. Local `log_model` works fine (artifacts go to the tracking server); registration doesn't.

**When it most commonly hits:** running training scripts from a laptop, CI runner, or non-Databricks compute — anywhere that isn't a Databricks cluster (which ships the extras pre-installed).

---

## 13. `artifact_path=` parameter is deprecated; new name is `name=`

**Symptom:** warning in logs: ``WARNING mlflow.models.model: `artifact_path` is deprecated. Please use `name` instead.`` Still works today; may break in a future MLflow major version.

**Fix:** use `name=` instead of `artifact_path=` in `log_model` calls.

```python
# OLD (still works, warns)
mlflow.sklearn.log_model(sk_model=model, artifact_path="model", ...)

# NEW (preferred, no warning)
mlflow.sklearn.log_model(sk_model=model, name="model", ...)
```

**Why it bites:** most online tutorials and training courses still use `artifact_path`. The rename shipped in MLflow 2.16. `name=` semantics are identical — it's still the within-run artifact folder; the parameter was renamed, not repurposed.

---

## 14. Custom preprocessing not captured in the logged model

**Symptom:** in the training notebook, predictions are accurate. After `pyfunc.load_model(...)`, predictions are garbage.
The pipeline works in training because you're calling `scaler.transform()` manually; at inference time, nobody calls the scaler.

**Fix:** wrap preprocessing + model in an `sklearn.pipeline.Pipeline` (or a custom `PythonModel` for non-sklearn preprocessing). Log the whole pipeline.

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor()),
])
pipeline.fit(X_train, y_train)

# Logs both the fitted scaler AND the model as a single artifact
mlflow.sklearn.log_model(
    sk_model=pipeline,
    artifact_path="model",
    signature=infer_signature(X_train, pipeline.predict(X_train[:5])),
    input_example=X_train.iloc[:5],
)
```

**Why it bites:** the most painful post-registration bug. Training and inference code paths are different files; the divergence is invisible until predictions are obviously wrong.

diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md b/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md
new file mode 100644
index 00000000..ed4d86ae
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md
@@ -0,0 +1,244 @@
# patterns-batch-inference

Loading a UC-registered model and scoring features in batch. Two scales — interactive notebook (Patterns 1–2) and distributed Lakeflow pipeline (Patterns 3–4). Plus A/B validation (Pattern 5).

---

## Pattern 1: Notebook batch inference — pandas path

For interactive exploration, ad-hoc scoring, and sample sizes up to ~10k rows.

+ +```python +import mlflow + +mlflow.set_registry_uri("databricks-uc") + +model = mlflow.pyfunc.load_model( + "models:/my_catalog.my_schema.grocery_forecaster@champion" +) + +# Load a sample of features (LIMIT in SQL to avoid loading full table) +features = ( + spark.table("my_catalog.my_schema.silver_features") + .orderBy("month_date") + .limit(1000) + .toPandas() +) + +# The model's signature determines which columns it expects +feature_cols = model.metadata.get_input_schema().input_names() + +predictions = model.predict(features[feature_cols]) + +# Attach predictions for display/export +features["prediction"] = predictions +display(spark.createDataFrame(features)) +``` + +--- + +## Pattern 2: Notebook batch inference with chart + +Same pattern, adds a predicted-vs-actual visual. Useful as a demo artifact. + +```python +import matplotlib.pyplot as plt + +# (continuing from Pattern 1) +features_with_pred = features.sort_values("month_date") + +fig, ax = plt.subplots(figsize=(10, 5)) +ax.plot(features_with_pred["month_date"], features_with_pred["actual"], + label="Actual", linewidth=2) +ax.plot(features_with_pred["month_date"], features_with_pred["prediction"], + label="Predicted", linestyle="--", linewidth=2) +ax.set_xlabel("Month") +ax.set_ylabel("Turnover (millions)") +ax.set_title(f"Forecast — {model.metadata.run_id[:8]}") +ax.legend() +plt.xticks(rotation=45) +plt.tight_layout() +display(fig) +``` + +--- + +## Pattern 3: Lakeflow SDP batch via `spark_udf` + +For scheduled batch inference at scale. Distributes across Spark executors — no per-row Python overhead, no serving endpoint. 
+ +```python +# src/gold/gold_forecast.py +import mlflow +import databricks.declarative_pipelines as dp + +# Construct the UDF ONCE at module scope — see GOTCHAS #11 +mlflow.set_registry_uri("databricks-uc") + +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" +predict_udf = mlflow.pyfunc.spark_udf( + spark, + model_uri=f"models:/{MODEL_NAME}@champion", + result_type="double", + env_manager="local", # "local" avoids conda/virtualenv setup overhead +) + +@dp.materialized_view( + comment="Grocery turnover forecast from @champion model", +) +def gold_forecast(): + return ( + spark.read.table("my_catalog.my_schema.silver_features") + .withColumn( + "forecast_turnover_millions", + predict_udf( + "turnover_lag_1", + "turnover_lag_12", + "rolling_3m_avg", + "state_share_of_national", + # ... pass each signature input column in the order the signature declares + ), + ) + ) +``` + +**What this gives you:** +- A `gold_forecast` table that refreshes on every pipeline run +- Distributed scoring (no serving endpoint, no auth token) +- Full UC lineage: `silver_features` → `gold_forecast` via `grocery_forecaster@champion` +- Genie can query it: *"what's the forecast for each state next month?"* + +--- + +## Pattern 4: `spark_udf` with `result_type` for multi-output models + +Multi-output regressors or classifiers need a richer result type. 
+
+```python
+from pyspark.sql.types import ArrayType, DoubleType, StringType, StructType, StructField
+
+# Multi-output regression — model returns 2 predictions per row
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri=f"models:/{MODEL_NAME}@champion",
+    result_type=ArrayType(DoubleType()),
+)
+
+# Classifier with probabilities
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri=f"models:/{MODEL_NAME}@champion",
+    result_type=StructType([
+        StructField("class", StringType(), True),
+        StructField("confidence", DoubleType(), True),
+    ]),
+)
+```
+
+---
+
+## Pattern 5: A/B validation — compare `@challenger` vs `@champion`
+
+Run both models on a validation set, compare error metrics, decide whether to promote.
+
+```python
+import mlflow
+from sklearn.metrics import mean_absolute_error, root_mean_squared_error
+
+mlflow.set_registry_uri("databricks-uc")
+MODEL_NAME = "my_catalog.my_schema.grocery_forecaster"
+
+champion = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
+challenger = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@challenger")
+
+# Hold-out validation set (not seen during training)
+validation = spark.table(f"{MODEL_NAME.rsplit('.', 1)[0]}.validation_features").toPandas()
+feature_cols = champion.metadata.get_input_schema().input_names()
+actuals = validation["turnover_millions"]
+
+champion_preds = champion.predict(validation[feature_cols])
+challenger_preds = challenger.predict(validation[feature_cols])
+
+champion_rmse = root_mean_squared_error(actuals, champion_preds)
+challenger_rmse = root_mean_squared_error(actuals, challenger_preds)
+print(f"Champion RMSE: {champion_rmse:.2f}")
+print(f"Challenger RMSE: {challenger_rmse:.2f}")
+print(f"Champion MAE: {mean_absolute_error(actuals, champion_preds):.2f}")
+print(f"Challenger MAE: {mean_absolute_error(actuals, challenger_preds):.2f}")
+
+# Decision logic — promote if challenger beats champion by >2%
+if challenger_rmse < champion_rmse * 0.98:
+    print("→ Promote @challenger. See patterns-uc-registration.md Pattern 5.")
+else:
+    print("→ Keep @champion. Delete @challenger.")
+```
+
+---
+
+## Pattern 6: Structured streaming inference
+
+For models scoring events as they arrive (not batch-scheduled).
+
+```python
+import mlflow
+from pyspark.sql.functions import col
+
+# MODEL_NAME and feature_cols as defined in Pattern 5
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri=f"models:/{MODEL_NAME}@champion",
+    result_type="double",
+)
+
+events = (
+    spark.readStream
+    .format("delta")
+    .table("my_catalog.my_schema.silver_events")
+)
+
+scored = events.withColumn(
+    "prediction",
+    predict_udf(*[col(c) for c in feature_cols]),
+)
+
+(
+    scored.writeStream
+    .format("delta")
+    .outputMode("append")
+    .option("checkpointLocation", "dbfs:/Volumes/my_catalog/my_schema/checkpoints/scoring")
+    .toTable("my_catalog.my_schema.gold_scored_events")
+)
+```
+
+For most classic-ML batch use cases, Pattern 3 (Lakeflow SDP) is simpler. Use streaming only when event-time scoring matters.
+
+---
+
+## What NOT to do for batch inference
+
+### Do not use `ai_query` for custom UC models
+
+`ai_query('<endpoint_name>', <input>)` requires the model to be deployed as a **Model Serving endpoint**. UC-registered models are NOT automatically behind an endpoint. Use `pyfunc.load_model` (Pattern 1) or `pyfunc.spark_udf` (Pattern 3) instead.
+
+`ai_query` IS the right call for:
+- Foundation Model API endpoints: `ai_query('databricks-dbrx-instruct', prompt)`
+- Model Serving endpoints you've explicitly provisioned
+
+See `GOTCHAS.md` #9.
+
+### Do not use `mlflow.pyfunc.load_model` for billion-row batches on a single node
+
+Pattern 1 collects to pandas — fine up to ~10k rows, painful beyond ~100k, a non-starter at millions of rows and beyond. For distributed scale, use Pattern 3 (`spark_udf`).
+
+### Do not construct `spark_udf` inside the function body
+
+See `GOTCHAS.md` #11. Construct it once at module scope and reuse it inside `@dp.materialized_view` / `@dp.table`.
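The size guidance above can be folded into a tiny dispatcher — a sketch; the ~100k-row cutover is the same illustrative threshold used in this section, not a hard MLflow limit:

```python
def batch_inference_pattern(row_count: int) -> str:
    """Pick a scoring approach by input size (illustrative threshold)."""
    if row_count <= 100_000:
        # Pattern 1: mlflow.pyfunc.load_model — collect to pandas on one node
        return "pyfunc.load_model"
    # Pattern 3: mlflow.pyfunc.spark_udf — distributed scoring
    return "pyfunc.spark_udf"

print(batch_inference_pattern(5_000))       # pyfunc.load_model
print(batch_inference_pattern(50_000_000))  # pyfunc.spark_udf
```

Tune the threshold to your driver's memory; the point is to make the tier choice explicit rather than defaulting to single-node scoring.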
+
+---
+
+## Troubleshooting batch inference
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `RESOURCE_DOES_NOT_EXIST` on load | Wrong registry URI or two-level name | `GOTCHAS.md` #1, #2 |
+| Predictions are NaN | Input columns in wrong order | Pass columns in the order `model.metadata.get_input_schema().input_names()` declares |
+| `PERMISSION_DENIED: EXECUTE ON MODEL` | No read access to model | `GRANT EXECUTE ON MODEL ... TO <principal>` |
+| `spark_udf` raises `PicklingError` | Model has un-picklable state (e.g., Spark session) | Re-train ensuring the model is pure Python/numpy — don't capture `spark` at training time |
+| Pipeline hangs on `gold_forecast` | Model artifact is large; first load is slow | Normal — subsequent runs are fast (UDF is cached per executor) |
+| Column type mismatch in Spark | UDF expects double; column is int/string | Cast explicitly: `col("feature").cast("double")` |
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md b/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md
new file mode 100644
index 00000000..00c6e2ba
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md
@@ -0,0 +1,141 @@
+# patterns-experiment-setup
+
+Experiments in UC-enforced workspaces need more setup than older MLflow guides show. The critical change: you must pin the experiment's `artifact_location` to a Unity Catalog volume, or `log_model` will fail with storage errors.
+
+---
+
+## Pattern 1: Create experiment with UC volume artifact_location
+
+```python
+import mlflow
+
+mlflow.set_registry_uri("databricks-uc")  # always first
+
+# Prerequisite: the UC volume must exist
+# CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts;
+
+EXPERIMENT_NAME = "/Users/me@company.com/forecasting"
+
+# artifact_location can only be set at creation time, via create_experiment —
+# set_experiment alone does not accept it
+if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
+    mlflow.create_experiment(
+        name=EXPERIMENT_NAME,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+    )
+mlflow.set_experiment(EXPERIMENT_NAME)
+```
+
+**Why both are required:**
+- `name` — the workspace-visible path (browsable from the Experiments UI)
+- `artifact_location` — where logged artifacts (model binaries, plots, datasets) physically live
+
+In older workspaces, `artifact_location` defaulted to DBFS root. UC-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors like:
+
+```
+MlflowException: API request to endpoint /api/2.0/mlflow/runs/log-artifact failed
+with error code 403 != 200. Response body: PERMISSION_DENIED ...
+```
+
+Pointing at a UC volume resolves this AND makes artifacts first-class-governed under UC lineage.
+
+---
+
+## Pattern 2: Create the volume if it doesn't exist (idempotent)
+
+Run once per schema, before any experiment creation:
+
+```python
+spark.sql("""
+    CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts
+    COMMENT 'MLflow experiment artifacts for forecasting models'
+""")
+```
+
+Or via SQL editor:
+
+```sql
+CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts;
+```
+
+**Permissions needed:** `USE SCHEMA` + `CREATE VOLUME`. If missing, request `CREATE VOLUME ON SCHEMA my_catalog.my_schema` from the schema owner.
+
+---
+
+## Pattern 3: Experiment already exists, wrong `artifact_location`
+
+You can't retroactively change `artifact_location`. Three options, in order of preference:
+
+**Option A — New experiment** (cleanest, keeps old runs intact):
+```python
+mlflow.create_experiment(
+    name="/Users/me@company.com/forecasting_v2",  # v2 suffix
+    artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting_v2",
+)
+mlflow.set_experiment("/Users/me@company.com/forecasting_v2")
+# New runs land in v2. Old runs stay in v1 (archive them if you like).
+```
+
+**Option B — Delete + recreate** (loses history; use only if no good runs exist):
+```python
+import mlflow
+from mlflow import MlflowClient
+client = MlflowClient()
+
+exp = client.get_experiment_by_name("/Users/me@company.com/forecasting")
+client.delete_experiment(exp.experiment_id)
+
+mlflow.create_experiment(
+    name="/Users/me@company.com/forecasting",
+    artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+)
+mlflow.set_experiment("/Users/me@company.com/forecasting")
+```
+
+**Option C — Manual relocation of DBFS artifacts to UC volume**: do not do this. Storage paths are resolved at log time and encoded in the run's metadata; moving files doesn't update the pointers.
+
+---
+
+## Pattern 4: Verify experiment is correctly configured
+
+After setup, before training:
+
+```python
+exp = mlflow.get_experiment_by_name("/Users/me@company.com/forecasting")
+assert exp is not None, "Experiment not created"
+assert exp.artifact_location.startswith("dbfs:/Volumes/"), (
+    f"artifact_location is not a UC volume: {exp.artifact_location}"
+)
+print(f"Experiment ID: {exp.experiment_id}")
+print(f"Artifact location: {exp.artifact_location}")
+```
+
+If the assert fails, you have an old experiment pointing at DBFS root. Apply Pattern 3.
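Pattern 4's `startswith` check generalizes to a small reusable helper — pure string logic, a sketch (the three-segment minimum reflects the `catalog/schema/volume` shape of UC volume paths):

```python
def is_uc_volume_path(artifact_location: str) -> bool:
    """True if the location points inside a Unity Catalog volume
    (dbfs:/Volumes/<catalog>/<schema>/<volume>/...), not DBFS root."""
    prefix = "dbfs:/Volumes/"
    if not artifact_location.startswith(prefix):
        return False
    segments = artifact_location[len(prefix):].strip("/").split("/")
    # need at least catalog, schema, and volume segments, all non-empty
    return len(segments) >= 3 and all(segments[:3])

print(is_uc_volume_path("dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting"))  # True
print(is_uc_volume_path("dbfs:/databricks/mlflow-tracking/12345"))  # False
```

Drop it into the Pattern 4 verification cell in place of the raw `startswith` assert if you want the stricter shape check.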
+
+---
+
+## Pattern 5: Workspace-path vs Repo-path experiments
+
+MLflow accepts two conventions for `experiment_name`:
+
+```python
+# Workspace-path convention (recommended for collaborative experiments)
+mlflow.set_experiment(experiment_name="/Users/me@company.com/forecasting")
+
+# Repo-path convention (only if you're running from a Git folder)
+mlflow.set_experiment(experiment_name="/Repos/me@company.com/my-repo/forecasting")
+```
+
+**Prefer workspace path** for experiments shared across pairs/teams. Repo-path experiments become orphans when the repo is deleted.
+
+**Both need `artifact_location` pointing at a UC volume.** The path convention only affects where the experiment metadata is browsable, not where artifacts live.
+
+---
+
+## Pattern 6: Running from a notebook cell with autoselected experiment
+
+Databricks notebooks auto-associate runs with an experiment matching the notebook's workspace path:
+
+```python
+# In a notebook at /Users/me@company.com/Notebooks/train.py,
+# Databricks auto-associates runs with an experiment at the notebook path —
+# BUT that experiment's artifact_location defaults to DBFS root.
+# Override by explicitly selecting an experiment created per Pattern 1:
+
+EXPERIMENT_NAME = "/Users/me@company.com/Notebooks/train"
+if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
+    mlflow.create_experiment(
+        name=EXPERIMENT_NAME,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/train",
+    )
+mlflow.set_experiment(EXPERIMENT_NAME)
+```
+
+If the notebook-path experiment already exists with a DBFS-root `artifact_location`, apply Pattern 3. Either way, call `set_experiment` explicitly before the first `start_run` — the artifact_location fix must be applied regardless of notebook auto-association.
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-training.md b/databricks-skills/databricks-mlflow-ml/references/patterns-training.md
new file mode 100644
index 00000000..017e3cfb
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-training.md
@@ -0,0 +1,205 @@
+# patterns-training
+
+How to log classic ML models (sklearn / XGBoost / PyTorch) so they register cleanly and load correctly downstream. The two load-bearing decisions: `signature` and `input_example`.
+
+---
+
+## Pattern 1: Baseline sklearn training loop
+
+```python
+import mlflow
+import mlflow.sklearn
+from sklearn.ensemble import GradientBoostingRegressor
+from sklearn.metrics import root_mean_squared_error, mean_absolute_error
+from sklearn.model_selection import train_test_split
+from mlflow.models import infer_signature
+
+mlflow.set_registry_uri("databricks-uc")
+# Experiment must already exist with a UC volume artifact_location —
+# see patterns-experiment-setup.md Pattern 1
+mlflow.set_experiment("/Users/me@company.com/forecasting")
+
+X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
+
+with mlflow.start_run(run_name="gbr_baseline"):
+    model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
+    model.fit(X_train, y_train)
+
+    # Signature + input_example are both load-bearing
+    signature = infer_signature(X_train, model.predict(X_train[:5]))
+
+    mlflow.sklearn.log_model(
+        sk_model=model,
+        artifact_path="model",
+        signature=signature,
+        input_example=X_train.iloc[:5],
+    )
+
+    # Log everything needed to reproduce
+    mlflow.log_params({"n_estimators": 100, "max_depth": 3})
+    predictions = model.predict(X_test)
+    mlflow.log_metrics({
+        "rmse": root_mean_squared_error(y_test, predictions),
+        "mae": mean_absolute_error(y_test, predictions),
+    })
+```
+
+---
+
+## Pattern 2: Preprocessing + model as a Pipeline
+
+Always log preprocessing alongside the model. See `GOTCHAS.md` #12 — inference-time preprocessing drift is the most painful post-registration bug.
+ +```python +from sklearn.pipeline import Pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.compose import ColumnTransformer + +numeric_features = ["turnover_lag_1", "turnover_lag_12", "rolling_3m_avg"] +categorical_features = ["state", "industry"] + +preprocessor = ColumnTransformer([ + ("num", StandardScaler(), numeric_features), + ("cat", "passthrough", categorical_features), # handle in the model if needed +]) + +pipeline = Pipeline([ + ("preprocessor", preprocessor), + ("model", GradientBoostingRegressor(n_estimators=100)), +]) + +with mlflow.start_run(): + pipeline.fit(X_train, y_train) + + signature = infer_signature(X_train, pipeline.predict(X_train[:5])) + mlflow.sklearn.log_model( + sk_model=pipeline, # logs both preprocessor AND model as one artifact + artifact_path="model", + signature=signature, + input_example=X_train.iloc[:5], + ) +``` + +At inference time, callers never need to know about `StandardScaler` — they pass raw features, `pyfunc.load_model` dispatches through the pipeline. + +--- + +## Pattern 3: XGBoost / PyTorch — same interface, different flavor + +```python +# XGBoost +import mlflow.xgboost +import xgboost as xgb + +model = xgb.XGBRegressor(n_estimators=100, max_depth=3) +model.fit(X_train, y_train) + +with mlflow.start_run(): + mlflow.xgboost.log_model( + xgb_model=model, + artifact_path="model", + signature=infer_signature(X_train, model.predict(X_train[:5])), + input_example=X_train.iloc[:5], + ) + +# PyTorch +import mlflow.pytorch +import torch + +class Forecaster(torch.nn.Module): + ... + +model = Forecaster() +# ... training loop ... 
+
+with mlflow.start_run():
+    # For PyTorch, input_example must be a tensor or numpy array —
+    # cast to float32 to match the model's default weight dtype
+    example = X_train.iloc[:5].to_numpy().astype("float32")
+    with torch.no_grad():
+        preds = model(torch.tensor(example)).numpy()
+    mlflow.pytorch.log_model(
+        pytorch_model=model,
+        artifact_path="model",
+        signature=infer_signature(example, preds),
+        input_example=example,
+    )
+```
+
+---
+
+## Pattern 4: Retraining — same experiment, new run
+
+Retraining for an A/B test or a scheduled refresh. Log to the same experiment; register as a new version in Workflow 2.
+
+```python
+with mlflow.start_run(run_name="gbr_v2_with_seasonality") as run:
+    model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
+    model.fit(X_train_with_seasonality, y_train)
+
+    mlflow.sklearn.log_model(
+        sk_model=model,
+        artifact_path="model",
+        signature=infer_signature(X_train_with_seasonality,
+                                  model.predict(X_train_with_seasonality[:5])),
+        input_example=X_train_with_seasonality.iloc[:5],
+    )
+    # Remember the run_id for the register step
+    print(f"New run: {run.info.run_id}")
+```
+
+---
+
+## Pattern 5: Autologging (quick path for iteration)
+
+Autologging wraps `fit()` and logs params + metrics + model automatically. Convenient during experimentation; less explicit than manual logging.
+
+```python
+mlflow.sklearn.autolog(
+    log_models=True,
+    log_input_examples=True,    # IMPORTANT — otherwise no input_example is captured
+    log_model_signatures=True,  # IMPORTANT — otherwise no signature is captured
+    silent=False,
+)
+
+# Any subsequent fit() call auto-logs
+model = GradientBoostingRegressor(n_estimators=100)
+model.fit(X_train, y_train)
+# Autolog handled the MLflow calls
+```
+
+**Caveat:** autologging infers signature + input_example heuristically. For production runs, prefer manual logging (Pattern 1) — you control what gets captured.
+
+---
+
+## Pattern 6: Searching runs to pick the best one for registration
+
+Before registering, you typically want the best run from an experiment:
+
+```python
+runs = mlflow.search_runs(
+    experiment_names=["/Users/me@company.com/forecasting"],
+    filter_string="metrics.rmse < 100 AND tags.mlflow.runName LIKE 'gbr_%'",
+    order_by=["metrics.rmse ASC"],
+    max_results=1,
+)
+
+if runs.empty:
+    raise RuntimeError("No runs match criteria")
+
+best_run_id = runs.iloc[0]["run_id"]
+best_rmse = runs.iloc[0]["metrics.rmse"]
+print(f"Best run: {best_run_id} (RMSE={best_rmse:.2f})")
+
+# Now register this run's model — see patterns-uc-registration.md Pattern 1
+```
+
+---
+
+## Common logging mistakes
+
+| Mistake | Effect | Fix |
+|---------|--------|-----|
+| No `signature` | `pyfunc.load_model` works, but `.predict()` may coerce input types incorrectly | Always call `infer_signature(X_train, y_hat[:5])` |
+| No `input_example` | `pyfunc.load_model` can't introspect input schema | Pass `X_train.iloc[:5]` (or `.to_numpy()[:5]` for non-pandas) |
+| `artifact_path` changes between logs | Same model name → different paths → broken load URIs | Always use `artifact_path="model"` |
+| Log preprocessing separately | Inference callers must reapply preprocessing manually | Wrap in a sklearn `Pipeline` and log the pipeline |
+| Use `pickle.dump` directly | Loses MLflow's flavor dispatch | Always use `mlflow.<flavor>.log_model` |
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md b/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md
new file mode 100644
index 00000000..4d8929ed
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md
@@ -0,0 +1,232 @@
+# patterns-uc-registration
+
+Register a logged model to Unity Catalog, set aliases, verify, and handle promotion / rollback.
+
+---
+
+## Pattern 1: Explicit register from a specific run
+
+Cleanest workflow.
Train (separate step) → pick best run → register. + +```python +import mlflow +from mlflow import MlflowClient + +mlflow.set_registry_uri("databricks-uc") + +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" + +# run_id from a specific training run (see patterns-training.md Pattern 6) +run_id = "abc123def456" + +result = mlflow.register_model( + model_uri=f"runs:/{run_id}/model", + name=MODEL_NAME, + tags={ + "trained_by": "forecasting_team", + "dataset_version": "2024-Q4", + }, +) +print(f"Registered {MODEL_NAME} version {result.version}") +``` + +`result` is a `ModelVersion` object: +- `result.name` — fully qualified three-level name +- `result.version` — the new version (string, e.g., `"3"`) +- `result.status` — should be `"READY"` by the time this returns + +--- + +## Pattern 2: Log-and-register in one call + +Shorter but couples logging and registration. Use when you *know* the current run is the one worth registering. + +```python +with mlflow.start_run(): + model.fit(X_train, y_train) + mlflow.sklearn.log_model( + sk_model=model, + artifact_path="model", + signature=infer_signature(X_train, model.predict(X_train[:5])), + input_example=X_train.iloc[:5], + registered_model_name="my_catalog.my_schema.grocery_forecaster", + ) + # Model is registered as a new version; you still need to set alias separately. +``` + +**Still need a separate alias call** — `log_model` doesn't set aliases. + +--- + +## Pattern 3: Set aliases (`@champion`, `@challenger`) + +Aliases decouple the loader from the version. Moving `@champion` to a new version silently updates every `models:/...@champion` loader. + +```python +from mlflow import MlflowClient +client = MlflowClient() + +# Set or move an alias +client.set_registered_model_alias( + name="my_catalog.my_schema.grocery_forecaster", + alias="champion", + version=result.version, +) +``` + +**Conventions:** +- `@champion` — the current production winner. Exactly one version at a time. 
- `@challenger` — a candidate under evaluation. Exactly one at a time.
+- Custom aliases — free-form, e.g., `@pair_team_07`, `@nightly`, `@reviewed`.
+
+**Read existing aliases:**
+```python
+model = client.get_registered_model("my_catalog.my_schema.grocery_forecaster")
+print(model.aliases)  # e.g., {"champion": "3", "challenger": "4"}
+```
+
+**Delete an alias:**
+```python
+client.delete_registered_model_alias(
+    name="my_catalog.my_schema.grocery_forecaster",
+    alias="challenger",
+)
+```
+
+---
+
+## Pattern 4: Verify registration (Navigator's V-step)
+
+Don't trust `register_model`'s success message alone. See `GOTCHAS.md` #5.
+
+### Via SQL
+
+```sql
+DESCRIBE MODEL my_catalog.my_schema.grocery_forecaster;
+```
+
+Expected output includes the model metadata and (if set) aliases. If the result is "table or view not found," the model didn't register to UC — check `set_registry_uri` (GOTCHAS #1).
+
+### Via Catalog Explorer UI
+
+1. Open Catalog Explorer
+2. Navigate to `my_catalog` → `my_schema` → **Models** tab
+3. Confirm `grocery_forecaster` appears with an `@champion` badge
+
+If the model appears under the workspace MLflow icon instead (left sidebar, under MLflow), you registered to the workspace registry. See GOTCHAS #1.
+
+### Via Python assertion (scriptable)
+
+```python
+from mlflow import MlflowClient
+client = MlflowClient()
+
+MODEL_NAME = "my_catalog.my_schema.grocery_forecaster"
+model = client.get_registered_model(MODEL_NAME)
+# latest_versions is stage-based and not populated for UC models —
+# enumerate versions via search_model_versions instead
+versions = client.search_model_versions(f"name='{MODEL_NAME}'")
+
+# Three assertions that should always hold post-registration
+assert model is not None, "Model not registered to UC"
+assert len(versions) > 0, "No versions exist"
+assert "champion" in model.aliases, "@champion alias not set"
+print(f"✓ {model.name} v{model.aliases['champion']} is @champion")
+```
+
+---
+
+## Pattern 5: A/B promotion — swap `@challenger` to `@champion`
+
+You've trained a new version, registered it, and validated its predictions against the current champion.
Now promote: + +```python +client = MlflowClient() +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" + +# Get current state +model = client.get_registered_model(MODEL_NAME) +old_champion = model.aliases.get("champion") +new_champion = model.aliases.get("challenger") + +if new_champion is None: + raise RuntimeError("No @challenger set — nothing to promote") + +# Move the alias (atomic — downstream loaders see the switch on next load) +client.set_registered_model_alias(MODEL_NAME, "champion", new_champion) + +# Optional: archive the old champion version with a custom alias +if old_champion: + client.set_registered_model_alias(MODEL_NAME, f"archived_{old_champion}", old_champion) + +# Remove the @challenger alias +client.delete_registered_model_alias(MODEL_NAME, "challenger") + +print(f"Promoted v{new_champion} from @challenger to @champion (was v{old_champion})") +``` + +**Rollback** is the inverse — move `@champion` back to the previous version. + +--- + +## Pattern 6: List all model versions + +Useful for lineage inspection or cleanup. + +```sql +SHOW MODEL VERSIONS ON MODEL my_catalog.my_schema.grocery_forecaster; +``` + +Or via Python: +```python +from mlflow import MlflowClient +client = MlflowClient() + +versions = client.search_model_versions( + filter_string=f"name='my_catalog.my_schema.grocery_forecaster'", + order_by=["version_number DESC"], +) +for v in versions: + print(f"v{v.version}: run_id={v.run_id}, status={v.status}, aliases={v.aliases}") +``` + +--- + +## Pattern 7: Tags — richer metadata without new versions + +Tags are key-value metadata on the registered model (or a specific version). 
Useful for:
+- Team ownership: `set_model_version_tag(name, "1", "team", "forecasting")`
+- Dataset provenance: `set_model_version_tag(name, "1", "dataset_version", "2024-Q4")`
+- Review status: `set_model_version_tag(name, "1", "reviewed", "true")`
+
+```python
+from mlflow import MlflowClient
+client = MlflowClient()
+
+# Tag on the registered model (applies to all versions)
+client.set_registered_model_tag(
+    name="my_catalog.my_schema.grocery_forecaster",
+    key="domain",
+    value="retail",
+)
+
+# Tag on a specific version
+client.set_model_version_tag(
+    name="my_catalog.my_schema.grocery_forecaster",
+    version="3",
+    key="reviewed_by",
+    value="jane@company.com",
+)
+```
+
+Tags are queryable via `search_model_versions(filter_string="tags.reviewed = 'true'")`.
+
+---
+
+## Permission requirements
+
+| Operation | Permission needed | Granted via |
+|-----------|-------------------|-------------|
+| `register_model` (first version of a model) | `CREATE MODEL ON SCHEMA <schema>` | `GRANT CREATE MODEL ON SCHEMA ... TO <principal>` |
+| `register_model` (new version of existing) | `EDIT ON MODEL <model>` | Automatic for model owner; otherwise grant |
+| `set_registered_model_alias` | `EDIT ON MODEL <model>` | Same as above |
+| `get_registered_model` / `DESCRIBE MODEL` | `USE CATALOG` + `USE SCHEMA` + `EXECUTE ON MODEL` | Standard read grants |
+| `load_model` | `EXECUTE ON MODEL <model>` | `GRANT EXECUTE ON MODEL ... TO <principal>` |
+
+If any of these fail, request the specific grant from the schema owner. See `GOTCHAS.md` #7.
diff --git a/databricks-skills/databricks-mlflow-ml/references/user-journeys.md b/databricks-skills/databricks-mlflow-ml/references/user-journeys.md
new file mode 100644
index 00000000..a72f9106
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/user-journeys.md
@@ -0,0 +1,195 @@
+# user-journeys
+
+End-to-end workflows with decision points. Read the journey that matches your situation.
+
+---
+
+## Journey 1: First model (train → register → score) — the 90%-case
+
+Most users arrive here. Goal: a UC-registered model with a `@champion` alias, producing batch predictions.
+
+**Prerequisites:**
+- UC catalog + schema where you have `CREATE MODEL` permission
+- A UC volume for MLflow artifacts (create if missing — `patterns-experiment-setup.md` Pattern 2)
+- Features in a Spark table (Bronze → Silver → Gold already done)
+
+**Steps:**
+
+1. **Set up the experiment** (`patterns-experiment-setup.md` Pattern 1)
+   - `mlflow.set_registry_uri("databricks-uc")`
+   - `mlflow.create_experiment(name=..., artifact_location=<uc-volume-path>)` (first run only), then `mlflow.set_experiment(...)`
+2. **Train + log** (`patterns-training.md` Pattern 1 or 2)
+   - Always include `signature` and `input_example`
+   - If you have preprocessing, wrap in `sklearn.Pipeline` (Pattern 2)
+3. **Register** (`patterns-uc-registration.md` Pattern 1)
+   - `mlflow.register_model(f"runs:/{run_id}/model", "catalog.schema.model")`
+4. **Set alias** (`patterns-uc-registration.md` Pattern 3)
+   - `client.set_registered_model_alias(name, "champion", version)`
+5. **Verify** (`patterns-uc-registration.md` Pattern 4)
+   - `DESCRIBE MODEL catalog.schema.model` OR Catalog Explorer UI
+6. **Load + score** (`patterns-batch-inference.md` Pattern 1 or 2)
+   - `model = mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`
+   - `model.predict(features_df)`
+
+**Done.** You have a UC-registered model with a canonical loading URI that downstream code can depend on.
+
+---
+
+## Journey 2: Retrain + promote (A/B)
+
+You already have `@champion`. You trained a new version and want to decide whether to promote it.
+
+**Prerequisites:**
+- Model exists in UC with `@champion` set (you did Journey 1)
+- New training run logged to the same experiment
+
+**Steps:**
+
+1. **Register new version** (`patterns-uc-registration.md` Pattern 1)
+   - Same `MODEL_NAME` as before — UC auto-increments version
+2. **Set `@challenger`** (`patterns-uc-registration.md` Pattern 3)
+   - `client.set_registered_model_alias(name, "challenger", new_version)`
+3. **A/B validate** (`patterns-batch-inference.md` Pattern 5)
+   - Load both aliases, score validation set, compare metrics
+4. **Decide**:
+   - Challenger wins → **Pattern 5 in `patterns-uc-registration.md`**: swap aliases
+   - Champion wins → delete `@challenger` alias, keep current `@champion`
+5. **Verify** downstream loaders picked up the new version (after swap)
+   - Any code using `models:/<catalog>.<schema>.<model>@champion` will see the new version on next load
+
+---
+
+## Journey 3: Lakeflow SDP batch pipeline
+
+You want predictions to land in a scheduled gold table, not an ad-hoc notebook.
+
+**Prerequisites:**
+- Model registered with `@champion` (Journey 1 complete)
+- Lakeflow SDP pipeline defined (one already running is ideal)
+
+**Steps:**
+
+1. **Add a new file** to the pipeline source: `src/gold/gold_forecast.py`
+2. **Construct the UDF at module scope** (`patterns-batch-inference.md` Pattern 3)
+   - `mlflow.set_registry_uri("databricks-uc")`
+   - `predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/...@champion", result_type="double")`
+3. **Define the `@dp.materialized_view`** that reads silver features, applies the UDF
+4. **Deploy + run** the pipeline
+   - `databricks bundle deploy && databricks bundle run <resource_key>`
+5. **Verify** the `gold_forecast` table materializes
+   - Row count matches `silver_features`
+   - Query from Genie or SQL editor
+
+**Do NOT use `ai_query`** in this pipeline — see `GOTCHAS.md` #9.
+
+---
+
+## Journey 4: Debug a registration that went to workspace registry
+
+The #1 support question. Symptoms: model doesn't appear in Catalog Explorer; URL contains `/ml/models/` instead of `/explore/data/models/`.
+
+**Steps:**
+
+1. Confirm the diagnosis:
+   - Catalog Explorer → catalog → schema → Models tab: **missing**
+   - MLflow icon (left sidebar) → Models: **present**
+   - That's the workspace registry, not UC
+2. Verify the registry URI in the training session
+   - `mlflow.get_registry_uri()` — should return `"databricks-uc"`, not a workspace URI
+3. If the URI was wrong, fix it and re-register:
+   - Add `mlflow.set_registry_uri("databricks-uc")` at the top of the training code
+   - Re-run `mlflow.register_model(...)` — this creates a new entry in UC
+   - The orphaned workspace-registry entry can be deleted via MLflow UI (optional)
+4. Set the `@champion` alias on the new UC version
+5. Verify via `DESCRIBE MODEL` — see `patterns-uc-registration.md` Pattern 4
+
+---
+
+## Journey 5: Debug a `pyfunc.load_model` that fails or predicts wrong
+
+Model loaded successfully, but `.predict()` raises or produces nonsense.
+
+**Steps:**
+
+1. **Check the signature was logged:**
+   ```python
+   from mlflow.models import get_model_info
+   info = get_model_info("models:/<catalog>.<schema>.<model>@champion")
+   print(info.signature)
+   ```
+   If `None` — see `GOTCHAS.md` #8. Re-log the model with `signature=infer_signature(...)`.
+
+2. **Check the input column order:**
+   ```python
+   expected = model.metadata.get_input_schema().input_names()
+   print(f"Model expects: {expected}")
+   print(f"You passed: {list(features_df.columns)}")
+   ```
+   If the order differs, pass `features_df[expected]`.
+
+3. **Check preprocessing coverage:**
+   - Does the training notebook call a scaler / encoder / imputer before fitting?
+   - Is that preprocessing in the logged artifact?
+   - If not — see `GOTCHAS.md` #12. Re-train with preprocessing wrapped in `sklearn.Pipeline`.
+
+4. **Check for type coercion:**
+   - Integer column becoming float (or vice versa) — fine for sklearn, sometimes breaks for xgboost/pytorch
+   - Categorical as string vs int — depends on the flavor
+   - Fix: cast `features_df` to match `model.metadata.get_input_schema()` dtypes before predicting
+
+---
+
+## Journey 6: Schema evolution — your features changed since the model was logged
+
+The silver features pipeline added a new column. Your deployed `@champion` model was trained without it. Predictions still work (extra columns are ignored), but you want to include the new feature.
+
+**Steps:**
+
+1. Retrain with the new feature:
+   ```python
+   # Same Journey 1 steps, but with expanded feature set
+   mlflow.sklearn.log_model(
+       sk_model=new_pipeline,
+       artifact_path="model",
+       signature=infer_signature(X_train_expanded, new_pipeline.predict(X_train_expanded[:5])),
+       input_example=X_train_expanded.iloc[:5],
+   )
+   ```
+2. Register as a new version
+3. Validate via A/B (Journey 2)
+4. Promote to `@champion`
+
+Schema changes are always a new version. Never mutate a logged model in place.
+
+---
+
+## Journey 7: "Everything is on fire, I have 10 minutes to demo"
+
+Someone registered a fallback model. Load it.
+
+```python
+import mlflow
+mlflow.set_registry_uri("databricks-uc")
+model = mlflow.pyfunc.load_model(
+    "models:/<catalog>.<schema>.<model>@fallback"
+)
+features = spark.table("<catalog>.<schema>.sample_features").limit(500).toPandas()
+features["prediction"] = model.predict(features)
+display(spark.createDataFrame(features))
+```
+
+Every escape-hatch pattern should pre-register a `@fallback` version for exactly this case.
+
+---
+
+## When to use which journey
+
+| Situation | Journey |
+|-----------|---------|
+| I'm starting from zero | 1 |
+| I have `@champion`, trained something new | 2 |
+| I want predictions in a scheduled table | 3 |
+| Registered but can't find in Catalog Explorer | 4 |
+| `load_model` succeeds but `predict` fails | 5 |
+| My features changed | 6 |
+| Demo in 10 minutes, nothing works | 7 |
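Journey 2's decide step reduces to one comparison, captured here as a sketch — the 2% margin mirrors the illustrative threshold used in `patterns-batch-inference.md` Pattern 5, not a fixed rule:

```python
def should_promote(champion_rmse: float, challenger_rmse: float,
                   min_improvement: float = 0.02) -> bool:
    """Promote only if the challenger beats the champion by more than the
    margin (lower RMSE is better)."""
    return challenger_rmse < champion_rmse * (1.0 - min_improvement)

print(should_promote(100.0, 97.0))  # True — 3% better, promote
print(should_promote(100.0, 99.0))  # False — only 1% better, keep champion
```

Requiring a margin (rather than any improvement) avoids alias churn from noise-level metric differences between runs.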