# feat(databricks-skills): add databricks-mlflow-ml skill for classic ML #474

dgokeeffe wants to merge 3 commits into `databricks-solutions:main`
Fills the gap between `databricks-mlflow-evaluation` (GenAI agent eval) and `databricks-model-serving` (real-time endpoints). Covers:

- Classic ML model training with MLflow tracking (sklearn / XGBoost / PyTorch)
- Experiment creation with UC volume `artifact_location` (required in UC-enforced workspaces)
- Unity Catalog model registration with three-level names
- `@champion` / `@challenger` alias management
- Batch inference via `mlflow.pyfunc.load_model` (notebook, up to ~10k rows)
- Distributed batch via `mlflow.pyfunc.spark_udf` in Lakeflow SDP pipelines

Structure mirrors `databricks-mlflow-evaluation`:

- `SKILL.md`: workflows + trigger description + quick start
- `references/GOTCHAS.md`: 12 common mistakes with symptoms + fixes
- `references/CRITICAL-interfaces.md`: exact API signatures + `models:/` URI format
- `references/patterns-experiment-setup.md`: UC volume `artifact_location` setup
- `references/patterns-training.md`: logging with `signature` + `input_example`
- `references/patterns-uc-registration.md`: register + alias + verify + A/B
- `references/patterns-batch-inference.md`: `pyfunc.load_model` + `spark_udf` + `ai_query` anti-pattern
- `references/user-journeys.md`: 7 end-to-end workflows including debugging

Key gotchas covered that other MLflow guides miss:

- Experiment creation now requires a UC volume `artifact_location` in UC-enforced workspaces (DBFS root writes are rejected)
- `mlflow.set_registry_uri('databricks-uc')` is required; the silent workspace-registry fallback is the #1 support question
- `ai_query` does NOT work on custom UC-registered models unless they're deployed to a serving endpoint; use `pyfunc.load_model` or `spark_udf` instead
- UC aliases (`@champion`/`@challenger`) replace deprecated stage transitions (`transition_model_version_stage` is a no-op on UC models)
- `mlflow.pyfunc.spark_udf` must be constructed at module scope in Lakeflow SDP pipelines, not inside the function body

Tested against MLflow 2.16+ on Databricks Runtime 15.4 LTS.
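The core Train → Register loop the skill documents can be sketched roughly as below. This is an illustrative sketch, not code from the skill: the catalog/schema/model name (`main.forecasting.demand`), the experiment path, and the estimator are hypothetical placeholders, and it assumes the MLflow 2.16+ `name=` keyword.

```python
def train_and_register(X, y):
    """Sketch: train -> log -> UC-register -> alias. Names are placeholders."""
    import mlflow
    from mlflow.models import infer_signature
    from sklearn.ensemble import GradientBoostingRegressor

    # Without this, register_model silently targets the legacy workspace registry.
    mlflow.set_registry_uri("databricks-uc")

    # Experiment must already exist with a UC volume artifact_location
    # in UC-enforced workspaces (DBFS root writes are rejected).
    mlflow.set_experiment("/Shared/demand-forecast")

    with mlflow.start_run():
        model = GradientBoostingRegressor().fit(X, y)
        info = mlflow.sklearn.log_model(
            model,
            name="model",  # MLflow 2.16+; `artifact_path=` is deprecated
            signature=infer_signature(X, model.predict(X)),
            input_example=X[:5],
        )

    # Three-level UC name, then point the @champion alias at the new version.
    mv = mlflow.register_model(info.model_uri, "main.forecasting.demand")
    mlflow.MlflowClient().set_registered_model_alias(
        "main.forecasting.demand", "champion", mv.version
    )
    return mv
```

Scoring then loads `models:/main.forecasting.demand@champion` via `mlflow.pyfunc.load_model` in a notebook, or `mlflow.pyfunc.spark_udf` in a pipeline.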
Content battle-tested in the Coles Vibe Workshop (classic-ML track running in an airgapped environment where online MLflow docs aren't reachable).
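One gotcha from the skill is worth showing in code: in Lakeflow SDP pipelines, `mlflow.pyfunc.spark_udf` must be built outside the view function so the model deserializes once. The sketch below is illustrative only; the pipeline, table, and model names are hypothetical, and `dp` stands for the Lakeflow declarative-pipelines module.

```python
def register_views(spark, dp):
    """Illustrative Lakeflow SDP snippet (names hypothetical). The point:
    build the pyfunc UDF once, outside the view body, so the model is
    deserialized once instead of on every pipeline evaluation."""
    import mlflow

    # Correct: UDF constructed at pipeline/module scope.
    predict = mlflow.pyfunc.spark_udf(
        spark,
        "models:/main.forecasting.demand@champion",
        result_type="double",
    )

    @dp.materialized_view
    def demand_predictions():
        features = spark.read.table("main.forecasting.features")
        # Anti-pattern would be calling mlflow.pyfunc.spark_udf() here,
        # inside the view body, repeating deserialization per evaluation.
        return features.withColumn("prediction", predict(*features.columns))

    return demand_predictions
```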
Field-tested the skill end-to-end from a local Python environment against a live Databricks workspace. Surfaced two gotchas not in the original set:

- #12 `mlflow[databricks]` extras missing when running outside Databricks: plain `pip install mlflow` omits the azure-core / boto3 / google.cloud SDKs that UC registration needs to stage artifacts. Training + `log_model` work; `register_model` fails with an opaque "No module named 'azure'". Databricks clusters ship the extras pre-installed, so this only bites laptops / CI.
- #13 `artifact_path=` deprecated in favour of `name=` (MLflow 2.16+): emits a warning on every `log_model` call. Non-blocking, but worth flagging since most online tutorials + training courses still use the old param.

Both verified against the workshop's test run — skill workflow 1 now completes cleanly with these fixes documented.
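The fix for the first gotcha is a one-line install change when working outside Databricks; the extras bundle pulls in the cloud-storage SDKs that artifact staging needs:

```shell
# On a laptop / in CI (Databricks clusters already ship these extras):
pip install 'mlflow[databricks]'
# Plain `pip install mlflow` trains and logs fine, but UC register_model
# then fails with an opaque "No module named 'azure'" (or the boto3 /
# google.cloud equivalent) because the staging SDKs are missing.
```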
Original SKILL.md didn't state a runtime target. Adds a "Runtime compatibility" section anchored on what the skill was actually tested against — MLflow 3.11 on Lakeflow SDP serverless compute v5 — with a compat note for MLflow 2.16+ (classic DBR 15.4 LTS still ships 2.x). Points at GOTCHAS.md for the 3.x-vs-2.x divergence (artifact_path deprecation, etc.).
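Since the section targets MLflow 3.x while staying compatible with 2.16+, a small helper along these lines (hypothetical, not part of the skill) can pick the right `log_model` keyword, assuming the 2.16 cutoff for the `artifact_path=` → `name=` deprecation that GOTCHAS.md describes:

```python
def log_model_kwargs(mlflow_version: str) -> dict:
    """Return the log_model keyword for the installed MLflow version.
    Hypothetical helper; assumes `name=` supersedes `artifact_path=`
    from MLflow 2.16 onward, per the skill's GOTCHAS.md."""
    major, minor = (int(p) for p in mlflow_version.split(".")[:2])
    if (major, minor) >= (2, 16):
        return {"name": "model"}        # modern keyword, no warning
    return {"artifact_path": "model"}   # pre-2.16 releases

# e.g. mlflow.sklearn.log_model(model, **log_model_kwargs(mlflow.__version__))
```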
Do the official MLflow skills we install not cover this gap? cc: @jacksandom
@dustinvannoy-db I checked; the UC-specific stuff is what this PR covers: UC-enforced workspaces rejecting DBFS artifact roots, the legacy stage-transition API silently no-oping on UC models.
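The stage-transition point is the easiest to trip on, since the legacy call succeeds without error; the UC-era replacement is alias reassignment. A minimal sketch, with a hypothetical model name:

```python
def promote(name: str, version: int, alias: str = "champion") -> None:
    """UC replacement for the legacy transition_model_version_stage(),
    which is a silent no-op on UC-registered models. Reassigning the
    alias repoints every consumer that loads models:/<name>@<alias>."""
    from mlflow import MlflowClient

    client = MlflowClient()
    client.set_registered_model_alias(name, alias, version)

# e.g. promote("main.forecasting.demand", 7)  # @champion now -> version 7
```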
## Why
The existing MLflow-related skills leave a gap for classic ML practitioners:
- `databricks-mlflow-evaluation` — GenAI agent eval (`mlflow.genai.evaluate`, scorers, judges)
- `databricks-model-serving` — real-time endpoints
- `databricks-unity-catalog`
- `databricks-mlflow-ml` (this PR) — classic ML training, UC registration, batch inference

A data scientist training a forecasting model, registering it to Unity Catalog, and scoring predictions in a notebook or Lakeflow pipeline has no skill to trigger on. This PR fills that gap.
## What's in the skill
`SKILL.md` — workflow index (Train → Register → Score, Retrain + Promote A/B, Debugging), quick-start, runtime compatibility note, and trigger description.
7 reference files:
- `GOTCHAS.md` — 14 common mistakes with symptoms + fixes
- `CRITICAL-interfaces.md` — exact API signatures + the `models:/catalog.schema.model@alias` URI format
- `patterns-experiment-setup.md` — UC volume `artifact_location` (required in UC-enforced workspaces)
- `patterns-training.md` — logging with `signature` + `input_example`, `sklearn.Pipeline` wrapping, autologging
- `patterns-uc-registration.md` — three-level names, `@champion`/`@challenger` aliases, verification via `DESCRIBE MODEL`, A/B promotion
- `patterns-batch-inference.md` — notebook `pyfunc.load_model` (Tier 1), Lakeflow SDP `pyfunc.spark_udf` (Tier 2), champion-vs-challenger validation, explicit warning against `ai_query` on custom UC models
- `user-journeys.md` — 7 end-to-end workflows including debugging scenarios

## Key gotchas this skill teaches that other guides miss
- UC volume `artifact_location` on experiment creation — DBFS root is rejected in UC-enforced workspaces. Every `log_model` call fails with opaque errors until `artifact_location` points at a UC volume.
- `mlflow.set_registry_uri('databricks-uc')` — without this, `register_model` silently routes to the legacy workspace registry. The #1 "my model isn't showing up in Catalog Explorer" support question.
- `ai_query` on custom UC models — doesn't work. Requires a serving endpoint. The correct primitive is `mlflow.pyfunc.load_model` (notebook) or `mlflow.pyfunc.spark_udf` (Lakeflow).
- `@champion`/`@challenger` aliases — replace the deprecated `transition_model_version_stage()` stages. The legacy API still exists but is a no-op on UC-registered models (no error, no effect).
- `mlflow.pyfunc.spark_udf` in Lakeflow SDP — must be constructed at module scope, not inside `@dp.materialized_view`. Otherwise deserialization repeats on every pipeline evaluation.
- `pip install 'mlflow[databricks]'` — required for UC registration outside Databricks clusters. Plain `pip install mlflow` omits the cloud-storage SDKs (azure-core / boto3 / google.cloud) MLflow needs to stage UC artifacts. Clusters ship the extras pre-installed.

## Testing
Field-tested end-to-end against a live Databricks workspace:
- `GradientBoostingRegressor` training run
- `@champion` alias — verified in Catalog Explorer UI
- `mlflow.pyfunc.load_model` — predictions within ~2% of actuals
- Surfaced two new gotchas (`mlflow[databricks]` install + `artifact_path` deprecation) and added them to GOTCHAS.md

Runtime verified: MLflow 3.11 on Lakeflow SDP serverless compute v5 (current default). Patterns compatible with MLflow 2.16+ — pairs on older classic DBRs still get correct behaviour. 2.x/3.x divergences called out in GOTCHAS.md (e.g., `artifact_path` → `name=`).

## Structure parity
File layout matches `databricks-mlflow-evaluation` (same `SKILL.md` + `references/` + `GOTCHAS.md` + `CRITICAL-interfaces.md` + `patterns-*.md` convention). Installable via the existing `install_skills.sh`.

## Not in scope
- Real-time serving endpoints (`databricks-model-serving` covers that)
- GenAI agent evaluation (`databricks-mlflow-evaluation` covers that)
- Unity Catalog setup and permissions (`databricks-unity-catalog` covers those)

Deliberately narrow — classic ML + UC registration + batch inference only.
## Origin
Built to fill a gap encountered during the Coles Vibe Workshop (airgapped Databricks field-engineer hackathon). DS pairs needed UC-scoped MLflow guidance that wasn't covered by any existing skill. Content battle-tested in the workshop before being contributed upstream.