# feat(databricks-skills): add databricks-mlflow-ml skill for classic ML #474

dgokeeffe wants to merge 3 commits into `databricks-solutions:main`
Fills the gap between `databricks-mlflow-evaluation` (GenAI agent eval) and `databricks-model-serving` (real-time endpoints). Covers:

- Classic ML model training with MLflow tracking (sklearn / XGBoost / PyTorch)
- Experiment creation with UC volume `artifact_location` (required in UC-enforced workspaces)
- Unity Catalog model registration with three-level names
- `@champion` / `@challenger` alias management
- Batch inference via `mlflow.pyfunc.load_model` (notebook, up to ~10k rows)
- Distributed batch via `mlflow.pyfunc.spark_udf` in Lakeflow SDP pipelines

Structure mirrors `databricks-mlflow-evaluation`:

- `SKILL.md`: workflows + trigger description + quick start
- `references/GOTCHAS.md`: 12 common mistakes with symptoms + fixes
- `references/CRITICAL-interfaces.md`: exact API signatures + `models:/` URI format
- `references/patterns-experiment-setup.md`: UC volume `artifact_location` setup
- `references/patterns-training.md`: logging with `signature` + `input_example`
- `references/patterns-uc-registration.md`: register + alias + verify + A/B
- `references/patterns-batch-inference.md`: `pyfunc.load_model` + `spark_udf` + `ai_query` anti-pattern
- `references/user-journeys.md`: 7 end-to-end workflows including debugging

Key gotchas covered that other MLflow guides miss:

- Experiment creation now requires a UC volume `artifact_location` in UC-enforced workspaces (DBFS root writes are rejected)
- `mlflow.set_registry_uri('databricks-uc')` is required; the silent workspace-registry fallback is the #1 support question
- `ai_query` does NOT work on custom UC-registered models unless they're deployed to a serving endpoint; use `pyfunc.load_model` or `spark_udf` instead
- UC aliases (`@champion`/`@challenger`) replace deprecated stage transitions (`transition_model_version_stage` is a no-op on UC models)
- `mlflow.pyfunc.spark_udf` must be constructed at module scope in Lakeflow SDP pipelines, not inside the function body

Tested against MLflow 2.16+ on Databricks Runtime 15.4 LTS.
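The core Train → Register loop the skill documents can be sketched roughly as below. This is an illustrative sketch, not code from the skill: the catalog/schema/model name (`main.forecasting.demand`), the experiment path, and the estimator are hypothetical placeholders, and it assumes the MLflow 2.16+ `name=` keyword.

```python
def train_and_register(X, y):
    """Sketch: train -> log -> UC-register -> alias. Names are placeholders."""
    import mlflow
    from mlflow.models import infer_signature
    from sklearn.ensemble import GradientBoostingRegressor

    # Without this, register_model silently targets the legacy workspace registry.
    mlflow.set_registry_uri("databricks-uc")

    # Experiment must already exist with a UC volume artifact_location
    # in UC-enforced workspaces (DBFS root writes are rejected).
    mlflow.set_experiment("/Shared/demand-forecast")

    with mlflow.start_run():
        model = GradientBoostingRegressor().fit(X, y)
        info = mlflow.sklearn.log_model(
            model,
            name="model",  # MLflow 2.16+; `artifact_path=` is deprecated
            signature=infer_signature(X, model.predict(X)),
            input_example=X[:5],
        )

    # Three-level UC name, then point the @champion alias at the new version.
    mv = mlflow.register_model(info.model_uri, "main.forecasting.demand")
    mlflow.MlflowClient().set_registered_model_alias(
        "main.forecasting.demand", "champion", mv.version
    )
    return mv
```

Scoring then loads `models:/main.forecasting.demand@champion` via `mlflow.pyfunc.load_model` in a notebook, or `mlflow.pyfunc.spark_udf` in a pipeline.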
Content battle-tested in the Coles Vibe Workshop (classic-ML track running in an airgapped environment where online MLflow docs aren't reachable).
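One gotcha from the skill is worth showing in code: in Lakeflow SDP pipelines, `mlflow.pyfunc.spark_udf` must be built outside the view function so the model deserializes once. The sketch below is illustrative only; the pipeline, table, and model names are hypothetical, and `dp` stands for the Lakeflow declarative-pipelines module.

```python
def register_views(spark, dp):
    """Illustrative Lakeflow SDP snippet (names hypothetical). The point:
    build the pyfunc UDF once, outside the view body, so the model is
    deserialized once instead of on every pipeline evaluation."""
    import mlflow

    # Correct: UDF constructed at pipeline/module scope.
    predict = mlflow.pyfunc.spark_udf(
        spark,
        "models:/main.forecasting.demand@champion",
        result_type="double",
    )

    @dp.materialized_view
    def demand_predictions():
        features = spark.read.table("main.forecasting.features")
        # Anti-pattern would be calling mlflow.pyfunc.spark_udf() here,
        # inside the view body, repeating deserialization per evaluation.
        return features.withColumn("prediction", predict(*features.columns))

    return demand_predictions
```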
Field-tested the skill end-to-end from a local Python environment against a live Databricks workspace. Surfaced two gotchas not in the original set:

- #12 `mlflow[databricks]` extras missing when running outside Databricks: plain `pip install mlflow` omits the azure-core / boto3 / google.cloud SDKs that UC registration needs to stage artifacts. Training + `log_model` work; `register_model` fails with an opaque "No module named 'azure'". Databricks clusters ship the extras pre-installed, so this only bites laptops / CI.
- #13 `artifact_path=` deprecated in favour of `name=` (MLflow 2.16+): emits a warning on every `log_model` call. Non-blocking, but worth flagging since most online tutorials + training courses still use the old param.

Both verified against the workshop's test run — skill workflow 1 now completes cleanly with these fixes documented.
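The fix for the first gotcha is a one-line install change when working outside Databricks; the extras bundle pulls in the cloud-storage SDKs that artifact staging needs:

```shell
# On a laptop / in CI (Databricks clusters already ship these extras):
pip install 'mlflow[databricks]'
# Plain `pip install mlflow` trains and logs fine, but UC register_model
# then fails with an opaque "No module named 'azure'" (or the boto3 /
# google.cloud equivalent) because the staging SDKs are missing.
```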
Original SKILL.md didn't state a runtime target. Adds a "Runtime compatibility" section anchored on what the skill was actually tested against — MLflow 3.11 on Lakeflow SDP serverless compute v5 — with a compat note for MLflow 2.16+ (classic DBR 15.4 LTS still ships 2.x). Points at GOTCHAS.md for the 3.x-vs-2.x divergence (artifact_path deprecation, etc.).
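Since the section targets MLflow 3.x while staying compatible with 2.16+, a small helper along these lines (hypothetical, not part of the skill) can pick the right `log_model` keyword, assuming the 2.16 cutoff for the `artifact_path=` → `name=` deprecation that GOTCHAS.md describes:

```python
def log_model_kwargs(mlflow_version: str) -> dict:
    """Return the log_model keyword for the installed MLflow version.
    Hypothetical helper; assumes `name=` supersedes `artifact_path=`
    from MLflow 2.16 onward, per the skill's GOTCHAS.md."""
    major, minor = (int(p) for p in mlflow_version.split(".")[:2])
    if (major, minor) >= (2, 16):
        return {"name": "model"}        # modern keyword, no warning
    return {"artifact_path": "model"}   # pre-2.16 releases

# e.g. mlflow.sklearn.log_model(model, **log_model_kwargs(mlflow.__version__))
```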
Do the official MLflow skills we install not cover this gap? cc: @jacksandom
@dustinvannoy-db I checked; the UC-specific stuff is what this PR covers: UC-enforced workspaces rejecting DBFS artifact roots, the legacy stage-transition API silently no-oping on UC models.
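The stage-transition point is the easiest to trip on, since the legacy call succeeds without error; the UC-era replacement is alias reassignment. A minimal sketch, with a hypothetical model name:

```python
def promote(name: str, version: int, alias: str = "champion") -> None:
    """UC replacement for the legacy transition_model_version_stage(),
    which is a silent no-op on UC-registered models. Reassigning the
    alias repoints every consumer that loads models:/<name>@<alias>."""
    from mlflow import MlflowClient

    client = MlflowClient()
    client.set_registered_model_alias(name, alias, version)

# e.g. promote("main.forecasting.demand", 7)  # @champion now -> version 7
```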
## Why
The existing MLflow-related skills leave a gap for classic ML practitioners:
- `databricks-mlflow-evaluation` — GenAI agent eval (`mlflow.genai.evaluate`, scorers, judges)
- `databricks-model-serving` — real-time endpoints
- `databricks-unity-catalog`
- `databricks-mlflow-ml` (this PR) — classic ML training, UC registration, batch inference

A data scientist training a forecasting model, registering it to Unity Catalog, and scoring predictions in a notebook or Lakeflow pipeline has no skill to trigger on. This PR fills that gap.
## What's in the skill
`SKILL.md` — workflow index (Train → Register → Score, Retrain + Promote A/B, Debugging), quick-start, runtime compatibility note, and trigger description.
7 reference files:
- `GOTCHAS.md` — 14 common mistakes with symptoms + fixes
- `CRITICAL-interfaces.md` — exact API signatures + the `models:/catalog.schema.model@alias` URI format
- `patterns-experiment-setup.md` — UC volume `artifact_location` (required in UC-enforced workspaces)
- `patterns-training.md` — logging with `signature` + `input_example`, `sklearn.Pipeline` wrapping, autologging
- `patterns-uc-registration.md` — three-level names, `@champion`/`@challenger` aliases, verification via `DESCRIBE MODEL`, A/B promotion
- `patterns-batch-inference.md` — notebook `pyfunc.load_model` (Tier 1), Lakeflow SDP `pyfunc.spark_udf` (Tier 2), champion-vs-challenger validation, explicit warning against `ai_query` on custom UC models
- `user-journeys.md` — 7 end-to-end workflows including debugging scenarios

## Key gotchas this skill teaches that other guides miss
- UC volume `artifact_location` on experiment creation — DBFS root is rejected in UC-enforced workspaces. Every `log_model` call fails with opaque errors until `artifact_location` points at a UC volume.
- `mlflow.set_registry_uri('databricks-uc')` — without this, `register_model` silently routes to the legacy workspace registry. The #1 "my model isn't showing up in Catalog Explorer" support question.
- `ai_query` on custom UC models — doesn't work. Requires a serving endpoint. The correct primitive is `mlflow.pyfunc.load_model` (notebook) or `mlflow.pyfunc.spark_udf` (Lakeflow).
- `@champion`/`@challenger` aliases — replace the deprecated `transition_model_version_stage()` stages. The legacy API still exists but is a no-op on UC-registered models (no error, no effect).
- `mlflow.pyfunc.spark_udf` in Lakeflow SDP — must be constructed at module scope, not inside `@dp.materialized_view`. Otherwise deserialization repeats on every pipeline evaluation.
- `pip install 'mlflow[databricks]'` — required for UC registration outside Databricks clusters. Plain `pip install mlflow` omits the cloud-storage SDKs (azure-core / boto3 / google.cloud) MLflow needs to stage UC artifacts. Clusters ship the extras pre-installed.

## Testing
Field-tested end-to-end against a live Databricks workspace:
- `GradientBoostingRegressor` training run
- `@champion` alias — verified in Catalog Explorer UI
- `mlflow.pyfunc.load_model` — predictions within ~2% of actuals
- Surfaced two new gotchas (`mlflow[databricks]` install + `artifact_path` deprecation) and added them to GOTCHAS.md

Runtime verified: MLflow 3.11 on Lakeflow SDP serverless compute v5 (current default). Patterns compatible with MLflow 2.16+ — pairs on older classic DBRs still get correct behaviour. 2.x/3.x divergences called out in GOTCHAS.md (e.g., `artifact_path` → `name=`).

## Structure parity
File layout matches `databricks-mlflow-evaluation` (same `SKILL.md` + `references/` + `GOTCHAS.md` + `CRITICAL-interfaces.md` + `patterns-*.md` convention). Installable via the existing `install_skills.sh`.

## Not in scope
- Real-time serving endpoints (`databricks-model-serving` covers that)
- GenAI agent evaluation (`databricks-mlflow-evaluation` covers that)
- Unity Catalog setup and permissions (`databricks-unity-catalog` covers those)

Deliberately narrow — classic ML + UC registration + batch inference only.
## Origin
Built to fill a gap encountered during the Coles Vibe Workshop (airgapped Databricks field-engineer hackathon). DS pairs needed UC-scoped MLflow guidance that wasn't covered by any existing skill. Content battle-tested in the workshop before being contributed upstream.