Add GenieSpaceBuilder + authoring walkthrough to databricks-genie skill #495

Open
KabeerThockchom wants to merge 2 commits into databricks-solutions:main from KabeerThockchom:feature/genie-space-builder

@KabeerThockchom

Summary

Adds a typed authoring API for Genie Space serialized_space payloads plus a skill-level walkthrough covering the full authoring pipeline.

  • databricks-mcp-server/databricks_mcp_server/tools/genie_space_builder.py (new) — GenieSpaceBuilder class with path constants and add_* / replace_* / find_by_id / to_json / from_json helpers for every serialized_space slot: tables, metric_views, column_configs, sample_questions, text_instructions, example_question_sqls, join_specs, sql_snippets (filters/expressions/measures), and benchmarks. Handles 32-char UUID/hex IDs per API spec. Preserves unknown fields on round-trip. Pure Python, no network calls, no LLM dependencies.
  • databricks-skills/databricks-genie/spaces-authoring.md (new) — 7-step authoring walkthrough: scan metadata → metric views (dimensions + measures via UC semantic layer YAML) → joins → table/column descriptions → sample questions with certified SQL → reusable snippets → text instructions + benchmarks. Shows round-trip export/modify/import.
  • databricks-skills/databricks-genie/SKILL.md — adds a reference link to the new doc.
  • databricks-mcp-server/tests/test_genie_space_builder.py (new) — 25 unit tests covering round-trip fidelity, data-source management, column_configs replacement + sort order, all instructions slots, snippet field stripping, benchmark structure, and generic find/replace/remove_by_id operations.
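The unknown-field preservation mentioned above can be illustrated with a small stand-in. This is a minimal sketch, not the code in genie_space_builder.py: the class name `TinySpaceBuilder`, the `add_sample_question` signature, the `{"id", "question"}` entry shape, and the `version` / `future_field` keys are all assumptions made for illustration. Only `from_json`, `to_json`, the `sample_questions` slot, and the 32-char hex ID come from the description.

```python
import copy
import json
import uuid


class TinySpaceBuilder:
    """Hypothetical stand-in for GenieSpaceBuilder; not the PR's actual class."""

    def __init__(self, payload=None):
        # Hold on to the whole payload dict so that keys this class
        # knows nothing about are still present when it is serialized.
        self._payload = copy.deepcopy(payload) if payload else {}

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))

    def add_sample_question(self, question):
        # Assumed entry shape; the real wire format may differ.
        entry = {"id": uuid.uuid4().hex, "question": question}  # 32 hex chars
        self._payload.setdefault("sample_questions", []).append(entry)
        return entry["id"]

    def to_json(self):
        return json.dumps(self._payload, sort_keys=True)


# An exported payload containing a field the builder has never heard of.
exported = json.dumps({"version": 1, "future_field": {"keep": "me"}})

builder = TinySpaceBuilder.from_json(exported)
new_id = builder.add_sample_question("What was revenue last month?")
round_tripped = json.loads(builder.to_json())
```

After the round trip, `round_tripped` still carries `future_field` unchanged alongside the newly added sample question, which is the property the unit tests are described as checking.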

Context

The existing databricks-genie skill supports creating and migrating spaces well, but the only ergonomic path to populate the rich serialized_space slots today is hand-crafting JSON or round-tripping an exported space. Users building new spaces through the skill tend to ship thin spaces (tables + sample questions only) and fill the rest via the UI.

This PR makes the full schema authorable from code. It complements #473: @sean-zhang-dbx documents the schema (references/schema.md, references/best-practices.md), and this PR makes the documented schema ergonomic to fill. Docs can cross-reference once both land.

Scope is deliberately minimal: authoring helper + walkthrough only. No changes to existing MCP tools. No LLM pipeline: the walkthrough points to sunnysingh-db/ai-genie-space-generator as a reference implementation for anyone who wants to drive the builder from an LLM.

Cleared with @calreynolds in #ai-dev-kit before opening.

Test plan

  • Ruff lint passes (--select=E,F,B,PIE per CONTRIBUTING)
  • Ruff format passes
  • 25/25 unit tests pass (pytest databricks-mcp-server/tests/test_genie_space_builder.py)
  • Round-trip fidelity verified: from_json(to_envelope()) reconstructs an equivalent builder
  • Unknown-field preservation verified
  • Doc examples verified to reference only existing MCP tool signatures (manage_genie, execute_sql, get_table_stats_and_schema)
  • End-to-end smoke test against a live workspace (reviewer to verify — builder-built envelope passed through manage_genie(action="import"))

This pull request and its description were written by Kabeer with Claude assistance.

…kill

- databricks-mcp-server/databricks_mcp_server/tools/genie_space_builder.py:
  typed authoring API over the serialized_space payload. Covers tables,
  metric_views, column_configs, sample_questions, text_instructions,
  example_question_sqls, join_specs, sql_snippets (filters/expressions/
  measures), and benchmarks. Preserves unknown fields on round-trip.
- databricks-skills/databricks-genie/spaces-authoring.md: 7-step walkthrough
  for building rich spaces via the builder, covering scan, metric views,
  joins, descriptions, sample questions with SQL, snippets, benchmarks.
- databricks-skills/databricks-genie/SKILL.md: link to the new doc.
- 25 unit tests (no network / workspace dependencies).

Complements PR databricks-solutions#473 (serialized_space schema documentation) by making
the schema ergonomic to author from code.
Verified via end-to-end smoke test against a live workspace
(/api/2.0/genie/spaces). Original schema was modeled after older
implementations (sunnysingh-db/ai-genie-space-generator,
fe-internal-tools:genie-rooms) and didn't match the current proto.

Wire-format corrections:

- join_specs: use {left, right}: {identifier, alias} objects (not
  flat left_table/right_table strings); store the join condition in
  a `sql` list with relationship_type encoded as
  `--rt=FROM_RELATIONSHIP_TYPE_X--`. Drop the `join_type` field
  (the proto does not accept it). Validate relationship_type against
  ONE_TO_ONE / ONE_TO_MANY / MANY_TO_ONE / MANY_TO_MANY.
- sql_snippets.measures: use `alias` (not `name`).
- sql_snippets.{filters,measures,expressions}: store `sql` and
  `comment` as [str] lists (the wire format).
- benchmarks.questions: use `answer: [{format, content}]` (not
  `answers: [{format, body}]`).

Also normalise ordering at emit time: tables and metric_views must be
sorted by identifier, column_configs by column_name, and id-keyed
lists by id. Sorting is applied in to_dict() / to_json() so users can
add entries in any order.

Tests now assert against the real wire format. 26/26 pass.
Smoke test (not in PR) verifies a builder-built envelope is accepted
by /api/2.0/genie/spaces and round-trips through GET.
@KabeerThockchom
Author

End-to-end smoke test complete against a live workspace — checked off the last item in the test plan.

Pushed commit 423b1b6 with schema corrections discovered during the smoke test. The original payload structure (modeled after sunnysingh-db/ai-genie-space-generator and the internal fe-internal-tools:genie-rooms builder) didn't match the current Genie API proto. Specific fixes:

  • join_specs: use {left, right}: {identifier, alias} objects (not flat left_table/right_table strings); join condition goes in a sql list with relationship_type encoded as --rt=FROM_RELATIONSHIP_TYPE_X--. Dropped the join_type field — the proto rejects it. Added validation against the four valid relationship types.
  • sql_snippets.measures: API field is alias, not name.
  • sql_snippets.{filters,measures,expressions}: sql and comment are [str] lists, not strings.
  • benchmarks.questions: uses answer: [{format, content}] (not answers: [{format, body}]).
  • Sort constraints: data_sources.tables and data_sources.metric_views must be sorted by identifier; column_configs by column_name; id-keyed lists by id. Now normalized at emit time so users can add entries in any order.
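For readers who want to see the corrected shapes concretely, here is a minimal sketch of the join, measure, and sorting rules from the list above. It is not the builder's code: the helper names (`make_join_spec`, `make_measure`, `sorted_by`), the table identifiers, and the choice to put the `--rt=...--` marker as its own entry after the condition in the `sql` list are assumptions. The field names and the four relationship types are taken from the list.

```python
VALID_RELATIONSHIP_TYPES = {"ONE_TO_ONE", "ONE_TO_MANY", "MANY_TO_ONE", "MANY_TO_MANY"}


def make_join_spec(left, right, condition, relationship_type):
    """Build a join spec; `left` and `right` are (identifier, alias) pairs."""
    if relationship_type not in VALID_RELATIONSHIP_TYPES:
        raise ValueError(f"invalid relationship_type: {relationship_type!r}")
    return {
        "left": {"identifier": left[0], "alias": left[1]},
        "right": {"identifier": right[0], "alias": right[1]},
        # Assumption: condition first, relationship marker as a second entry.
        "sql": [condition, f"--rt=FROM_RELATIONSHIP_TYPE_{relationship_type}--"],
        # No join_type key: the proto rejects it.
    }


def make_measure(alias, sql, comment):
    """Measures use `alias` (not `name`); sql and comment are lists of strings."""
    return {"alias": alias, "sql": [sql], "comment": [comment]}


def sorted_by(items, key):
    """Return a new list ordered by the given key, as done at emit time."""
    return sorted(items, key=lambda item: item[key])


# Hypothetical identifiers, added out of order on purpose.
tables = sorted_by(
    [{"identifier": "main.sales.orders"}, {"identifier": "main.sales.customers"}],
    "identifier",
)
join = make_join_spec(
    ("main.sales.orders", "o"),
    ("main.sales.customers", "c"),
    "o.customer_id = c.id",
    "MANY_TO_ONE",
)
measure = make_measure("total_revenue", "SUM(o.amount)", "Gross order revenue")
```

Passing something like `"INNER"` as the relationship type raises `ValueError`, mirroring the validation described in the first bullet.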

Smoke test result: built a 3-table / 3-sample-Q / 1-example-SQL / 2-join / 3-snippet / 1-text-instruction / 1-benchmark payload via the builder, POSTed to /api/2.0/genie/spaces, fetched it back, and deleted it. The test is not in the PR (it's a one-off probe pinned to a specific catalog) but is available locally.

Tests: 26/26 still passing. Lint + format clean.
