diff --git a/databricks-skills/databricks-genie/SKILL.md b/databricks-skills/databricks-genie/SKILL.md index 60d16c14..ab24eb71 100644 --- a/databricks-skills/databricks-genie/SKILL.md +++ b/databricks-skills/databricks-genie/SKILL.md @@ -98,6 +98,8 @@ ask_genie( - [spaces.md](spaces.md) - Creating and managing Genie Spaces - [conversation.md](conversation.md) - Asking questions via the Conversation API +- [references/schema.md](references/schema.md) - `serialized_space` JSON schema and field reference +- [references/best-practices.md](references/best-practices.md) - Instruction authoring, prompt matching, benchmarks, troubleshooting, and validation checklist ## Prerequisites diff --git a/databricks-skills/databricks-genie/references/best-practices.md b/databricks-skills/databricks-genie/references/best-practices.md new file mode 100644 index 00000000..012c61af --- /dev/null +++ b/databricks-skills/databricks-genie/references/best-practices.md @@ -0,0 +1,235 @@ +# Genie Space Best Practices + +Best practices for creating, configuring, and maintaining high-quality Genie spaces. Covers instruction authoring, prompt matching, benchmarks, troubleshooting, and a pre-creation validation checklist. + +## Instruction Priority + +Instructions help Genie accurately interpret business questions and generate correct SQL. Prioritize SQL-based instructions over text — they are more precise and easier for Genie to apply consistently. + +**Priority order (most to least effective):** + +1. **SQL Expressions** — for common business terms (metrics, filters, dimensions) +2. **Example SQL Queries** — for complex, multi-part, or hard-to-interpret questions +3. **Text Instructions** — for general guidance that doesn't fit structured SQL definitions + +A Genie space supports up to **100 instructions total**, counted as: each example SQL query = 1, each SQL function = 1, the entire text instructions block = 1. 
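As a rough illustration of the counting rule above, the sketch below tallies instructions from a `serialized_space`-shaped dictionary. The helper name `count_instructions` and the constant `MAX_INSTRUCTIONS` are hypothetical, and the sketch assumes that only the three categories named above count toward the limit.

```python
MAX_INSTRUCTIONS = 100  # limit stated above; the constant name is illustrative


def count_instructions(space: dict) -> int:
    """Count instructions using the rule above (hypothetical helper)."""
    instructions = space.get("instructions", {})
    example_sqls = len(instructions.get("example_question_sqls", []))
    sql_functions = len(instructions.get("sql_functions", []))
    # The whole text-instructions block counts once, however long it is.
    text_block = 1 if instructions.get("text_instructions") else 0
    return example_sqls + sql_functions + text_block


space = {
    "instructions": {
        "text_instructions": [{"content": ["Fiscal year starts April 1st.\n"]}],
        "example_question_sqls": [{"question": ["q1"]}, {"question": ["q2"]}],
        "sql_functions": [{"identifier": "catalog.schema.fiscal_quarter"}],
    }
}
total = count_instructions(space)
print(f"{total} of {MAX_INSTRUCTIONS} instructions used")
```

A check like this could run before each update so that a growing set of example queries does not silently approach the limit.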
+ +## SQL Expressions + +Use SQL expressions to define frequently used business terms as reusable definitions. These are stored in `instructions.sql_snippets` in the `serialized_space` configuration (see [schema.md](schema.md)). + +**Three types:** + +- **Measures** (`sql_snippets.measures`): KPIs and aggregation metrics + ```json + {"id": "...", "alias": "total_revenue", "sql": ["SUM(orders.quantity * orders.unit_price)"]} + ``` +- **Filters** (`sql_snippets.filters`): Common filtering conditions (boolean expression — do **not** include the `WHERE` keyword) + ```json + {"id": "...", "display_name": "high value", "sql": ["orders.amount > 1000"]} + ``` +- **Dimensions** (`sql_snippets.expressions`): Attributes for grouping and analysis + ```json + {"id": "...", "alias": "order_year", "sql": ["YEAR(orders.order_date)"]} + ``` + +**Formatting rules:** +- The `sql` field is a **string array** (`string[]`). Wrap the SQL fragment in an array (e.g., `["SUM(orders.amount)"]`). The API rejects plain strings. +- **All column references must be table-qualified** (`table_name.column_name`). The Genie UI rejects bare column names with "Table name or alias is required for column '...'". +- Filters must **NOT** include the `WHERE` keyword — only the boolean condition. + +**Good candidates:** metrics (gross margin, conversion rate), filters ("active customer", "high-value order"), dimensions (fiscal quarter, product category groupings). + +## Example SQL Queries + +Use complete example SQL queries for hard-to-interpret, multi-part, or complex questions. Good candidates include questions requiring complex joins, multi-step calculations, and domain-specific aggregations. + +**Use one question per SQL entry.** Each example SQL query should map to exactly one natural language question. For multiple phrasings of the same question, create separate entries with the same SQL. 
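The one-question-per-entry rule can be applied mechanically when several phrasings share one query. The sketch below is a minimal illustration: `entries_for_phrasings` is a hypothetical helper, and the 32-character hex `id` and sort-by-`id` conventions are taken from the schema reference in this change.

```python
import secrets


def entries_for_phrasings(phrasings: list[str], sql_lines: list[str]) -> list[dict]:
    """Create one example SQL entry per phrasing, all sharing the same SQL."""
    entries = [
        {
            "id": secrets.token_hex(16),  # 32 lowercase hex characters
            "question": [phrasing],       # exactly one question per entry
            "sql": list(sql_lines),       # copy so entries do not share a list
        }
        for phrasing in phrasings
    ]
    return sorted(entries, key=lambda entry: entry["id"])


shared_sql = ["SELECT COUNT(*) AS order_count\n", "FROM catalog.schema.orders"]
entries = entries_for_phrasings(
    ["How many orders do we have?", "What is the total number of orders?"],
    shared_sql,
)
for entry in entries:
    print(entry["question"][0])
```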
+ +**Line-split format for `sql`:** Each SQL clause should be a **separate string element** in the array with `\n` at the end. Never concatenate clauses into one string. + +```json +{ + "question": ["What are total sales by product category?"], + "sql": [ + "SELECT\n", + " p.category,\n", + " SUM(o.quantity * o.unit_price) as total_sales\n", + "FROM catalog.schema.orders o\n", + "JOIN catalog.schema.products p ON o.product_id = p.product_id\n", + "GROUP BY p.category\n", + "ORDER BY total_sales DESC" + ] +} +``` + +### Parameterized Queries + +Add parameters using `:parameter_name` syntax. Parameterized queries become **trusted assets** (labeled "Trusted" in responses). Use for recurring questions where users specify different filter values. Parameter types: String (default), Date, Date and Time, Numeric. Use static queries for questions that don't vary or to teach Genie general patterns. + +## Text Instructions + +Reserve text instructions for context that doesn't fit SQL definitions. Keep them concise and specific. + +**Good text instructions:** +- "Active customer" means a customer with at least one order in the last 90 days +- Revenue should always be calculated as quantity * unit_price * (1 - discount) +- Fiscal year starts April 1st +- All monetary values are in USD unless otherwise specified + +**Avoid vague instructions.** Instead of "Ask clarification questions when asked about sales," write: +> "When users ask about sales metrics without specifying product name or sales channel, ask: 'To proceed with sales analysis, please specify your product name and sales channel.'" + +Ensure consistency across all instruction types. If text instructions specify rounding decimals to two digits, example SQL queries must also round to two digits. 
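The line-split `sql` format described earlier in this section can be produced from an ordinary multi-line query string. The sketch below assumes one clause per source line; `split_sql_lines` is a hypothetical helper name, not part of any Databricks API.

```python
def split_sql_lines(sql: str) -> list[str]:
    """Split a query into one array element per line, each ending with a newline.

    The final element is left without a trailing newline, matching the
    example format shown for example SQL queries.
    """
    lines = [line.rstrip() for line in sql.strip().splitlines() if line.strip()]
    return [line + "\n" for line in lines[:-1]] + lines[-1:]


query = """
SELECT
  p.category,
  SUM(o.quantity * o.unit_price) AS total_sales
FROM catalog.schema.orders o
JOIN catalog.schema.products p ON o.product_id = p.product_id
GROUP BY p.category
"""
sql_array = split_sql_lines(query)
print(sql_array)
```

Joining the array back with `"".join(sql_array)` yields the original query, which is a quick way to confirm that no clause was jammed together.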
+ +### Clarification Questions + +Structure clarification instructions with: trigger condition ("When users ask about X..."), missing details ("...but don't include Y..."), required action ("...ask a clarification question first..."), and an example question. + +### Summary Customization + +Add a dedicated section at the end of text instructions with the heading **"Instructions you must follow when providing summaries"** to control how Genie generates natural language summaries alongside query results. Only text instructions affect summary generation. + +## Join Specs + +Define table relationships in `instructions.join_specs` when foreign keys are not defined in Unity Catalog. + +**Critical format requirement:** The `sql` array must contain **two elements**: +1. The join condition using backtick-quoted alias references: `` `orders`.`customer_id` = `customers`.`customer_id` `` +2. A **relationship type annotation** — one of: + - `"--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_ONE--"` + - `"--rt=FROM_RELATIONSHIP_TYPE_ONE_TO_MANY--"` + - `"--rt=FROM_RELATIONSHIP_TYPE_ONE_TO_ONE--"` + - `"--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_MANY--"` + +Without the `--rt=...--` annotation, the API rejects the request with a protobuf parsing error. See [schema.md](schema.md) for the full field reference. + +**Priority for defining table relationships:** +1. Define foreign keys in Unity Catalog (most reliable) +2. Define join specs in the `serialized_space` via the API +3. Define join relationships in the Genie space UI (Configure > Knowledge store) +4. Provide example SQL queries with correct joins (effective fallback) +5. Pre-join tables into views (last resort) + +## Column Configuration and Prompt Matching + +"Prompt matching" is the umbrella term for two features: +- **Format assistance** — provides representative values so Genie understands data types and formatting patterns +- **Entity matching** — maps user terms to actual column values (e.g., "California" → "CA"). 
Requires `enable_format_assistance: true`. + +### API vs UI Behavior + +When tables are added via the Genie space UI, both features are auto-enabled for eligible columns. **When creating spaces via the API, prompt matching is OFF by default.** You must explicitly include `column_configs` entries with `enable_format_assistance: true` and `enable_entity_matching: true` for each column that needs it. + +After creating a space via API, open it in the UI and verify prompt matching is enabled for key filter columns (Configure > Data > column > Advanced settings). + +### Entity Matching Limits + +- Up to 120 columns per space +- Up to 1,024 distinct values per column (max 127 characters per value) +- Tables with row filters or column masks are excluded + +### When to Disable + +Turn off format assistance (and entity matching) on columns that are excluded (`exclude: true`) or on high-cardinality freetext columns where entity matching adds no value. + +### Hiding Columns + +Set `"exclude": true` in `column_configs` to hide a column from Genie. Use for internal IDs, ETL timestamps, or columns not useful for business questions. **Never hide columns without explicit user approval** — do not infer which columns to hide based on column names. + +## Benchmarks + +Every new space should include benchmarks in its initial configuration. Target **10-20 benchmark questions**. + +### Core Benchmarks (high expected accuracy: 80-100%) + +For each example SQL query, include the original question as a smoke test plus 2-3 alternate phrasings. Ground truth SQL = the exact same SQL from the corresponding `example_question_sqls` entry (reuse verbatim). + +### Stretch Benchmarks (lower expected accuracy) + +New questions covering sample questions or other use cases with no corresponding example SQL. Ground truth SQL is independently written but follows the same conventions. 
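The core-benchmark recipe above (original question plus alternate phrasings, with ground truth SQL reused verbatim) could be sketched as follows. `core_benchmarks` is a hypothetical helper, and the entry shape follows the `benchmarks.questions` structure from the schema reference in this change.

```python
import secrets


def core_benchmarks(example: dict, alternates: list[str]) -> list[dict]:
    """Build core benchmark entries from one example SQL entry.

    The original question is kept as a smoke test, and every entry reuses
    the example's SQL unchanged as its ground truth.
    """
    questions = list(example["question"]) + list(alternates)
    return [
        {
            "id": secrets.token_hex(16),
            "question": [question],
            "answer": [{"format": "SQL", "content": list(example["sql"])}],
        }
        for question in questions
    ]


example = {
    "question": ["What are total sales by product category?"],
    "sql": ["SELECT category, SUM(amount) AS total\n", "FROM catalog.schema.orders\n", "GROUP BY category"],
}
benchmarks = core_benchmarks(
    example,
    ["Show total sales for each product category", "Break down sales by category"],
)
print(len(benchmarks), "core benchmark questions")
```

Stretch benchmarks are not covered here because their ground truth SQL has to be written independently.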
+ +### Writing Good Benchmarks + +- **Questions must be unambiguous.** Include the exact metric, grouping, count, and scope so the ground truth SQL is the only reasonable interpretation. Bad: "Show me the most lethal cancers." Good: "What are the top 5 cancer types ranked by average mortality rate?" +- **Ground truth SQL must be minimal.** Only include columns and clauses directly implied by the question. Extra columns cause benchmark failures. + +### Interpreting Results + +| Rating | Condition | +|--------|-----------| +| **Good** | Generated SQL or result set matches ground truth (including different sort order or numeric values matching to 4 significant digits) | +| **Bad** | Empty result set, error, extra columns, or different single-cell result | +| **Manual review** | Genie couldn't assess, or no SQL answer was provided | + +## Troubleshooting Decision Tree + +Use this decision tree when a user reports a specific problem with their Genie space. + +### "Genie uses the wrong table or column" +1. Check table/column descriptions — do they match user terminology? +2. Look for overlapping column names across tables +3. Fix: Add example SQL queries showing correct usage, hide confusing columns + +### "Genie misunderstands our terminology" +1. Check if the term is defined in text instructions or SQL expressions +2. Check column synonyms in the knowledge store +3. Fix: Add a SQL expression or text instruction mapping the term to the correct data concept + +### "Genie filters on wrong values" (e.g., "California" vs "CA") +1. Check if entity matching and format assistance are enabled for the column (Configure > Data > column > Advanced settings) +2. Check if prompt matching data is up to date (kebab menu > Refresh prompt matching) +3. Fix: Enable entity matching (requires format assistance), refresh values if data changed + +### "Genie joins tables incorrectly" +1. Check for foreign key constraints in Unity Catalog +2. 
Check join relationships in the knowledge store +3. Fix: Define join relationships or add example SQL queries with correct joins + +### "Metric calculations are wrong" +1. Check if the metric is defined as a SQL expression +2. Check if there's an example SQL query computing it correctly +3. Check for pre-aggregated tables that might be double-counted +4. Fix: Add SQL expressions for metrics, or example SQL for complex calculations + +### "Timezone/date calculations are wrong" +1. Check text instructions for timezone guidance +2. Fix: Add explicit instructions like "Time zones are in UTC. Convert using `convert_timezone('UTC', 'America/Los_Angeles', )`." + +### "Genie ignores my instructions" +1. Check for conflicting instructions across types +2. Check if the instruction count is high (noise drowns out signal) +3. Fix: Add example SQL (most effective), hide irrelevant columns, simplify instruction set, start a new chat for testing + +### "Responses are slow or timing out" +1. Check query history for slow-running queries +2. 
Fix: Use trusted assets for complex logic, reduce example SQL length, start new chat + +## Validation Checklist + +Before creating or updating a space, verify: + +- [ ] **Title** is a clear, descriptive name (not empty, not generic like "Untitled") +- [ ] **Description** is a one-sentence summary of the space's purpose +- [ ] Space has a **clearly defined purpose** for a specific topic and audience +- [ ] At least one valid Unity Catalog table is specified +- [ ] Tables are **focused** — ideally 5 or fewer, maximum 25 +- [ ] Tables exist and user has SELECT permission +- [ ] **Actual column names and values have been inspected** (run `DESCRIBE TABLE` and `SELECT DISTINCT` on key columns) +- [ ] Column names and descriptions are clear and well-annotated in Unity Catalog +- [ ] Warehouse ID is valid and is a pro or serverless SQL warehouse +- [ ] Parent path exists in workspace +- [ ] Sample questions are business-friendly and cover common use cases +- [ ] **SQL expressions** are defined for key metrics, filters, and dimensions, with table-qualified column references +- [ ] **Example SQL queries** are included for complex or multi-step questions +- [ ] **All example SQL queries have been executed** and return valid results +- [ ] **Text instructions** are concise, specific, and non-conflicting +- [ ] Instructions across all types are consistent (same rounding, same date conventions) +- [ ] **`column_configs`** include `enable_format_assistance: true` and `enable_entity_matching: true` for key filter columns (prompt matching is NOT auto-enabled via API) +- [ ] **No columns are excluded** unless the user explicitly approved them +- [ ] **Benchmarks** are included with 10-20 questions and SQL ground truth +- [ ] All IDs are 32-char lowercase hex, unique, and arrays are sorted by `id` / `identifier` +- [ ] `sql` fields are string arrays (not plain strings) +- [ ] Filter snippets do not include the `WHERE` keyword +- [ ] `text_instructions.content` elements end with `\n` 
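Several of the mechanical checklist items above can be checked in code before a space is submitted. The sketch below is a partial, hypothetical validator (`validate_space` is not part of any Databricks library) that covers only ID format, `sql` array typing, the `WHERE` rule for filters, trailing newlines in text instructions, and the entity matching dependency on format assistance.

```python
import re

HEX_ID = re.compile(r"[0-9a-f]{32}")
LEADING_WHERE = re.compile(r"\s*WHERE\b", re.IGNORECASE)


def _walk(node):
    """Yield every dict nested anywhere inside node."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def validate_space(space: dict) -> list[str]:
    """Return a list of problems found (hypothetical helper, partial coverage)."""
    problems = []
    for obj in _walk(space):
        if "id" in obj and not HEX_ID.fullmatch(str(obj["id"])):
            problems.append(f"id is not 32-char lowercase hex: {obj['id']!r}")
        if "sql" in obj and not isinstance(obj["sql"], list):
            problems.append(f"sql must be a string array: {obj['sql']!r}")
        if obj.get("enable_entity_matching") and not obj.get("enable_format_assistance"):
            problems.append(f"entity matching needs format assistance: {obj.get('column_name')!r}")

    instructions = space.get("instructions", {})
    for snippet in instructions.get("sql_snippets", {}).get("filters", []):
        sql = snippet.get("sql")
        if isinstance(sql, list) and LEADING_WHERE.match("".join(sql)):
            problems.append(f"filter includes WHERE: {snippet.get('display_name')!r}")
    for block in instructions.get("text_instructions", []):
        for element in block.get("content", []):
            if not element.endswith("\n"):
                problems.append(f"text instruction element lacks trailing newline: {element!r}")
    return problems


bad_space = {
    "config": {"sample_questions": [{"id": "ABC", "question": ["q"]}]},
    "instructions": {
        "text_instructions": [{"id": "b" * 32, "content": ["Rule one."]}],
        "sql_snippets": {
            "filters": [{"id": "f" * 32, "display_name": "high value",
                         "sql": ["WHERE orders.amount > 1000"]}],
            "measures": [{"id": "a" * 32, "alias": "total", "sql": "SUM(orders.amount)"}],
        },
    },
}
problems = validate_space(bad_space)
for problem in problems:
    print(problem)
```

Items that depend on the workspace, such as table permissions or warehouse type, still need to be verified against the live environment.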
+ diff --git a/databricks-skills/databricks-genie/references/schema.md b/databricks-skills/databricks-genie/references/schema.md new file mode 100644 index 00000000..43b0a7b4 --- /dev/null +++ b/databricks-skills/databricks-genie/references/schema.md @@ -0,0 +1,299 @@ +# serialized_space JSON Schema + +Complete structure for the `serialized_space` configuration. Include only sections relevant to the user's space. + +## Contents +- Example Structure (full JSON) +- Field Reference: config, data_sources, instructions, benchmarks +- Prompt Matching Overview (format assistance, entity matching, limits) +- Important Notes (formatting rules, sorting, common mistakes) +- ID Generation + +```json +{ + "version": 2, + "config": { + "sample_questions": [ + { + "id": "a1b2c3d4e5f60000000000000000000a", + "question": ["What were total sales last month?"] + } + ] + }, + "data_sources": { + "tables": [ + { + "identifier": "catalog.schema.orders", + "description": ["Daily sales transactions with line-item details"], + "column_configs": [ + { + "column_name": "etl_timestamp", + "exclude": true, + "enable_entity_matching": false, + "enable_format_assistance": false + }, + { + "column_name": "region", + "description": ["Sales region code: AMER, EMEA, APJ, LATAM"], + "synonyms": ["area", "territory", "sales region"], + "exclude": false, + "enable_entity_matching": true, + "enable_format_assistance": true + } + ] + }, + { + "identifier": "catalog.schema.products", + "column_configs": [ + { + "column_name": "category", + "enable_entity_matching": true, + "enable_format_assistance": true + } + ] + } + ], + "metric_views": [ + { + "identifier": "catalog.schema.revenue_metrics", + "description": ["Revenue metrics"] + } + ] + }, + "instructions": { + "text_instructions": [ + { + "id": "b2c3d4e5f6a70000000000000000000b", + "content": ["Revenue = quantity * unit_price.\n", "Fiscal year starts April 1st.\n"] + } + ], + "example_question_sqls": [ + { + "id": "c3d4e5f6a7b80000000000000000000c", +
"question": ["Show top 10 customers by revenue"], + "sql": ["SELECT\n", " customer_name,\n", " SUM(amount) as total\n", "FROM catalog.schema.orders\n", "GROUP BY customer_name\n", "ORDER BY total DESC\n", "LIMIT 10"], + "usage_guidance": ["Use this pattern for any top-N ranking question by a numeric metric"] + } + ], + "sql_functions": [ + { + "id": "d4e5f6a7b8c90000000000000000000d", + "identifier": "catalog.schema.fiscal_quarter", + "description": "Calculates the fiscal quarter from a date (fiscal year starts April 1)" + } + ], + "join_specs": [ + { + "id": "e5f6a7b8c9d00000000000000000000e", + "left": {"identifier": "catalog.schema.orders", "alias": "orders"}, + "right": {"identifier": "catalog.schema.customers", "alias": "customers"}, + "sql": [ + "`orders`.`customer_id` = `customers`.`customer_id`", + "--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_ONE--" + ], + "comment": ["Join orders to customers on customer_id"], + "instruction": ["Use this join when relating orders to customer demographics"] + } + ], + "sql_snippets": { + "filters": [ + { + "id": "f6a7b8c9d0e10000000000000000000f", + "display_name": "high value", + "sql": ["orders.amount > 1000"], + "synonyms": ["big deal", "large order"], + "instruction": ["Apply when users ask about high-value or large orders"], + "comment": ["Threshold aligned with finance team's definition"] + } + ], + "expressions": [ + { + "id": "a7b8c9d0e1f20000000000000000000a", + "alias": "order_year", + "display_name": "Order Year", + "sql": ["YEAR(orders.order_date)"], + "synonyms": ["year"], + "instruction": ["Use for any year-based grouping of orders"], + "comment": ["Standard date dimension for annual reporting"] + } + ], + "measures": [ + { + "id": "b8c9d0e1f2a30000000000000000000b", + "alias": "total_revenue", + "display_name": "Total Revenue", + "sql": ["SUM(orders.quantity * orders.unit_price)"], + "synonyms": ["revenue", "sales", "total sales"], + "instruction": ["Use for any revenue aggregation"], + "comment": ["Revenue includes 
all non-cancelled order line items"] + } + ] + } + }, + "benchmarks": { + "questions": [ + { + "id": "c9d0e1f2a3b40000000000000000000c", + "question": ["What is average order value?"], + "answer": [{"format": "SQL", "content": ["SELECT AVG(amount) FROM catalog.schema.orders"]}] + } + ] + } +} +``` + +--- + +## Field Reference + +### config + +| Field | Type | Description | +|-------|------|-------------| +| `config.sample_questions[].id` | string | 32-char lowercase hex ID | +| `config.sample_questions[].question` | string[] | Single question string per entry | + +### data_sources + +| Field | Type | Description | +|-------|------|-------------| +| `data_sources.tables[].identifier` | string | Fully qualified table name (`catalog.schema.table`) | +| `data_sources.tables[].description` | string[] | Space-scoped description override (optional) | +| `data_sources.tables[].column_configs[]` | array | Per-column configuration (optional — set via API or manage flow) | +| `data_sources.tables[].column_configs[].column_name` | string | Column name | +| `data_sources.tables[].column_configs[].description` | string[] | Contextual description beyond the column name | +| `data_sources.tables[].column_configs[].synonyms` | string[] | Alternative names users might use for this column | +| `data_sources.tables[].column_configs[].exclude` | boolean | Hide this column from Genie (default: false) | +| `data_sources.tables[].column_configs[].enable_format_assistance` | boolean | **(v2 only)** Provide representative values so Genie understands data types and formatting. Auto-enabled via UI but **OFF by default via API** — must be set explicitly. Must be `true` for entity matching to work. | +| `data_sources.tables[].column_configs[].enable_entity_matching` | boolean | **(v2 only)** Match user terms to actual column values (e.g., "California" → "CA"). Auto-enabled via UI but **OFF by default via API** — must be set explicitly. Requires `enable_format_assistance: true`. 
Supports up to 120 columns, 1,024 distinct values per column (max 127 chars each). | +| `data_sources.metric_views[].identifier` | string | Fully qualified metric view name | +| `data_sources.metric_views[].description` | string[] | What the metric view computes | +| `data_sources.metric_views[].column_configs[]` | array | Per-column configuration (optional — same fields as table column_configs) | + +> **v1 vs v2:** Spaces with `"version": 2` reject v1 column_config fields. Use the v2 equivalents instead: +> - `get_example_values` (v1) → `enable_format_assistance` (v2) +> - `build_value_dictionary` (v1) → `enable_entity_matching` (v2) +> +> Including v1 fields in a v2 space will cause API errors. Always use the v2 field names. + +> **Prompt matching overview:** "Prompt matching" is the umbrella term for two features that work together: +> - **Format assistance** — provides representative values so Genie understands data types and formatting patterns. +> - **Entity matching** — provides curated lists of distinct values so Genie can map user terms to actual data (e.g., "California" → "CA"). Requires `enable_format_assistance: true` (turning off format assistance automatically disables entity matching). +> +> **UI vs API behavior:** When tables are added via the Genie space UI, both features are auto-enabled for eligible columns. **When creating spaces via the API, prompt matching is OFF by default.** You must explicitly include `column_configs` entries with `enable_format_assistance: true` and `enable_entity_matching: true` for each column that needs it. Columns not listed in `column_configs` will not have prompt matching. +> +> **Post-creation recommendation:** After creating a space via API, open it in the UI and verify that prompt matching is enabled for all key filter columns (Configure > Data > column > Advanced settings). You can also enable it for additional columns at that point. 
+> +> **Entity matching limits:** Up to 120 columns per space, 1,024 distinct values per column, max 127 characters per value. Tables with row filters or column masks are excluded from prompt matching. +> +> **When to disable:** Turn off format assistance (and entity matching) on columns that are excluded (`exclude: true`) or on high-cardinality freetext columns where entity matching adds no value. + +### instructions + +| Field | Type | Description | +|-------|------|-------------| +| `instructions.text_instructions[].id` | string | 32-char hex ID | +| `instructions.text_instructions[].content` | string[] | Instruction text segments. Max 1 text instruction per space. **Important:** The API concatenates array elements without separators — each element must end with `\n` or a space to avoid jammed text. | +| `instructions.example_question_sqls[].id` | string | 32-char hex ID | +| `instructions.example_question_sqls[].question` | string[] | Single natural language question | +| `instructions.example_question_sqls[].sql` | string[] | SQL query split by lines with `\n` | +| `instructions.example_question_sqls[].usage_guidance` | string[] | When Genie should apply this pattern (optional but recommended) | +| `instructions.example_question_sqls[].parameters` | array | Parameterized values (optional) | +| `instructions.example_question_sqls[].parameters[].name` | string | Parameter name (matches `:name` in SQL) | +| `instructions.example_question_sqls[].parameters[].description` | string[] | What the parameter represents | +| `instructions.example_question_sqls[].parameters[].type_hint` | string | Data type hint (e.g., `"STRING"`, `"INT"`, `"DATE"`) | +| `instructions.example_question_sqls[].parameters[].default_value` | object | Default value with `values` array (e.g., `{"values": ["month"]}`) | +| `instructions.sql_functions[].id` | string | 32-char hex ID | +| `instructions.sql_functions[].identifier` | string | Fully qualified UDF name | +| 
`instructions.sql_functions[].description` | string | What the function does (plain string, not array) | +| `instructions.join_specs[].id` | string | 32-char hex ID | +| `instructions.join_specs[].left` | object | Left table: `{"identifier": "catalog.schema.table", "alias": "short_name"}`. `alias` is recommended. | +| `instructions.join_specs[].right` | object | Right table: `{"identifier": "catalog.schema.table", "alias": "short_name"}`. `alias` is recommended. | +| `instructions.join_specs[].sql` | string[] | **Two elements required:** (1) Join condition using backtick-quoted alias references (e.g., `` `orders`.`customer_id` = `customers`.`customer_id` ``), (2) Relationship type annotation (e.g., `"--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_ONE--"`). | +| `instructions.join_specs[].comment` | string[] | Business context for the relationship (optional) | +| `instructions.join_specs[].instruction` | string[] | Guidance on when to use this join (optional) | + +> **Critical `join_specs` format requirement:** The `sql` array must contain **two elements**: +> 1. The join condition — use backtick-quoted alias or table name references: `` "`alias`.`column` = `alias`.`column`" `` +> 2. A **relationship type annotation** — one of: +> - `"--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_ONE--"` +> - `"--rt=FROM_RELATIONSHIP_TYPE_ONE_TO_MANY--"` +> - `"--rt=FROM_RELATIONSHIP_TYPE_ONE_TO_ONE--"` +> - `"--rt=FROM_RELATIONSHIP_TYPE_MANY_TO_MANY--"` +> +> **Without the `--rt=...--` annotation, the API rejects the request** with a protobuf parsing error. This annotation is required even though it is not documented in the official API reference. 
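The two-element `sql` requirement above is easy to get wrong by hand, so a small builder can help. The sketch below is a minimal illustration under stated assumptions: `make_join_spec` is a hypothetical helper, each table is passed as an `(identifier, alias)` pair, and only single-column join conditions are handled.

```python
import secrets

# Relationship types listed in the format requirement above.
RELATIONSHIP_TYPES = ("MANY_TO_ONE", "ONE_TO_MANY", "ONE_TO_ONE", "MANY_TO_MANY")


def make_join_spec(left: tuple[str, str], right: tuple[str, str],
                   left_col: str, right_col: str, relationship: str) -> dict:
    """Build one join spec; left and right are (identifier, alias) pairs."""
    if relationship not in RELATIONSHIP_TYPES:
        raise ValueError(f"unknown relationship type: {relationship}")
    (left_id, left_alias), (right_id, right_alias) = left, right
    return {
        "id": secrets.token_hex(16),
        "left": {"identifier": left_id, "alias": left_alias},
        "right": {"identifier": right_id, "alias": right_alias},
        "sql": [
            # Element 1: join condition with backtick-quoted alias references.
            f"`{left_alias}`.`{left_col}` = `{right_alias}`.`{right_col}`",
            # Element 2: the relationship type annotation.
            f"--rt=FROM_RELATIONSHIP_TYPE_{relationship}--",
        ],
    }


spec = make_join_spec(
    ("catalog.schema.orders", "orders"),
    ("catalog.schema.customers", "customers"),
    "customer_id", "customer_id", "MANY_TO_ONE",
)
print(spec["sql"])
```

Rejecting unknown relationship names up front surfaces a typo locally instead of as a parsing error from the API.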
+ +### sql_snippets (shared fields) + +All three snippet types (`filters`, `expressions`, `measures`) support these optional fields in addition to their required ones: + +| Field | Type | Description | +|-------|------|-------------| +| `synonyms` | string[] | Alternative terms that trigger this snippet | +| `instruction` | string[] | When Genie should apply this snippet | +| `comment` | string[] | Additional context or notes | +| `display_name` | string | Human-readable name (required for filters, optional for measures/expressions) | + +**Type-specific required fields:** + +| Type | Required Fields | +|------|----------------| +| `sql_snippets.filters[]` | `id`, `sql` (string[]), `display_name` | +| `sql_snippets.expressions[]` | `id`, `sql` (string[]), `alias` | +| `sql_snippets.measures[]` | `id`, `sql` (string[]), `alias` | + +> **Important:** The `sql` field in `sql_snippets` is a **string array** (`string[]`), the same format as `example_question_sqls[].sql`. Wrap the SQL fragment in a single-element array (e.g., `["SUM(orders.amount)"]`). The API rejects plain strings. +> +> **Column references must be table-qualified** (`table_name.column_name`) in all snippet types. The API accepts bare column names, but the Genie UI rejects them with "Table name or alias is required for column '...'". Filters must also omit the `WHERE` keyword — provide only the boolean condition. + +### benchmarks + +| Field | Type | Description | +|-------|------|-------------| +| `benchmarks.questions[].id` | string | 32-char hex ID (must be unique across `sample_questions` and `benchmarks.questions`) | +| `benchmarks.questions[].question` | string[] | Test question | +| `benchmarks.questions[].answer` | array | Expected answers. Find `format: "SQL"` for SQL benchmarks. 
| +| `benchmarks.questions[].answer[].format` | string | Answer format (typically `"SQL"`) | +| `benchmarks.questions[].answer[].content` | string[] | Expected SQL query segments | + +--- + +## Important Notes + +- `version`: **Required**. Use `2` for new spaces +- `question`: Must be an array with a **single question string** per entry +- `sql` (in `example_question_sqls`): Must be an **array of strings**. Each SQL clause should be a separate element with `\n` at the end: + - Correct: `["SELECT\n", " col1,\n", " col2\n", "FROM table\n", "WHERE col1 > 0"]` + - Wrong: `["SELECT col1, col2FROM tableWHERE col1 > 0"]` +- `sql` (in `sql_snippets`): Must be a **string array** (`string[]`), same format as `example_question_sqls`. Wrap the SQL fragment in an array: + - Correct: `["SUM(orders.quantity * orders.unit_price)"]` + - Wrong: `"SUM(orders.quantity * orders.unit_price)"` (plain string — API rejects this) +- **SQL snippets require table-qualified column references** (`table_name.column`). The API accepts bare column names, but the **UI rejects them** with "Table name or alias is required for column '...'". Always use `table_name.column_name` format in all snippet types (filters, measures, expressions): + - Correct: `["orders.amount > 1000"]` + - Wrong: `["amount > 1000"]` (UI rejects bare column names) +- **Filters must NOT include the `WHERE` keyword** — only the boolean condition. Genie adds the `WHERE` clause itself: + - Correct: `["orders.amount > 1000"]` + - Wrong: `["WHERE orders.amount > 1000"]` +- **ID Format**: All IDs must be exactly 32 lowercase hexadecimal characters (no hyphens) +- **Sorting**: All arrays of objects with `id` fields must be sorted alphabetically by `id`. Tables must be sorted by `identifier`. `column_configs` must be sorted by `column_name`. +- **Include only what's needed**: Omit sections that don't apply (e.g., skip `metric_views` if none). Note: `benchmarks` is required for new spaces. 
+- **Join spec format**: The `sql` array requires exactly **two elements**: (1) the join condition using backtick-quoted references, and (2) a `--rt=FROM_RELATIONSHIP_TYPE_...--` annotation. Without the relationship type annotation, the API rejects the request. For multi-column joins, create separate join specs. +- **`text_instructions` content formatting**: The API concatenates `content` array elements without any separator. Each element must end with `\n` or a trailing space to prevent text from being jammed together (e.g., `"Rule one.\n"` not `"Rule one."`) +- **column_configs**: Usually set post-creation via the manage flow or UI. For initial creation, column descriptions and synonyms can be added via `column_configs` in the API, or later through the Genie space UI. Note: `enable_entity_matching` requires `enable_format_assistance` to be `true` — the API will reject configurations where entity matching is enabled but format assistance is not. + +--- + +## ID Generation + +Generate unique 32-character hexadecimal IDs (UUID format without hyphens): + +```python +import secrets +question_id = secrets.token_hex(16) # Generates 32-char hex string +``` + +**Important ID Requirements:** +- Must be exactly 32 characters long +- Must be lowercase hexadecimal (0-9, a-f) +- No hyphens or other separators +- Must be unique within their collection
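
The ID requirements above combine naturally with the sorting rule from the notes in this file. The sketch below is a minimal illustration: `with_sorted_ids` is a hypothetical helper that assigns an ID to any item lacking one, checks uniqueness within the collection, and returns the items sorted by `id`.

```python
import secrets


def new_id() -> str:
    """Return 32 lowercase hexadecimal characters with no hyphens."""
    return secrets.token_hex(16)


def with_sorted_ids(items: list[dict]) -> list[dict]:
    """Assign missing IDs, verify uniqueness, and sort the collection by id."""
    # An existing "id" in the item overrides the freshly generated one.
    assigned = [{"id": new_id(), **item} for item in items]
    ids = [item["id"] for item in assigned]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate id within collection")
    return sorted(assigned, key=lambda item: item["id"])


sample_questions = with_sorted_ids([
    {"question": ["What were total sales last month?"]},
    {"question": ["Which region had the most orders?"]},
    {"id": "a1b2c3d4e5f60000000000000000000a", "question": ["What is average order value?"]},
])
for entry in sample_questions:
    print(entry["id"], entry["question"][0])
```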