apps+lakebase: declare-resource as primary path, SQL grant as fallback, psql wrapper #59
jamesbroadhead wants to merge 5 commits into main
Conversation
Pushed a correction (84e0581) — turns out the CLI is not broken, I'd been wrapping the JSON wrong. The CLI's `--json` flag binds to the inner Role object (fields go directly under `spec`, not wrapped in `{"role": ...}`). Updated the SKILL.md to show both forms (CLI and SQL) — SQL is still preferable in practice because you can bundle role creation + grants in one psql round-trip — and reframed the troubleshooting row around the wrapping confusion rather than saying the CLI is broken. The real CLI gap (no convenience flags for the nested spec fields) is separate, and worth its own CLI PR; the `databricks psql` recipe is unaffected.
Big reframe pushed in 03810f6 — Pawel pointed out that the Apps platform auto-creates the SP's Postgres role when the app declares a `database` resource. Updated PR title and body to match; the diff is now 18 insertions / 14 deletions across three files.
Adds the SQL block (with `databricks_create_role` + DML grants) needed once after the first deploy of an AppKit Lakebase app — without it the SP gets `password authentication failed`. Promotes `databricks psql` as the runnable form of #56's manual `generate-database-credential` recipe. Also flags that the `databricks postgres create-role` CLI rejects every SP-role payload, so agents stop trying to use it.

Co-authored-by: Isaac
The original commit told agents the `databricks postgres create-role`
CLI couldn't create SP roles. That was wrong — the CLI works with
`--json '{"spec": {...}}'` (fields go on the inner Role object, not
wrapped under `{"role": ...}`). The "Field 'role' is required" error
fires when the inner Role has no recognized fields, which happens when
the body is wrapped — the CLI strips `role` as unknown and ships an
empty body.
Show the working CLI form alongside the SQL block (`databricks_create_role()`
is still preferable in practice because it bundles role creation + grants
into one psql round-trip), and rewrite the troubleshooting row to point
at the wrapping confusion instead of saying the CLI is broken.
Also call out that the CLI doesn't yet expose convenience flags for
nested spec fields (TODOs in cmd/workspace/postgres/postgres.go) — that
is the real gap, and a separate CLI PR is appropriate.
Co-authored-by: Isaac
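To make the body-shape distinction concrete, a sketch (project, branch, and profile values are placeholders; the payload fields mirror the CLI example quoted later in this PR):

```bash
# Wrong: the CLI strips `role` as an unknown field ("Warning: unknown field: role")
# and ships an empty body, which triggers "Field 'role' is required".
databricks postgres create-role projects/<PROJECT_ID>/branches/<BRANCH_ID> \
  --role-id <SP_CLIENT_ID> \
  --json '{"role": {"spec": {"identity_type": "SERVICE_PRINCIPAL"}}}'

# Right: --json binds to the inner Role object, so fields go directly under "spec".
databricks postgres create-role projects/<PROJECT_ID>/branches/<BRANCH_ID> \
  --role-id <SP_CLIENT_ID> \
  --json '{"spec": {"identity_type": "SERVICE_PRINCIPAL", "postgres_role": "<SP_CLIENT_ID>", "auth_method": "LAKEBASE_OAUTH_V1"}}'
```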
…llback

Reframes the SP-grant guidance around the actual root cause: the Apps platform auto-creates the SP's Postgres role on deploy when the app declares a `database` resource (via `--set lakebase.postgres.branch=…` and `--set lakebase.postgres.database=…` at init time, materialized as a `database:` block in the app's `resources:` in `databricks.yml`). The manual `databricks_create_role()` SQL block from the prior commit moves to a fallback for shared/pre-existing Lakebases where the resource form isn't usable.

Adds a generic rule to `databricks-apps/SKILL.md` (Step 3 after init): verify the resources block in `databricks.yml` contains an entry for every required resource from every plugin in the manifest. The same shape of error appears for missing sql_warehouse (403) and missing genie_space (CAN_RUN denied), not just lakebase — so this lives in the apps skill, not the lakebase one.

Cross-refs and troubleshooting rows in `appkit/lakebase.md` updated to point at the resource fix first, manual SQL second.

Co-authored-by: Isaac
- The two `databricks psql` examples placed `--profile <PROFILE>` AFTER `--`, so the flag was forwarded to psql and caused `psql: error: unrecognized option`. Move it before `--` and add a short note explaining the separator semantics.
- The fallback `databricks postgres create-role` example shipped with `membership_roles: ["DATABRICKS_SUPERUSER"]`, contradicting the least-privilege grant block immediately above it. Remove it from the example and add a least-privilege caveat.
- `ALTER DEFAULT PRIVILEGES` without `FOR ROLE` only applies to tables created by the running role, so future synced tables created by the sync pipeline role won't pick up the grant. Add a caveat with both workarounds.
- The connectivity.md `resolve_host` snippet would crash with an unhandled FileNotFoundError when `dig` is missing. Wrap the subprocess.run call and raise a RuntimeError with installation guidance.

Co-authored-by: Isaac
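The separator fix, sketched with placeholder values (not the skill's literal text):

```bash
# Wrong: everything after `--` is forwarded verbatim to psql, which rejects the flag:
#   psql: error: unrecognized option: --profile
databricks psql --project <PROJECT> -- -d databricks_postgres -f script.sql --profile <PROFILE>

# Right: Databricks CLI flags before `--`, psql flags after.
databricks psql --project <PROJECT> --profile <PROFILE> -- -d databricks_postgres -f script.sql
```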
The local development snippet hardcoded PGUSER=<service principal client ID>, but AppKit's local dev server authenticates as the developer's Databricks user by default — so the OAuth token (PGPASSWORD) was for the developer while PGUSER named the SP, which Postgres rejects with "password authentication failed for user '<UUID>'".

Replace the hardcoded value with a note that explains both paths:

- Default (personal profile): PGUSER is your Databricks username/email.
- Testing the deployed flow locally: export DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET so the dev server authenticates as the SP, then PGUSER=<SP_CLIENT_ID> matches.

Co-authored-by: Isaac
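As an environment sketch of the two paths (values are placeholders; note the reviewer's later comment that the SP client secret can't be obtained locally, so the second path may not be usable in practice):

```bash
# Path 1 (default): the dev server authenticates as you, so PGUSER is your
# Databricks username/email.
export PGUSER="you@example.com"

# Path 2 (testing the deployed flow locally): authenticate the dev server as
# the SP so the OAuth token and PGUSER name the same principal.
export DATABRICKS_CLIENT_ID="<SP_CLIENT_ID>"
export DATABRICKS_CLIENT_SECRET="<SP_CLIENT_SECRET>"
export PGUSER="$DATABRICKS_CLIENT_ID"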
pkosiec left a comment
I like the databricks psql recommendation + the databricks.yml resources validation 👍 Here are a few comments.

Before the next round of review, please symlink the skills and try them yourself, just to ensure everything works as expected. Thank you!
If you skip this step, the Service Principal won't own the database schema. You'll create schemas under your credentials that the SP **cannot access** after deployment. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for the full workflow and recovery steps.

> **First deploy with `lakebase`:** confirm `databricks.yml` declares a `database` resource on the app (alongside `sql_warehouse`, `genie_space`, etc.). Apps platform auto-creates the SP's Postgres role only when the database is attached as an app resource — without it, the deployed app fails with `password authentication failed for user '<UUID>'`. If the resource is missing, re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`; if you can't (shared Lakebase, custom permissions), use the manual SQL fallback in the **`databricks-lakebase`** skill's **Grant app SP for AppKit / CRUD apps** section.
`database` is the old Lakebase resource name, so IMO the agent might be confused by it. The new (Lakebase Autoscaling) resource is `postgres`:
Example:

```yaml
bundle:
  name: lakebs

variables:
  postgres_branch:
    description: Full Lakebase Postgres branch resource name. Obtain by running `databricks postgres list-branches projects/{project-id}`, select the desired item from the output array and use its .name value.
  postgres_database:
    description: Full Lakebase Postgres database resource name. Obtain by running `databricks postgres list-databases {branch-name}`, select the desired item from the output array and use its .name value. Requires the branch resource name.

resources:
  apps:
    app:
      name: "lakebs"
      description: "A Databricks App powered by AppKit"
      source_code_path: ./
      # Uncomment to enable on behalf of user API scopes. Available scopes: sql, dashboards.genie, files.files, serving.serving-endpoints
      # user_api_scopes:
      #   - sql
      # The resources which this app has access to.
      resources:
        - name: postgres
          postgres:
            branch: ${var.postgres_branch}
            database: ${var.postgres_database}
            permission: CAN_CONNECT_AND_CREATE

targets:
  default:
    default: true
    workspace:
      host: https://e2-dogfood.staging.cloud.databricks.com
    variables:
      postgres_branch: projects/pkosiec/branches/production
      postgres_database: projects/pkosiec/branches/production/databases/db-dmfv-24qipl4z1k
```
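The variable descriptions above already spell out where the values come from; as runnable commands, roughly (IDs are placeholders):

```bash
# Pick the .name of the desired branch from the output array:
databricks postgres list-branches projects/<PROJECT_ID>
# Then pick the .name of the desired database (requires the branch name):
databricks postgres list-databases <BRANCH_NAME>
```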
> **`PGUSER` must match the credentials the AppKit dev server uses.** The Postgres role in `PGUSER` has to correspond to the principal that produced `PGPASSWORD` (the OAuth token).
>
> - **Default (personal Databricks profile):** AppKit's local server authenticates as your Databricks user, so `PGUSER` is your Databricks username/email. Tables created locally will be owned by your user, not the SP — that's why the deploy-first workflow exists.
> - **Testing the deployed flow locally:** export `DATABRICKS_CLIENT_ID=<SP_CLIENT_ID>` and `DATABRICKS_CLIENT_SECRET=...` so the dev server authenticates as the SP. Then `PGUSER=<SP_CLIENT_ID>` matches.
This isn't possible locally; we cannot get the Service Principal's client secret.
> **First deploy with `lakebase`:** confirm `databricks.yml` declares a `database` resource on the app (alongside `sql_warehouse`, `genie_space`, etc.). Apps platform auto-creates the SP's Postgres role only when the database is attached as an app resource — without it, the deployed app fails with `password authentication failed for user '<UUID>'`. If the resource is missing, re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`; if you can't (shared Lakebase, custom permissions), use the manual SQL fallback in the **`databricks-lakebase`** skill's **Grant app SP for AppKit / CRUD apps** section.
Can we avoid duplicating the apps init command and instead point the agent to the Scaffolding section? It'll be hard to maintain so many occurrences of the init command. Thanks!
## DNS Resolution (macOS)

Python's `socket.getaddrinfo()` can fail with long Lakebase hostnames on macOS. Workaround: resolve via `dig`, then pass the IP through `hostaddr` while keeping `host` for TLS SNI.

```bash
# Resolve the Lakebase hostname to an IP
dig +short <ENDPOINT_HOST>
```

```python
import subprocess

import psycopg


def resolve_host(hostname: str) -> str:
    try:
        result = subprocess.run(
            ["dig", "+short", hostname], capture_output=True, text=True, check=False
        )
    except FileNotFoundError as e:
        raise RuntimeError(
            "'dig' is not installed; install it (e.g. `apt-get install dnsutils`) "
            "or use socket.getaddrinfo() instead"
        ) from e
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError(f"DNS resolution failed for {hostname}")
    return lines[0]


ip = resolve_host(endpoint_host)

conn = psycopg.connect(
    host=endpoint_host,  # kept for TLS SNI verification
    hostaddr=ip,         # bypasses getaddrinfo()
    dbname="databricks_postgres",
    user=username,
    password=token,
    sslmode="require",
)
```
It was part of my previous PR but looks like it is not needed after all - could you please cherry-pick your commits on top of main to ensure only your changes are added here? Thanks!
**DO NOT guess** plugin names, resource keys, or property names — always derive them from `databricks apps manifest` output. Example: if the manifest shows plugin `analytics` with a required resource `resourceKey: "sql-warehouse"` and `fields: { "id": ... }`, include `--set analytics.sql-warehouse.id=<ID>`.

3. **Verify resources after init.** Open `databricks.yml` and confirm `resources.apps.<app>.resources` contains a block for **every** required resource from every plugin you included (manifest's `resources.required`). For example, `--features analytics,genie,lakebase` must produce three blocks: `sql_warehouse`, `genie_space`, **and** `database`. A missing resource means the Apps platform won't grant the SP access to that resource at deploy time, and the app will fail at runtime — typically with `password authentication failed for user '<SP_UUID>'` (Lakebase), `403` from the SQL warehouse, or `CAN_RUN denied` (Genie). Fix by re-running `init` with the missing `--set` flag, not by hand-editing the YAML — the YAML is a generated artifact and your edit will be lost the next time someone re-scaffolds.
Same comment as above:

- `database` isn't the right resource name
- it would be great to point to the Scaffolding section again
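A quick way to run that verification from the shell — a sketch assuming the mikefarah yq v4 is installed; the skill itself doesn't prescribe this:

```bash
# List every resource declared on the app in databricks.yml; compare against
# the manifest's resources.required for each plugin you enabled.
yq '.resources.apps[].resources[].name' databricks.yml
```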
> - **Default (personal Databricks profile):** AppKit's local server authenticates as your Databricks user, so `PGUSER` is your Databricks username/email. Tables created locally will be owned by your user, not the SP — that's why the deploy-first workflow exists.
> - **Testing the deployed flow locally:** export `DATABRICKS_CLIENT_ID=<SP_CLIENT_ID>` and `DATABRICKS_CLIENT_SECRET=...` so the dev server authenticates as the SP. Then `PGUSER=<SP_CLIENT_ID>` matches.
>
> If `PGUSER` and the OAuth token disagree, Postgres rejects the connection with `password authentication failed for user '<UUID>'`.
Honestly, I'd rather revert the changes from line 264 to 277: they don't seem to bring any benefit, unless I'm mistaken?

Maybe we can just change line 266 and say that:

- PGUSER doesn't need to be provided when running the app locally (why? we use the user currently logged in to the Databricks CLI -> don't say that in the skill, no need)
- PGUSER is injected automatically when the app is deployed on Databricks Apps
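A minimal sketch of what the suggested guidance amounts to (no AppKit-specific commands assumed):

```bash
# Local run: leave PGUSER unset — the dev server authenticates as the
# currently logged-in Databricks CLI user.
unset PGUSER
# On Databricks Apps, PGUSER is injected automatically at deploy time,
# so there is nothing to configure there either.
```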
| `permission denied for schema <name>` | Schema was created by another role (e.g. you ran locally before deploying) | **Ask the user before dropping** — `DROP SCHEMA` deletes all data. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options |
| Works locally but `permission denied` after deploy | Local credentials created the schema; the SP can't access schemas it doesn't own | **Ask the user before dropping** — warn about data loss, then deploy first. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options |
| `connection refused` | Pool not connected or wrong env vars | Check `PGHOST`, `PGPORT`, `LAKEBASE_ENDPOINT` are set |
| `password authentication failed for user '<UUID>'` | App's `databricks.yml` is missing a `database` resource — Apps platform never auto-created the SP's Postgres role on attach | Add the missing `database` resource (re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`), redeploy. Manual SQL fallback: see **`databricks-lakebase`**'s **Grant app SP for AppKit / CRUD apps** |

Same as before - let's point to the Scaffolding section.
For least-privilege, consider syncing into a dedicated schema instead of `public` so the grant is scoped to synced data only.

> **Default privileges caveat.** `ALTER DEFAULT PRIVILEGES` without `FOR ROLE` only applies to tables created by the role running this statement. If sync pipelines create new tables under a different role, re-run `GRANT SELECT ON ALL TABLES IN SCHEMA public TO "<SP_CLIENT_ID>"` after each new table appears, or add `FOR ROLE <pipeline_role>` once you know which role the sync runs as.

This is what the snippet above (242-246) already does; does it make sense to repeat it?
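For the `FOR ROLE` variant mentioned in the caveat, a sketch (role and project names are placeholders, and this assumes the `databricks psql` wrapper forwards stdin the way plain psql does):

```bash
# Make future tables created by the sync pipeline's role readable by the SP.
# <PIPELINE_ROLE> stands for whatever role the sync actually runs as.
databricks psql --project <PROJECT> -- -d databricks_postgres <<'SQL'
ALTER DEFAULT PRIVILEGES FOR ROLE "<PIPELINE_ROLE>" IN SCHEMA public
  GRANT SELECT ON TABLES TO "<SP_CLIENT_ID>";
SQL
```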
**Grant app SP for AppKit / CRUD apps** (full DML).

> **First check: is the Lakebase declared as an app resource?** When the Apps platform attaches a `database` resource (declared in the app's `databricks.yml` under `resources.apps.<app>.resources`) to an app on deploy, it auto-creates the SP's Postgres role with `CAN_CONNECT_AND_CREATE`. If the SP is failing to connect with `password authentication failed for user '<SP_CLIENT_ID>'`, the most likely cause is a missing `database` resource — fix that first, redeploy, and the auto-grant fires. See the `databricks-apps` skill (Scaffolding) for verifying every required plugin resource is declared.
>
> The SQL block below is the **fallback** for cases the resource form doesn't cover: granting access to an existing Lakebase the app spec doesn't own (shared across apps, pre-existing schema with custom permissions, post-hoc grants for additional tables/sequences).

Manual fallback — create the role and grant DML, in one psql round-trip:

```sql
CREATE EXTENSION IF NOT EXISTS databricks_auth;

DO $$
DECLARE
  sp TEXT := '<SP_CLIENT_ID>';  -- from `databricks apps get <APP> -o json | jq -r .service_principal_client_id`
BEGIN
  PERFORM databricks_create_role(sp, 'SERVICE_PRINCIPAL');
  EXECUTE format('GRANT CONNECT ON DATABASE "databricks_postgres" TO %I', sp);
  EXECUTE format('GRANT ALL ON SCHEMA public TO %I', sp);
  EXECUTE format('GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO %I', sp);
  EXECUTE format('GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO %I', sp);
  EXECUTE format('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO %I', sp);
  EXECUTE format('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO %I', sp);
END $$;
```

Pipe through `databricks psql` (above). The block is idempotent; re-running is safe.

The role-creation step alone has a CLI form too (useful when granting privileges separately):

```bash
databricks postgres create-role projects/<PROJECT_ID>/branches/<BRANCH_ID> \
  --role-id <SP_CLIENT_ID> \
  --json '{"spec":{"identity_type":"SERVICE_PRINCIPAL","postgres_role":"<SP_CLIENT_ID>","auth_method":"LAKEBASE_OAUTH_V1"}}' \
  --profile <PROFILE>
```

> **Least privilege.** The example creates the role with default privileges only — grant database/schema/table access via the explicit `GRANT` statements above. Don't add `membership_roles: ["DATABRICKS_SUPERUSER"]` for an app SP unless broad administrative access is intentional; superuser membership lets the app role read every Lakebase database, not just its own.

> **CLI body shape.** `databricks postgres create-role`'s `--json` flag binds to the inner `Role` object — fields go directly under `spec`, **not** wrapped in `{"role": ...}`. The error `Field 'role' is required and must contain at least one subfield with a non-default value` means the inner Role had no recognized fields (often because someone wrapped the body, which the CLI strips with `Warning: unknown field: role` and ships an empty body). The CLI also doesn't yet expose convenience flags like `--spec.identity-type` ([cmd/workspace/postgres/postgres.go](https://github.com/databricks/cli/blob/main/cmd/workspace/postgres/postgres.go) marks `spec` as TODO), so you must hand-craft the JSON.
I don't think we should do that: once the Lakebase project (branch) is added as a resource to an App, the App runtime creates proper roles with proper permissions. In this PR you already corrected the agent to start again from the scaffolding section if the resource is not in the databricks.yml (that's a very good addition).

But let's not try to work around the whole App resource mechanism. An App that uses Lakebase must define the project as an app resource - this is a strict prerequisite here.
Summary
Stacks on top of #56. Three related fixes for what an agent should do when an AppKit Lakebase app fails on first deploy with `password authentication failed for user '<SP_UUID>'`.

The original framing of this PR (manual `databricks_create_role()` SQL as the standard step) was wrong — it was reflecting a workaround for my setup gap, not the platform's intended path. Pawel pointed out that the Apps platform auto-creates the SP's Postgres role at attach time when the app declares a `database` resource (via `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...` at `apps init`, which materializes as a `database:` block in the app's `resources:` in `databricks.yml`). The auth-failed error happens when that resource is missing — the platform never knows the app needs Lakebase access. The fix is to add the resource and redeploy; manual SQL is only needed when the resource form isn't an option (shared Lakebase, custom permissions, post-hoc grants).

The PR has been reframed accordingly. Changes:
- `skills/databricks-apps/SKILL.md` — generic rule, not lakebase-specific: after `databricks apps init`, verify `databricks.yml` declares every required resource from every plugin in the manifest. The same shape of error appears for missing `sql_warehouse` (403) and missing `genie_space` (CAN_RUN denied), not just `database` — so this rule belongs in the apps skill, not the lakebase one.
- `skills/databricks-apps/references/appkit/lakebase.md` — cross-reference: the `database` resource (re-run `apps init` with the right `--set` flags) is the primary fix; manual SQL is the fallback for shared/pre-existing Lakebases.
- `skills/databricks-lakebase/SKILL.md` — what's still useful here:
  - the `databricks psql --project … -- -d databricks_postgres -f script.sql` recipe (one command vs the existing 5-line `generate-database-credential` + `PGPASSWORD` form). Useful in its own right.
  - the `databricks postgres create-role` CLI form, with a body-shape note (`--json '{"spec": {...}}'`, no `{"role": ...}` wrapper). The CLI body-shape gap is being addressed separately in "postgres: add `--json` body example to create-role help" (cli#5110).
  - troubleshooting rows for `password authentication failed for user '<UUID>'` (now pointing at the missing resource as the primary cause) and the `Field 'role' is required` CLI error (explaining the wrapping confusion).

Test plan
- `python3.12 scripts/skills.py validate` passes
- e2-dogfood: AppKit app with no `database` resource → SP fails with `password authentication failed`. Adding the resource via `databricks apps init --set lakebase.postgres.branch=... --set lakebase.postgres.database=...` and redeploying produced an auto-created SP role and the app started clean. The fallback SQL path is also tested independently.

This pull request and its description were written by Isaac.