apps+lakebase: declare-resource as primary path, SQL grant as fallback, psql wrapper #59

Open
jamesbroadhead wants to merge 5 commits into main from james.broadhead/lakebase-sp-grant

Conversation


@jamesbroadhead commented Apr 28, 2026

Summary

Stacks on top of #56. Three related fixes for what an agent should do when an AppKit Lakebase app fails on first deploy with `password authentication failed for user '<SP_UUID>'`.

The original framing of this PR (manual `databricks_create_role()` SQL as the standard step) was wrong — it was reflecting a workaround for my setup gap, not the platform's intended path. Pawel pointed out that the Apps platform auto-creates the SP's Postgres role at attach time when the app declares a database resource (via `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...` at `apps init`, which materializes as a `database:` block in the app's `resources:` in `databricks.yml`). The auth-failed error happens when that resource is missing — the platform never knows the app needs Lakebase access. The fix is to add the resource and redeploy; manual SQL is only needed when the resource form isn't an option (shared Lakebase, custom permissions, post-hoc grants).
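In concrete terms, the declare-resource path described above is roughly the following (a sketch based on the repro in the test plan; the branch/database values are placeholders, and any other `apps init` flags you normally pass are omitted):

```bash
# Re-scaffold so databricks.yml gains the Lakebase resource block, then redeploy the app.
databricks apps init \
  --set lakebase.postgres.branch=<BRANCH_RESOURCE_NAME> \
  --set lakebase.postgres.database=<DATABASE_RESOURCE_NAME>
```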

The PR has been reframed accordingly. Changes:

skills/databricks-apps/SKILL.md — generic rule, not lakebase-specific:

  • New Step 3 in the scaffolding workflow: after `databricks apps init`, verify `databricks.yml` declares every required resource from every plugin in the manifest. Same shape of error appears for missing `sql_warehouse` (403) and missing `genie_space` (CAN_RUN denied), not just `database` — so this rule belongs in the apps skill, not the lakebase one.

skills/databricks-apps/references/appkit/lakebase.md — cross-reference:

  • Prerequisites callout and troubleshooting row both flipped: declare the `database` resource (re-run `apps init` with the right `--set` flags) is the primary fix; manual SQL is the fallback for shared/pre-existing Lakebases.

skills/databricks-lakebase/SKILL.md — what's still useful here:

  • `databricks psql --project … -- -d databricks_postgres -f script.sql` recipe (one command vs the existing 5-line `generate-database-credential` + `PGPASSWORD` form); see the sketch after this list. Useful in its own right.
  • "Grant app SP for AppKit / CRUD apps" SQL block, now clearly marked as a fallback to the resource-declaration path.
  • Working `databricks postgres create-role` CLI form, with a body-shape note (`--json '{"spec": {...}}'`, no `{"role": ...}` wrapper). The CLI body-shape gap is being addressed separately in cli#5110 ("postgres: add --json body example to create-role help").
  • Two troubleshooting rows: `password authentication failed for user '<UUID>'` (now points at missing-resource as the primary cause) and the `Field 'role' is required` CLI error (explains the wrapping confusion).
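For reference, a sketch of the one-command `databricks psql` form from the first bullet above. Flags for the Databricks CLI (including `--profile`, per a later fix in this PR) go before the `--` separator; everything after `--` is forwarded to psql. `grant_sp.sql` is a placeholder filename:

```bash
# Flags before `--` go to the Databricks CLI; flags after `--` are passed through to psql.
databricks psql --project <PROJECT_ID> --profile <PROFILE> -- -d databricks_postgres -f grant_sp.sql
```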

Test plan

  • `python3.12 scripts/skills.py validate` passes
  • Reproduced the original failure on e2-dogfood: AppKit app with no database resource → SP fails with `password authentication failed`. Adding the resource via `databricks apps init --set lakebase.postgres.branch=... --set lakebase.postgres.database=...` and redeploying produced an auto-created SP role and the app started clean. The fallback SQL path is also tested independently.
  • Reviewer confirms the Apps platform auto-grant behavior is the intended path documented elsewhere (Pawel's claim, which matches my repro).

This pull request and its description were written by Isaac.

Comment thread on skills/databricks-apps/references/appkit/lakebase.md (Outdated)
@jamesbroadhead (Author)

Pushed a correction (84e0581) — turns out the CLI is not broken; I'd been wrapping the JSON wrong. The CLI's `--json` binds to the inner Role object, so the working form is `'{"spec": {"identity_type": "SERVICE_PRINCIPAL", ...}}'` (no outer `{"role": ...}` wrapper).

Updated the SKILL.md to show both forms (CLI and SQL) — SQL is still preferable in practice because you can bundle role creation + grants in one psql round-trip — and reframed the troubleshooting row around the wrapping confusion rather than saying the CLI is broken.

The real CLI gap (no convenience flags for the nested spec fields, marked TODO in cmd/workspace/postgres/postgres.go) deserves a separate upstream PR; I'll file that on databricks/cli.
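For reference, the working CLI form (the same one quoted from the skill later in this thread; project, branch, SP client ID, and profile are placeholders):

```bash
# Fields bind to the inner Role object, directly under "spec" (no outer {"role": ...} wrapper).
databricks postgres create-role projects/<PROJECT_ID>/branches/<BRANCH_ID> \
  --role-id <SP_CLIENT_ID> \
  --json '{"spec":{"identity_type":"SERVICE_PRINCIPAL","postgres_role":"<SP_CLIENT_ID>","auth_method":"LAKEBASE_OAUTH_V1"}}' \
  --profile <PROFILE>
```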

@jamesbroadhead changed the title from "lakebase: SP grant for AppKit/CRUD apps and databricks psql recipe" to "apps+lakebase: declare-resource as primary path, SQL grant as fallback, psql wrapper" on Apr 28, 2026
@jamesbroadhead (Author)

Big reframe pushed in 03810f6 — Pawel pointed out that the Apps platform auto-creates the SP's Postgres role when the app declares a database resource, so my original "manual SQL grant" framing was treating a setup gap (my databricks.yml was missing the resource) as the standard workflow.

Updated:

  • Primary path is now declare-resource (re-run apps init with --set lakebase.postgres.branch=... + --set lakebase.postgres.database=...), redeploy, platform handles the grant.
  • Manual SQL block is kept as the fallback for shared Lakebases / pre-existing schemas / post-hoc grants where the resource form isn't usable.
  • Added a generic rule to databricks-apps/SKILL.md Step 3: after apps init, verify every required plugin resource is declared. The same failure shape (missing resource → SP can't access at runtime) hits sql_warehouse (403) and genie_space (CAN_RUN denied) too, not just lakebase, so this rule belongs in the apps skill.

Updated PR title and body to match. Diff is now 18 insertions / 14 deletions across three files.
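A minimal version of that post-init check (a sketch, not a prescribed command sequence; per the skill text quoted later in this thread, `databricks apps manifest` lists each plugin's required resources, and the grep is just an eyeball check of the generated YAML):

```bash
databricks apps manifest                  # see which resources each included plugin requires
grep -n -A4 'resources:' databricks.yml   # confirm a block exists for every required resource
```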

@pkosiec force-pushed the pkosiec/lakebase-synced-tables branch from ae8e36f to ea7dc03 on April 30, 2026 10:44
Adds the SQL block (with `databricks_create_role` + DML grants) needed
once after the first deploy of an AppKit Lakebase app — without it the
SP gets `password authentication failed`. Promotes `databricks psql` as
the runnable form of #56's manual `generate-database-credential` recipe.
Also flags that the `databricks postgres create-role` CLI rejects every
SP-role payload, so agents stop trying to use it.

Co-authored-by: Isaac
The original commit told agents the `databricks postgres create-role`
CLI couldn't create SP roles. That was wrong — the CLI works with
`--json '{"spec": {...}}'` (fields go on the inner Role object, not
wrapped under `{"role": ...}`). The "Field 'role' is required" error
fires when the inner Role has no recognized fields, which happens when
the body is wrapped — the CLI strips `role` as unknown and ships an
empty body.

Show the working CLI form alongside the SQL block (`databricks_create_role()`
is still preferable in practice because it bundles role creation + grants
into one psql round-trip), and rewrite the troubleshooting row to point
at the wrapping confusion instead of saying the CLI is broken.

Also calls out that the CLI doesn't yet expose convenience flags for
nested spec fields (TODOs in cmd/workspace/postgres/postgres.go) — that
is the real gap, and a separate CLI PR is appropriate.

Co-authored-by: Isaac
…llback

Reframes the SP-grant guidance around the actual root cause: the Apps
platform auto-creates the SP's Postgres role on deploy when the app
declares a `database` resource (via `--set lakebase.postgres.branch=…`
and `--set lakebase.postgres.database=…` at init time, materialized as a
`database:` block in the app's `resources:` in `databricks.yml`). The
manual `databricks_create_role()` SQL block from the prior commit moves
to a fallback for shared/pre-existing Lakebases where the resource form
isn't usable.

Adds a generic rule to `databricks-apps/SKILL.md` (Step 3 after init):
verify the resources block in `databricks.yml` contains an entry for
every required resource from every plugin in the manifest. Same shape
of error appears for missing sql_warehouse (403) and missing genie_space
(CAN_RUN denied), not just lakebase — so this lives in the apps skill,
not the lakebase one.

Cross-refs and troubleshooting rows in `appkit/lakebase.md` updated to
point at the resource fix first, manual SQL second.

Co-authored-by: Isaac
- The two `databricks psql` examples placed `--profile <PROFILE>` AFTER
  `--`, so the flag was forwarded to psql and caused
  `psql: error: unrecognized option`. Move it before `--` and add a
  short note explaining the separator semantics.
- The fallback `databricks postgres create-role` example shipped with
  `membership_roles: ["DATABRICKS_SUPERUSER"]`, contradicting the
  least-privilege grant block immediately above it. Remove it from the
  example and add a least-privilege caveat.
- `ALTER DEFAULT PRIVILEGES` without `FOR ROLE` only applies to tables
  created by the running role, so future synced tables created by the
  sync pipeline role won't pick up the grant. Add a caveat with both
  workarounds (see the sketch after this commit message).
- The connectivity.md `resolve_host` snippet would crash with
  unhandled FileNotFoundError when `dig` is missing. Wrap the
  subprocess.run call and raise a RuntimeError with installation
  guidance.

Co-authored-by: Isaac
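A hedged sketch of the `FOR ROLE` workaround from the bullet above, piped through `databricks psql`; the project, profile, pipeline role, and SP client ID are placeholders you'd substitute:

```bash
# Scope default privileges to the role the sync pipeline runs as, so tables it creates later
# are readable by the app SP without re-granting after every sync.
databricks psql --project <PROJECT_ID> --profile <PROFILE> -- -d databricks_postgres -c \
  'ALTER DEFAULT PRIVILEGES FOR ROLE "<PIPELINE_ROLE>" IN SCHEMA public GRANT SELECT ON TABLES TO "<SP_CLIENT_ID>";'
```
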
The local development snippet hardcoded PGUSER=<service principal client ID>,
but AppKit's local dev server authenticates as the developer's Databricks
user by default — so the OAuth token (PGPASSWORD) was for the developer
while PGUSER named the SP, which Postgres rejects with
"password authentication failed for user '<UUID>'".

Replace the hardcoded value with a note that explains both paths:
- Default (personal profile): PGUSER is your Databricks username/email.
- Testing the deployed flow locally: export DATABRICKS_CLIENT_ID and
  DATABRICKS_CLIENT_SECRET so the dev server authenticates as the SP,
  then PGUSER=<SP_CLIENT_ID> matches.

Co-authored-by: Isaac
@jamesbroadhead force-pushed the james.broadhead/lakebase-sp-grant branch from 900b3d3 to eb8ee35 on April 30, 2026 16:02
@jamesbroadhead changed the base branch from pkosiec/lakebase-synced-tables to main on April 30, 2026 16:03
@pkosiec (Member) left a comment

I like the `databricks psql` recommendation + the `databricks.yml` resources validation 👍 Here are a few comments.

Before next round of review, please symlink the skills and try them by yourself, just to ensure everything works as expected. Thank you!


If you skip this step, the Service Principal won't own the database schema. You'll create schemas under your credentials that the SP **cannot access** after deployment. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for the full workflow and recovery steps.

> **First deploy with `lakebase`:** confirm `databricks.yml` declares a `database` resource on the app (alongside `sql_warehouse`, `genie_space`, etc.). Apps platform auto-creates the SP's Postgres role only when the database is attached as an app resource — without it, the deployed app fails with `password authentication failed for user '<UUID>'`. If the resource is missing, re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`; if you can't (shared Lakebase, custom permissions), use the manual SQL fallback in the **`databricks-lakebase`** skill's **Grant app SP for AppKit / CRUD apps** section.
@pkosiec (Member):

`database` is the resource name for the old Lakebase, so IMO the agent might be confused by it.

The new (Lakebase Autoscaling) resource is `postgres`:

Example:

bundle:
  name: lakebs

variables:
  postgres_branch:
    description: Full Lakebase Postgres branch resource name. Obtain by running `databricks postgres list-branches projects/{project-id}`, select the desired item from the output array and use its .name value.
  postgres_database:
    description: Full Lakebase Postgres database resource name. Obtain by running `databricks postgres list-databases {branch-name}`, select the desired item from the output array and use its .name value. Requires the branch resource name.

resources:
  apps:
    app:
      name: "lakebs"
      description: "A Databricks App powered by AppKit"
      source_code_path: ./
      # Uncomment to enable on behalf of user API scopes. Available scopes: sql, dashboards.genie, files.files, serving.serving-endpoints
      # user_api_scopes:
      #   - sql

      # The resources which this app has access to.
      resources:
        - name: postgres
          postgres:
            branch: ${var.postgres_branch}
            database: ${var.postgres_database}
            permission: CAN_CONNECT_AND_CREATE

targets:
  default:
    default: true
    workspace:
      host: https://e2-dogfood.staging.cloud.databricks.com

    variables:
      postgres_branch: projects/pkosiec/branches/production
      postgres_database: projects/pkosiec/branches/production/databases/db-dmfv-24qipl4z1k

> **`PGUSER` must match the credentials the AppKit dev server uses.** The Postgres role in `PGUSER` has to correspond to the principal that produced `PGPASSWORD` (the OAuth token).
>
> - **Default (personal Databricks profile):** AppKit's local server authenticates as your Databricks user, so `PGUSER` is your Databricks username/email. Tables created locally will be owned by your user, not the SP — that's why the deploy-first workflow exists.
> - **Testing the deployed flow locally:** export `DATABRICKS_CLIENT_ID=<SP_CLIENT_ID>` and `DATABRICKS_CLIENT_SECRET=...` so the dev server authenticates as the SP. Then `PGUSER=<SP_CLIENT_ID>` matches.
@pkosiec (Member):

This isn't possible locally; we cannot get the Service Principal's client secret.


If you skip this step, the Service Principal won't own the database schema. You'll create schemas under your credentials that the SP **cannot access** after deployment. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for the full workflow and recovery steps.

> **First deploy with `lakebase`:** confirm `databricks.yml` declares a `database` resource on the app (alongside `sql_warehouse`, `genie_space`, etc.). Apps platform auto-creates the SP's Postgres role only when the database is attached as an app resource — without it, the deployed app fails with `password authentication failed for user '<UUID>'`. If the resource is missing, re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`; if you can't (shared Lakebase, custom permissions), use the manual SQL fallback in the **`databricks-lakebase`** skill's **Grant app SP for AppKit / CRUD apps** section.
@pkosiec (Member):

Can we avoid duplicating the `apps init` command and instead point the agent to the Scaffolding section? It'll be hard to maintain so many occurrences of the init command. Thanks!

Comment on lines +143 to +178
## DNS Resolution (macOS)

Python's `socket.getaddrinfo()` can fail with long Lakebase hostnames on macOS. Workaround: resolve via `dig`, then pass the IP through `hostaddr` while keeping `host` for TLS SNI.

```bash
# Resolve the Lakebase hostname to an IP
dig +short <ENDPOINT_HOST>
```

```python
import subprocess

import psycopg

def resolve_host(hostname: str) -> str:
    try:
        result = subprocess.run(
            ["dig", "+short", hostname], capture_output=True, text=True, check=False
        )
    except FileNotFoundError as e:
        raise RuntimeError("'dig' is not installed; install it (e.g. `apt-get install dnsutils`) or use socket.getaddrinfo() instead") from e
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError(f"DNS resolution failed for {hostname}")
    return lines[0]

# endpoint_host, username, and token are assumed to be defined earlier in the doc's connection setup.
ip = resolve_host(endpoint_host)

conn = psycopg.connect(
    host=endpoint_host,  # kept for TLS SNI verification
    hostaddr=ip,         # bypasses getaddrinfo()
    dbname="databricks_postgres",
    user=username,
    password=token,
    sslmode="require",
)
```

@pkosiec (Member):

It was a part of my previous PR but looks like it is not needed after all - could you please cherry pick your commits on top of main to ensure only your changes are added here? Thanks!
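One way to do the requested cherry-pick, using the branch names from this PR (the commit SHAs are placeholders; pick the five commits that belong to this PR and verify the result before force-pushing):

```bash
git fetch origin
git checkout -B james.broadhead/lakebase-sp-grant origin/main
git cherry-pick <FIRST_PR_COMMIT>^..<LAST_PR_COMMIT>   # replay only this PR's commits onto main
git push --force-with-lease origin james.broadhead/lakebase-sp-grant
```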


**DO NOT guess** plugin names, resource keys, or property names — always derive them from `databricks apps manifest` output. Example: if the manifest shows plugin `analytics` with a required resource `resourceKey: "sql-warehouse"` and `fields: { "id": ... }`, include `--set analytics.sql-warehouse.id=<ID>`.

3. **Verify resources after init.** Open `databricks.yml` and confirm `resources.apps.<app>.resources` contains a block for **every** required resource from every plugin you included (manifest's `resources.required`). For example, `--features analytics,genie,lakebase` must produce three blocks: `sql_warehouse`, `genie_space`, **and** `database`. A missing resource means the Apps platform won't grant the SP access to that resource at deploy time, and the app will fail at runtime — typically with `password authentication failed for user '<SP_UUID>'` (Lakebase), `403` from the SQL warehouse, or `CAN_RUN denied` (Genie). Fix by re-running `init` with the missing `--set` flag, not by hand-editing the YAML — the YAML is a generated artifact and your edit will be lost the next time someone re-scaffolds.
@pkosiec (Member):

Same comment as above:

  • database isn't the right resource name
  • it would be great to point to the Scaffolding section again

> - **Default (personal Databricks profile):** AppKit's local server authenticates as your Databricks user, so `PGUSER` is your Databricks username/email. Tables created locally will be owned by your user, not the SP — that's why the deploy-first workflow exists.
> - **Testing the deployed flow locally:** export `DATABRICKS_CLIENT_ID=<SP_CLIENT_ID>` and `DATABRICKS_CLIENT_SECRET=...` so the dev server authenticates as the SP. Then `PGUSER=<SP_CLIENT_ID>` matches.
>
> If `PGUSER` and the OAuth token disagree, Postgres rejects the connection with `password authentication failed for user '<UUID>'`.
@pkosiec (Member):

Honestly, I'd rather revert the changes from line 264 to 277: they don't seem to bring any benefit, unless I'm mistaken? Maybe we can just change line 266 and say that:

  • PGUSER doesn't need to be provided when running the app locally (why? we use the user currently logged in to the Databricks CLI -> no need to say that in the skill)
  • PGUSER is injected automatically when the app is deployed on Databricks Apps

| `permission denied for schema <name>` | Schema was created by another role (e.g. you ran locally before deploying) | **Ask the user before dropping** — `DROP SCHEMA` deletes all data. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options |
| Works locally but `permission denied` after deploy | Local credentials created the schema; the SP can't access schemas it doesn't own | **Ask the user before dropping** — warn about data loss, then deploy first. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options |
| `connection refused` | Pool not connected or wrong env vars | Check `PGHOST`, `PGPORT`, `LAKEBASE_ENDPOINT` are set |
| `password authentication failed for user '<UUID>'` | App's `databricks.yml` is missing a `database` resource — Apps platform never auto-created the SP's Postgres role on attach | Add the missing `database` resource (re-run `databricks apps init` with `--set lakebase.postgres.branch=...` and `--set lakebase.postgres.database=...`), redeploy. Manual SQL fallback: see **`databricks-lakebase`**'s **Grant app SP for AppKit / CRUD apps** |
@pkosiec (Member):

Same as before: let's point to the Scaffolding section.

For least-privilege, consider syncing into a dedicated schema instead of `public` so the grant is scoped to synced data only.

> **Default privileges caveat.** `ALTER DEFAULT PRIVILEGES` without `FOR ROLE` only applies to tables created by the role running this statement. If sync pipelines create new tables under a different role, re-run `GRANT SELECT ON ALL TABLES IN SCHEMA public TO "<SP_CLIENT_ID>"` after each new table appears, or add `FOR ROLE <pipeline_role>` once you know which role the sync runs as.
@pkosiec (Member):

This is what the snippet above (lines 242-246) already does; does it make sense to repeat it?

Comment on lines +250 to +285
**Grant app SP for AppKit / CRUD apps** (full DML).

> **First check: is the Lakebase declared as an app resource?** When the Apps platform attaches a `database` resource (declared in the app's `databricks.yml` under `resources.apps.<app>.resources`) to an app on deploy, it auto-creates the SP's Postgres role with `CAN_CONNECT_AND_CREATE`. If the SP is failing to connect with `password authentication failed for user '<SP_CLIENT_ID>'`, the most likely cause is a missing `database` resource — fix that first, redeploy, and the auto-grant fires. See the `databricks-apps` skill (Scaffolding) for verifying every required plugin resource is declared.
>
> The SQL block below is the **fallback** for cases the resource form doesn't cover: granting access to an existing Lakebase the app spec doesn't own (shared across apps, pre-existing schema with custom permissions, post-hoc grants for additional tables/sequences).

Manual fallback — create the role and grant DML, in one psql round-trip:
```sql
CREATE EXTENSION IF NOT EXISTS databricks_auth;

DO $$
DECLARE
  sp TEXT := '<SP_CLIENT_ID>'; -- from `databricks apps get <APP> -o json | jq -r .service_principal_client_id`
BEGIN
  PERFORM databricks_create_role(sp, 'SERVICE_PRINCIPAL');
  EXECUTE format('GRANT CONNECT ON DATABASE "databricks_postgres" TO %I', sp);
  EXECUTE format('GRANT ALL ON SCHEMA public TO %I', sp);
  EXECUTE format('GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO %I', sp);
  EXECUTE format('GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO %I', sp);
  EXECUTE format('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO %I', sp);
  EXECUTE format('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO %I', sp);
END $$;
```
Pipe through `databricks psql` (above). The block is idempotent; re-running is safe.

The role-creation step alone has a CLI form too (useful when granting privileges separately):
```bash
databricks postgres create-role projects/<PROJECT_ID>/branches/<BRANCH_ID> \
  --role-id <SP_CLIENT_ID> \
  --json '{"spec":{"identity_type":"SERVICE_PRINCIPAL","postgres_role":"<SP_CLIENT_ID>","auth_method":"LAKEBASE_OAUTH_V1"}}' \
  --profile <PROFILE>
```

> **Least privilege.** The example creates the role with default privileges only — grant database/schema/table access via the explicit `GRANT` statements above. Don't add `membership_roles: ["DATABRICKS_SUPERUSER"]` for an app SP unless broad administrative access is intentional; superuser membership lets the app role read every Lakebase database, not just its own.

> **CLI body shape.** `databricks postgres create-role`'s `--json` flag binds to the inner `Role` object — fields go directly under `spec`, **not** wrapped in `{"role": ...}`. The error `Field 'role' is required and must contain at least one subfield with a non-default value` means the inner Role had no recognized fields (often because someone wrapped the body, which the CLI strips with `Warning: unknown field: role` and ships an empty body). The CLI also doesn't yet expose convenience flags like `--spec.identity-type` ([cmd/workspace/postgres/postgres.go](https://github.com/databricks/cli/blob/main/cmd/workspace/postgres/postgres.go) marks `spec` as TODO), so you must hand-craft the JSON.
@pkosiec (Member):

I don't think we should do that: once the Lakebase project (branch) is added as a resource to an app, the Apps runtime creates the proper roles with the proper permissions. In this PR you already corrected the agent to start again from the Scaffolding section if the resource is not in `databricks.yml` (that's a very good addition).

But let's not try to work around the whole app resource mechanism. An app that uses Lakebase must define the project as an app resource; this is a strict prerequisite here.
