Feat/hetzner deploy design#3

Open
aleksei-okatiev wants to merge 21 commits into develop from feat/hetzner-deploy-design
@aleksei-okatiev

This is just a dirty attempt to adapt browserkube to agentic QA needs, so that it can connect to tooling.

aleksei-okatiev and others added 21 commits May 4, 2026 13:46
Captures the agreed design for running browserkube on a single Hetzner
Cloud VM beside testinator's docker-compose stack: k3s with ingress-nginx
on hostPort 8080, build-on-host image pipeline imported into containerd,
and a path forward to TLS / public auth without rebuilding the cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18 bite-sized tasks: scaffold deploy/hetzner/, write k3s/ingress/ufw
bootstrap scripts, write Helm values + Makefile + README, run end-to-end
on the Hetzner box including testinator tooling reconfig and smoke test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…already echo, comment)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…al ingress hosts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chart resolves images via .<component>.tag | default .Chart.AppVersion,
so the global --set tag=... in Task 5/6 was a no-op. Update Task 6 to
add a values-key column to the IMAGES table and generate per-component
--set <key>.tag=<sha> flags. Update Task 5 verification expectations
and add a git-archive workaround for local validation when the working
tree carries broken in-flight chart edits.
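
The per-component flag generation described above can be sketched as follows. This is a minimal Go illustration, not the Makefile code from the commit; the helper name `helmSetFlags` and the example SHA are hypothetical, and the values keys are taken from the chart keys mentioned elsewhere in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// helmSetFlags returns one "--set <key>.tag=<sha>" pair per values key,
// because the chart reads .<component>.tag rather than a global tag.
func helmSetFlags(valuesKeys []string, sha string) []string {
	flags := make([]string, 0, 2*len(valuesKeys))
	for _, key := range valuesKeys {
		flags = append(flags, "--set", fmt.Sprintf("%s.tag=%s", key, sha))
	}
	return flags
}

func main() {
	// Hypothetical SHA, for illustration only.
	keys := []string{"playwrightFirefox", "playwrightChromium", "playwrightWebkit"}
	fmt.Println(strings.Join(helmSetFlags(keys, "abc1234"), " "))
}
```

A single global `--set tag=...` has no counterpart in this scheme, which is why it was a no-op.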

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The IMAGES table was widened to 5 columns (added values-key) in the
plan revision, but _build_one's awk stride was left at i+=4. That made
`make image-<X>` fail for everything except the first row (browserkube).
Tested with sidecar (row 2) and playwright-webkit (row 14): both now
resolve correctly.
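
The stride bug can be reproduced in a few lines. The sketch below is in Go rather than awk, and the table contents (column order, the `backend/Dockerfile.sidecar` path) are hypothetical; only the shape of the problem (five columns read with a stride of four) comes from the commit.

```go
package main

import "fmt"

// findRow scans a flattened table in steps of stride and returns the row
// whose first field equals name.
func findRow(fields []string, stride int, name string) ([]string, bool) {
	for i := 0; i+stride <= len(fields); i += stride {
		if fields[i] == name {
			return fields[i : i+stride], true
		}
	}
	return nil, false
}

func main() {
	// Hypothetical five-column rows: name, context, dockerfile, image, values-key.
	images := []string{
		"browserkube", "backend", "backend/Dockerfile", "browserkube", "backend",
		"sidecar", "backend", "backend/Dockerfile.sidecar", "sidecar", "sidecar",
	}
	_, foundOld := findRow(images, 4, "sidecar") // stale stride misses row 2
	row, foundNew := findRow(images, 5, "sidecar")
	fmt.Println(foundOld, foundNew, row)
}
```

With a stride of 4 the first row still matches because it starts at offset 0, which is consistent with only `browserkube` building before the fix.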

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
helm lint fails on pre-existing chart errors (Chart.yaml apiVersion
mismatch, browserkube-quotas.yaml empty name) that aren't this work's
to fix. With lint as a deploy prereq, `make deploy` would always fail
before reaching helm upgrade. Drop lint from deploy's deps; keep it as
a standalone advisory target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In-flight playwright/MCP integration, committed onto the Hetzner deploy
branch so the chart renders against our locally-built playwright-firefox
/ chromium / webkit images instead of upstream quay.io/browser/* (which
predate browser.bind() and break MCP attach).

Changes:
- cmds/playwright-{firefox,chromium,webkit}/: new bind-server.js images.
  Each runs `<engine>.launch()` once, calls `browser.bind()`, and proxies
  the random-port endpoint through a stable :4444/ surface.
- helm/charts/browserkube/templates/browserset.yaml: BrowserSet's
  `playwright:` section now reads .Values.playwright{Firefox,Chromium,
  Webkit} instead of hardcoding quay.io/browser/playwright-*. Quoting
  fixed: outer single-quotes around `'{{ ... | default "..." }}'` to
  avoid YAML's nested-double-quote parse error.
- helm/charts/browserkube/values.yaml: added playwrightFirefox/Chromium/
  Webkit blocks; backend tweaks for the new flow.
- backend/cmd/browserkube/internal/playwright/{handler,proxy}.go:
  switch to stdlib httputil.ReverseProxy (Go 1.20+ handles WS upgrade
  cleanly); recording middlewares moved behind PLAYWRIGHT_RECORD env.
- backend/cmd/browserkube/internal/api/handler.go: long-lived
  /playwright-server/{sessionID} attach endpoint that does NOT delete
  the pod on disconnect (fixes one-shot tear-down).
- backend/cmd/sidecar/{main,sidecar_plugin,cdp_relay}.go: new CDP relay
  for Chromium DevTools forwarding.
- operator/api/v1/browser_types.go + zz_generated + CRDs: BrowserConfig
  field on Browser CR; pod_utils mounts a per-session ConfigMap and sets
  BROWSER_ENGINE on the bind-server container.
- skaffold.yaml + Taskfile.yaml: build the three new playwright images.
- .gitignore: exclude backend/sidecar (build artifact).
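
For readers unfamiliar with the stdlib proxy mentioned in the handler/proxy change, here is a minimal sketch of the approach. It is not the code from this PR: `newBrowserProxy`, the pod IP, and the mux wiring are assumptions for illustration; only `httputil.ReverseProxy`, the `:4444` port, and the `/playwright-server/` path prefix come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newBrowserProxy builds a reverse proxy to one browser pod's bind-server.
// The stdlib ReverseProxy also forwards protocol upgrades such as WebSocket.
func newBrowserProxy(upstream string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, fmt.Errorf("parse upstream %q: %w", upstream, err)
	}
	if target.Scheme == "" || target.Host == "" {
		return nil, fmt.Errorf("upstream %q needs a scheme and host", upstream)
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

func main() {
	// Hypothetical pod address; the real handler would look it up per session.
	proxy, err := newBrowserProxy("http://10.42.0.17:4444")
	if err != nil {
		panic(err)
	}
	mux := http.NewServeMux()
	mux.Handle("/playwright-server/", proxy)
	fmt.Println("proxy handler registered")
	// http.ListenAndServe(":8080", mux) would start serving; omitted so the sketch exits.
}
```

Note that `NewSingleHostReverseProxy` forwards the request path unchanged, so a real handler would still need to strip or rewrite the session prefix.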

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced by the first real `make images` run on the Hetzner box:

1. `.PHONY: $(addprefix image-,$(IMAGE_NAMES))` declared image-<name>
   targets explicitly. GNU Make treats those as separate empty targets
   and the pattern rule `image-%` no longer fires for them, so
   `make images` was a no-op for every component. Drop the explicit
   .PHONY for image-* and rely on the pattern rule alone.

2. `_build_one` constructed `-f $(REPO_ROOT)/$$df` even when df was a
   bare `Dockerfile` (no slash) and the actual path is
   `$(REPO_ROOT)/$$ctx/Dockerfile`. Add a case to resolve the dockerfile
   path either as repo-relative (when df contains a slash, e.g.
   `backend/Dockerfile`, `cmds/clipboard/Dockerfile`) or as
   context-relative (when df is just `Dockerfile`).

After these fixes, all 14 images built and imported into k3s containerd
on the test VM.
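
The dockerfile path rule from fix 2 can be sketched as below. The commit implements it as a shell `case` inside `_build_one`; this Go version only illustrates the same decision, and the function name and the `operator` context are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveDockerfile treats df as repo-relative when it contains a slash,
// and as relative to the build context when it is a bare file name.
func resolveDockerfile(repoRoot, ctx, df string) string {
	if strings.Contains(df, "/") {
		return path.Join(repoRoot, df)
	}
	return path.Join(repoRoot, ctx, df)
}

func main() {
	fmt.Println(resolveDockerfile("/repo", "cmds/clipboard", "cmds/clipboard/Dockerfile"))
	fmt.Println(resolveDockerfile("/repo", "operator", "Dockerfile")) // hypothetical context
}
```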

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues surfaced during the first real `make deploy` on the Hetzner
box:

1. The Helm chart does not include CRDs (Browser, BrowserSet,
   SessionResult). They live in operator/config/crd/bases/ and have to
   be applied separately. Add a `crds` target that does
   `kubectl apply -f operator/config/crd/bases/` and make `deploy`
   depend on it.

2. The chart's blob storage defaults to RustFS-backed S3
   (blob.rustfs.enabled: true, BLOB_URL pointing at rustfs-svc).
   Without RustFS deployed, the backend crashes at startup. Override
   `blob.rustfs.enabled: false` and point blob.url + blob.archive.url
   at file:///tmp/... so the backend writes to a local filesystem
   inside the pod (ephemeral; sufficient for smoke testing).

3. The UI deployment template hardcodes `replicas: 1`, so
   `--set ui.replicaCount=0` is silently ignored — the UI pod always
   runs. Document the manual `kubectl scale` workaround in README.

Also document the CRD lifecycle (CRDs outlive `helm uninstall`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today the operator adds x-server and vnc-server together, both gated on
EnableVNC. That couples two unrelated concerns:
- x-server (xvfb) is required for headed browsers — bind-server.js
  launches `browserType.launch({ headless: false })`, which exits with
  "Missing X server or $DISPLAY" without xvfb.
- vnc-server is for human debugging via the noVNC viewer — MCP-driven
  Playwright clients never use it.

Split them:
- x-server is now added unconditionally so headed browsers always have
  a display. DISPLAY env on the browser container is unconditional too.
- vnc-server is the only thing gated on EnableVNC.

Also let the create-session API caller opt out of VNC: add
CreateBrowserRequest.EnableVNC (*bool, optional). Defaults to true
when omitted to preserve backwards compatibility — clients that don't
need a debug VNC stream (e.g. testinator-tooling's MCP adapter) send
"enableVNC": false to drop the sidecar.

Result: tooling sessions get a 4-container pod (browser + sidecar +
clipboard + x-server) instead of 5. Saves one container per session.
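
The opt-out semantics rely on the usual Go pattern of a pointer field to tell "omitted" apart from "false". A minimal sketch, assuming the JSON key `enableVNC` quoted above; the `vncEnabled` and `decode` helpers are hypothetical, and the real request type has other fields not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CreateBrowserRequest shows only the field discussed above.
type CreateBrowserRequest struct {
	EnableVNC *bool `json:"enableVNC,omitempty"`
}

// vncEnabled defaults to true when the caller omitted the field.
func (r CreateBrowserRequest) vncEnabled() bool {
	if r.EnableVNC == nil {
		return true
	}
	return *r.EnableVNC
}

// decode parses a request body into a CreateBrowserRequest.
func decode(body string) (CreateBrowserRequest, error) {
	var req CreateBrowserRequest
	err := json.Unmarshal([]byte(body), &req)
	return req, err
}

func main() {
	for _, body := range []string{`{}`, `{"enableVNC": true}`, `{"enableVNC": false}`} {
		req, err := decode(body)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> vnc=%v\n", body, req.vncEnabled())
	}
}
```

A plain `bool` would not work here, since an omitted field would decode to `false` and silently drop VNC for existing clients.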

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scheduler reservation was 1 CPU / 3 GiB per browser pod, but observed
usage on testinator workloads is ~480 MiB RSS / ~350m CPU per Chromium.
The 3 GiB request meant a 16 GiB box could only schedule 4 concurrent
browser pods even with plenty of free RAM (5–6 GiB sitting idle).

Drop request to 0.5 CPU / 1 GiB; keep the 4 CPU / 4 GiB limit so heavy
SPAs can still burst. Now ~10 concurrent browsers fit on the same VM
without changing the actual memory consumption pattern.
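
A back-of-envelope check of the numbers above, comparing what the scheduler reserves with what the browsers were observed to use. It uses only figures from this commit message; it ignores the other containers in the pod and the rest of the host, so it is a rough illustration rather than a capacity model.

```go
package main

import "fmt"

const (
	oldRequestMiB  = 3 * 1024 // previous memory request per browser pod
	newRequestMiB  = 1 * 1024 // new memory request per browser pod
	observedRSSMiB = 480      // observed RSS per Chromium, from the commit message
)

// totalMiB multiplies a per-pod figure by the number of pods.
func totalMiB(pods, perPodMiB int) int {
	return pods * perPodMiB
}

func main() {
	fmt.Println("old, 4 pods: reserved", totalMiB(4, oldRequestMiB), "MiB, observed", totalMiB(4, observedRSSMiB), "MiB")
	fmt.Println("new, 10 pods: reserved", totalMiB(10, newRequestMiB), "MiB, observed", totalMiB(10, observedRSSMiB), "MiB")
}
```

Four pods at the old request reserve 12288 MiB while using roughly 1920 MiB; ten pods at the new request reserve 10240 MiB while using roughly 4800 MiB.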

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bind-server.js calls browserType.launch() immediately on startup, with
`headless: false` requiring a working DISPLAY. The x-server runs as a
sibling container in the same pod and starts in parallel — k8s gives
no ordering guarantee. When the browser container starts a moment
ahead of x-server, chromium/firefox/webkit exits with:

    Missing X server or $DISPLAY
    The platform failed to initialize.  Exiting.

Most pods don't hit this because xvfb is fast, but under memory
pressure (more concurrent pods, slower scheduling) the race window
widens. Observed two failures during 8-parallel load testing.

Fix: before launch, poll for /tmp/.X11-unix/X<n> with a 30 s deadline.
Same change for all three playwright bind-server images
(chromium/firefox/webkit).
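
The polling fix can be sketched as follows. The actual change lives in bind-server.js (JavaScript); this Go version only illustrates the same wait-with-deadline technique. The helper names, the polling interval, and the display number `:99` are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// x11SocketPath maps a DISPLAY value such as ":99" or ":0.0" to the
// corresponding socket path under /tmp/.X11-unix.
func x11SocketPath(display string) string {
	n := display[strings.LastIndex(display, ":")+1:]
	if dot := strings.Index(n, "."); dot >= 0 {
		n = n[:dot]
	}
	return "/tmp/.X11-unix/X" + n
}

// waitForPath polls until p exists or the timeout elapses.
func waitForPath(p string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(p); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s did not appear within %s", p, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	sock := x11SocketPath(":99") // hypothetical display number
	// The commit uses a 30 s deadline; a short one keeps this sketch quick.
	if err := waitForPath(sock, 200*time.Millisecond, 50*time.Millisecond); err != nil {
		fmt.Println("not launching browser:", err)
		return
	}
	fmt.Println("display ready:", sock)
}
```

Checking for the socket file only shows that the X server has created it, so a stricter variant could also attempt a connection before launching.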

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Playwright 1.60.0 went GA today on npm and Microsoft Container Registry
published v1.60.0-noble alongside it. The browser.bind() API we depend
on is present in stable with only a positional-arg rename (name → title),
which doesn't affect our usage.

Switching unlocks two wins:

1. **Image size cut.** Was ~2.6 GB (chromium), ~1.7 GB (firefox/webkit)
   on `node:20-bookworm-slim` + manual apt-get + `playwright install
   --with-deps` + a `cp -r` of the browsers cache to dodge $HOME-path
   issues. Now ~1.4 GB / ~0.9 GB on
   `mcr.microsoft.com/playwright:v1.60.0-noble` — base ships Node 22,
   all three browsers preinstalled at /ms-playwright, all system libs,
   and PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 so the npm `playwright`
   install is metadata-only.

2. **Maintenance posture.** Off the daily-rolling alpha track and onto
   a stable, date-pinnable release. No more risk of the alpha API
   drifting underneath us.

Dropped from each Dockerfile:
  - manual apt-get of libgtk/libnss3/libasound2/libegl1/libgl1/ffmpeg/xauth/…
  - `playwright install --with-deps <browser>` (browsers preinstalled)
  - `cp -r /root/.cache/ms-playwright /.cache/ms-playwright`
    (MCR's /ms-playwright is already accessible to non-root)

Dropped from firefox/package.json:
  - playwright-firefox separate package (the `playwright` umbrella package
    + MCR's preinstalled binary covers it).

Compatibility note: @playwright/mcp@0.0.71 still bundles the 04-27
alpha; mcp hasn't shipped a stable-1.60-paired release yet. Wire
protocol within 1.60.x has held compatible across alphas, so server
on stable + client on late-alpha should work; verify with a smoke
test before promoting in spawner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>