Feat/hetzner deploy design #3
Open
aleksei-okatiev wants to merge 21 commits into
Captures the agreed design for running browserkube on a single Hetzner Cloud VM beside testinator's docker-compose stack: k3s with ingress-nginx on hostPort 8080, build-on-host image pipeline imported into containerd, and a path forward to TLS / public auth without rebuilding the cluster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18 bite-sized tasks: scaffold deploy/hetzner/, write k3s/ingress/ufw bootstrap scripts, write Helm values + Makefile + README, run end-to-end on the Hetzner box including testinator tooling reconfig and smoke test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…already echo, comment) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…al ingress hosts) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chart resolves images via .<component>.tag | default .Chart.AppVersion, so the global --set tag=... in Task 5/6 was a no-op. Update Task 6 to add a values-key column to the IMAGES table and generate per-component --set <key>.tag=<sha> flags. Update Task 5 verification expectations and add a git-archive workaround for local validation when the working tree carries broken in-flight chart edits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
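The tag resolution and per-component flag generation described in this commit can be sketched as follows. This is only an illustration in JavaScript, not the Makefile or chart template the commit edits; `resolveTag`, `tagFlags`, the `backend` key, and the sample SHA are hypothetical, and only the `.<component>.tag | default .Chart.AppVersion` lookup and the `--set <key>.tag=<sha>` flag shape come from the commit message.

```javascript
// Mirrors `.<component>.tag | default .Chart.AppVersion`: Helm's `default`
// falls back when the value is unset or empty, so `||` is used here.
function resolveTag(componentValues, appVersion) {
  return (componentValues && componentValues.tag) || appVersion;
}

// Hypothetical helper: one `--set <key>.tag=<sha>` pair per values key.
// A single global `--set tag=<sha>` is never read by resolveTag above.
function tagFlags(valuesKeys, sha) {
  return valuesKeys.flatMap((key) => ["--set", `${key}.tag=${sha}`]);
}

// Sample keys for illustration; the real ones come from the IMAGES table's
// values-key column.
console.log(tagFlags(["backend", "playwrightFirefox"], "abc1234").join(" "));
```

Passing the result to `helm upgrade` as extra arguments is what makes each component pick up the freshly built image instead of the chart's app version.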
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The IMAGES table was widened to 5 columns (added values-key) in the plan revision, but _build_one's awk stride was left at i+=4. That made `make image-<X>` fail for everything except the first row (browserkube). Tested with sidecar (row 2) and playwright-webkit (row 14): both now resolve correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
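The stride mismatch can be reproduced with a small sketch. It is written in JavaScript rather than the awk used by `_build_one`, and the column order and field values below are invented placeholders; only the widening from 4 to 5 columns and the three image names come from the commit message.

```javascript
// Hypothetical flat table with 5 fields per row; the first field is the
// image name. All other field values are placeholders.
const IMAGES = [
  "browserkube",       "ctx-a", "df-a", "repo-a", "key-a",
  "sidecar",           "ctx-b", "df-b", "repo-b", "key-b",
  "playwright-webkit", "ctx-c", "df-c", "repo-c", "key-c",
];

// Walk the table `stride` fields at a time and return the matching row.
function findRow(table, name, stride) {
  for (let i = 0; i + stride <= table.length; i += stride) {
    if (table[i] === name) return table.slice(i, i + stride);
  }
  return null;
}

console.log(findRow(IMAGES, "sidecar", 4)); // stale stride: row is never visited
console.log(findRow(IMAGES, "sidecar", 5)); // matching stride: row is found
```

With the stale stride the walk only lands on a row start at index 0, which is why the first row kept working while later rows did not.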
helm lint fails on pre-existing chart errors (Chart.yaml apiVersion mismatch, browserkube-quotas.yaml empty name) that aren't this work's to fix. With lint as a deploy prereq, `make deploy` would always fail before reaching helm upgrade. Drop lint from deploy's deps; keep it as a standalone advisory target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In-flight playwright/MCP integration, committed onto the Hetzner deploy
branch so the chart renders against our locally-built playwright-firefox
/ chromium / webkit images instead of upstream quay.io/browser/* (which
predate browser.bind() and break MCP attach).
Changes:
- cmds/playwright-{firefox,chromium,webkit}/: new bind-server.js images.
Each runs `<engine>.launch()` once, calls `browser.bind()`, and proxies
the random-port endpoint through a stable :4444/ surface.
- helm/charts/browserkube/templates/browserset.yaml: BrowserSet's
`playwright:` section now reads .Values.playwright{Firefox,Chromium,
Webkit} instead of hardcoding quay.io/browser/playwright-*. Quoting
fixed: outer single-quotes around `'{{ ... | default "..." }}'` to
avoid YAML's nested-double-quote parse error.
- helm/charts/browserkube/values.yaml: added playwrightFirefox/Chromium/
Webkit blocks; backend tweaks for the new flow.
- backend/cmd/browserkube/internal/playwright/{handler,proxy}.go:
switch to stdlib httputil.ReverseProxy (Go 1.20+ handles WS upgrade
cleanly); recording middlewares moved behind PLAYWRIGHT_RECORD env.
- backend/cmd/browserkube/internal/api/handler.go: long-lived
/playwright-server/{sessionID} attach endpoint that does NOT delete
the pod on disconnect (fixes one-shot tear-down).
- backend/cmd/sidecar/{main,sidecar_plugin,cdp_relay}.go: new CDP relay
for Chromium DevTools forwarding.
- operator/api/v1/browser_types.go + zz_generated + CRDs: BrowserConfig
field on Browser CR; pod_utils mounts a per-session ConfigMap and sets
BROWSER_ENGINE on the bind-server container.
- skaffold.yaml + Taskfile.yaml: build the three new playwright images.
- .gitignore: exclude backend/sidecar (build artifact).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced by the first real `make images` run on the Hetzner box:
1. `.PHONY: $(addprefix image-,$(IMAGE_NAMES))` declared image-<name> targets explicitly. GNU Make treats those as separate empty targets and the pattern rule `image-%` no longer fires for them, so `make images` was a no-op for every component. Drop the explicit .PHONY for image-* and rely on the pattern rule alone.
2. `_build_one` constructed `-f $(REPO_ROOT)/$$df` even when df was a bare `Dockerfile` (no slash) and the actual path is `$(REPO_ROOT)/$$ctx/Dockerfile`. Add a case to resolve the dockerfile path either as repo-relative (when df contains a slash, e.g. `backend/Dockerfile`, `cmds/clipboard/Dockerfile`) or as context-relative (when df is just `Dockerfile`).
After these fixes, all 14 images built and imported into k3s containerd on the test VM.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
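The dockerfile path rule from the second fix can be sketched like this. The real change is a shell `case` inside the Makefile; the JavaScript function name `resolveDockerfile`, the `/repo` root, and the assumption that `cmds/playwright-firefox` uses a bare `Dockerfile` are illustrative only.

```javascript
const path = require("node:path");

// Repo-relative when `df` contains a slash, context-relative otherwise.
function resolveDockerfile(repoRoot, ctx, df) {
  return df.includes("/")
    ? path.posix.join(repoRoot, df)
    : path.posix.join(repoRoot, ctx, df);
}

// `df` with a slash resolves against the repo root.
console.log(resolveDockerfile("/repo", "backend", "backend/Dockerfile"));
// Bare `Dockerfile` resolves inside the build context (assumed example).
console.log(resolveDockerfile("/repo", "cmds/playwright-firefox", "Dockerfile"));
```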
Three issues surfaced during the first real `make deploy` on the Hetzner box:
1. The Helm chart does not include CRDs (Browser, BrowserSet, SessionResult). They live in operator/config/crd/bases/ and have to be applied separately. Add a `crds` target that does `kubectl apply -f operator/config/crd/bases/` and make `deploy` depend on it.
2. The chart's blob storage defaults to RustFS-backed S3 (blob.rustfs.enabled: true, BLOB_URL pointing at rustfs-svc). Without RustFS deployed, the backend crashes at startup. Override `blob.rustfs.enabled: false` and point blob.url + blob.archive.url at file:///tmp/... so the backend writes to a local filesystem inside the pod (ephemeral; sufficient for smoke testing).
3. The UI deployment template hardcodes `replicas: 1`, so `--set ui.replicaCount=0` is silently ignored; the UI pod always runs. Document the manual `kubectl scale` workaround in README.
Also document the CRD lifecycle (CRDs outlive `helm uninstall`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today the operator adds x-server and vnc-server together, both gated on
EnableVNC. That couples two unrelated concerns:
- x-server (xvfb) is required for headed browsers — bind-server.js
launches `browserType.launch({ headless: false })`, which exits with
"Missing X server or $DISPLAY" without xvfb.
- vnc-server is for human debugging via the noVNC viewer — MCP-driven
Playwright clients never use it.
Split them:
- x-server is now added unconditionally so headed browsers always have
a display. DISPLAY env on the browser container is unconditional too.
- vnc-server is the only thing gated on EnableVNC.
Also let the create-session API caller opt out of VNC: add
CreateBrowserRequest.EnableVNC (*bool, optional). Defaults to true
when omitted to preserve backwards compatibility — clients that don't
need a debug VNC stream (e.g. testinator-tooling's MCP adapter) send
"enableVNC": false to drop the sidecar.
Result: tooling sessions get a 4-container pod (browser + sidecar +
clipboard + x-server) instead of 5. Saves one container per session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
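The defaulting rule and the resulting container set can be sketched as below. The actual change lives in the Go backend and operator (`CreateBrowserRequest.EnableVNC` is a `*bool` there); this JavaScript version, including the function names `resolveEnableVNC` and `podContainers`, is a hypothetical illustration of the behaviour the commit message describes.

```javascript
// Omitted field (undefined or null) defaults to true, matching the
// backwards-compatible default described for the optional *bool.
function resolveEnableVNC(request) {
  return request.enableVNC ?? true;
}

// x-server is always present; only vnc-server depends on the flag.
function podContainers(enableVNC) {
  const containers = ["browser", "sidecar", "clipboard", "x-server"];
  if (enableVNC) containers.push("vnc-server");
  return containers;
}

console.log(podContainers(resolveEnableVNC({})));
console.log(podContainers(resolveEnableVNC({ enableVNC: false })));
```

A client that omits the field keeps the five-container pod, while one that sends `"enableVNC": false` gets the four-container variant.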
Scheduler reservation was 1 CPU / 3 GiB per browser pod, but measured usage on testinator workloads is ~480 MiB RSS / ~350m CPU per chromium. The 3 GiB request meant a 16 GiB box could only schedule 4 concurrent browser pods even with plenty of free RAM (5–6 GiB sitting idle). Drop request to 0.5 CPU / 1 GiB; keep the 4 CPU / 4 GiB limit so heavy SPAs can still burst. Now ~10 concurrent browsers fit on the same VM without changing the actual memory consumption pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
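Why the request size, not real usage, caps the pod count can be shown with a rough capacity calculation. The `allocatable` figures below are invented for illustration and are not measurements from the Hetzner box, so the second result is not expected to match the commit's "~10" exactly; only the two request sizes come from the commit message.

```javascript
// Pods that fit are bounded by whichever resource request runs out first.
function maxPods(allocatable, request) {
  return Math.min(
    Math.floor(allocatable.cpuMilli / request.cpuMilli),
    Math.floor(allocatable.memMiB / request.memMiB)
  );
}

// Hypothetical capacity left for browser pods after the system, k3s and
// other workloads; adjust to the real node's allocatable values.
const allocatable = { cpuMilli: 8000, memMiB: 12 * 1024 };

const oldRequest = { cpuMilli: 1000, memMiB: 3 * 1024 }; // 1 CPU / 3 GiB
const newRequest = { cpuMilli: 500, memMiB: 1024 };      // 0.5 CPU / 1 GiB

console.log(maxPods(allocatable, oldRequest), maxPods(allocatable, newRequest));
```

Limits do not enter the calculation, which is why keeping the 4 CPU / 4 GiB limit leaves burst headroom without reducing how many pods can be scheduled.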
bind-server.js calls browserType.launch() immediately on startup, with
`headless: false` requiring a working DISPLAY. The x-server runs as a
sibling container in the same pod and starts in parallel — k8s gives
no ordering guarantee. When the browser container starts a moment
ahead of x-server, chromium/firefox/webkit exits with:
Missing X server or $DISPLAY
The platform failed to initialize. Exiting.
Most pods don't hit this because xvfb is fast, but under memory
pressure (more concurrent pods, slower scheduling) the race window
widens. Observed two failures during 8-parallel load testing.
Fix: before launch, poll for /tmp/.X11-unix/X<n> with a 30 s deadline.
Same change for all three playwright bind-server images
(chromium/firefox/webkit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
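A minimal sketch of such a wait loop is shown below. It is not the code from the bind-server images; the function name `waitForXSocket`, the 100 ms poll interval, the injectable `exists`/`sleep` options, and the assumption that DISPLAY has the local form `:<n>` are all illustrative. Only the socket path pattern and the 30 s deadline come from the commit message.

```javascript
const fs = require("node:fs");

// Poll for the X11 socket of `display` (assumed form ":<n>" or ":<n>.<screen>")
// until it exists, or reject once the deadline has passed.
async function waitForXSocket(display, opts = {}) {
  const {
    timeoutMs = 30_000, // 30 s deadline, as in the commit message
    intervalMs = 100,   // assumed poll interval
    exists = fs.existsSync,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;

  const n = display.replace(/^:/, "").split(".")[0];
  const socketPath = `/tmp/.X11-unix/X${n}`;
  const deadline = Date.now() + timeoutMs;

  while (!exists(socketPath)) {
    if (Date.now() >= deadline) {
      throw new Error(`X server socket ${socketPath} not ready after ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
  return socketPath;
}
```

A bind server would await this with the value of `process.env.DISPLAY` before calling `browserType.launch()`, so a slow x-server container delays startup instead of crashing the browser.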
Playwright 1.60.0 went GA today on npm and Microsoft Container Registry
published v1.60.0-noble alongside it. The browser.bind() API we depend
on is present in stable with only a positional-arg rename (name → title),
which doesn't affect our usage.
Switching unlocks two wins:
1. **Image size cut.** Was ~2.6 GB (chromium), ~1.7 GB (firefox/webkit)
on `node:20-bookworm-slim` + manual apt-get + `playwright install
--with-deps` + a `cp -r` of the browsers cache to dodge $HOME-path
issues. Now ~1.4 GB / ~0.9 GB on
`mcr.microsoft.com/playwright:v1.60.0-noble` — base ships Node 22,
all three browsers preinstalled at /ms-playwright, all system libs,
and PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 so the npm `playwright`
install is metadata-only.
2. **Maintenance posture.** Off the daily-rolling alpha track and onto
a stable, date-pinnable release. No more risk of the alpha API
drifting underneath us.
Dropped from each Dockerfile:
- manual apt-get of libgtk/libnss3/libasound2/libegl1/libgl1/ffmpeg/xauth/…
- `playwright install --with-deps <browser>` (browsers preinstalled)
- `cp -r /root/.cache/ms-playwright /.cache/ms-playwright`
(MCR's /ms-playwright is already accessible to non-root)
Dropped from firefox/package.json:
- playwright-firefox separate package (the `playwright` umbrella package
+ MCR's preinstalled binary covers it).
Compatibility note: @playwright/mcp@0.0.71 still bundles the 04-27
alpha; mcp hasn't shipped a stable-1.60-paired release yet. Wire
protocol within 1.60.x has held compatible across alphas, so server
on stable + client on late-alpha should work; verify with a smoke
test before promoting in spawner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is just a rough attempt to adapt browserkube to agentic QA needs so that it can connect to tooling.