Durable map of the deployment surface. Update in lock-step with chart changes; stale entries are bugs.
Everything a consumer needs to install MEHO onto a Kubernetes cluster lives
under deploy/. The Helm chart at deploy/charts/meho/ is the single
contract between MEHO and the deployment environment — helm install /
helm upgrade --install consumes it and produces the core Kubernetes
resources that make up a running backplane:
- Deployment: the backplane Pod (FastAPI app, `uvicorn` on port 8000 with `--proxy-headers` so the cluster's TLS-terminating Ingress's `X-Forwarded-Proto: https` survives FastAPI's trailing-slash 307 redirects; trusted-proxy CIDR is operator-configurable via `config.forwardedAllowIps`, see `docs/cross-repo/reverse-proxy-contract.md`).
- Service: ClusterIP front-door for the Deployment, target port `http`.
- Ingress: TLS-enabled external entry with cert-manager annotations.
- ConfigMap: non-secret env (Keycloak URLs, Vault address, pool sizes, `FORWARDED_ALLOW_IPS` for the uvicorn proxy-header trust list).
- ServiceAccount: Pod identity, `automountServiceAccountToken: false`.
- NetworkPolicy: default-deny ingress + explicit egress allow-list to Postgres, Vault, Keycloak, the broadcast subchart, and CoreDNS only.
- Migration Job: `pre-install,pre-upgrade` Helm hook running `python -m meho_backplane.db.migrate` before the Deployment rolls forward.
- Broadcast subchart: in-tree Valkey 9.x Deployment + Service + ConfigMap per ADR 0005.
deploy/charts/meho/
├── Chart.yaml                 # apiVersion v2, kubeVersion >=1.28, dependencies: broadcast
├── .helmignore                # standard exclusions
├── values.yaml                # safe-by-default; required fields are blank
├── values.schema.json         # draft-07 typed contract; rejects typos and empty required fields
├── templates/
│   ├── _helpers.tpl           # name / fullname / labels / SA helpers
│   ├── deployment.yaml        # backplane Pod + probes (/healthz, /ready)
│   ├── service.yaml           # ClusterIP :8000
│   ├── ingress.yaml           # TLS + cert-manager
│   ├── configmap.yaml         # non-secret env
│   ├── serviceaccount.yaml    # Pod identity
│   ├── networkpolicy.yaml     # default-deny + explicit egress (broadcast egress conditional)
│   ├── migration-job.yaml     # pre-install/pre-upgrade Helm hook (alembic upgrade head)
│   └── NOTES.txt              # post-install hints
└── charts/
    └── broadcast/             # in-tree Valkey 9.x subchart (ADR 0005)
        ├── Chart.yaml
        ├── values.yaml
        └── templates/
            ├── _helpers.tpl
            ├── deployment.yaml    # single-replica Recreate; readonly rootfs + emptyDir /data
            ├── service.yaml       # ClusterIP :6379 (port name "redis")
            └── configmap.yaml     # minimal valkey.conf (no auth, no persistence)
- `apiVersion: v2` is required for Helm 3 / 4.
- `name: meho-chart` is the OCI artefact basename. `helm push` derives the published package path (`ghcr.io/evoila/meho-chart`) from this field. The chart is named `meho-chart` rather than `meho` so the GHCR package stays distinct from the backplane image package at `ghcr.io/evoila/meho`; visibility, retention, and signing identities are managed independently on each package. To preserve the existing resource-label invariant (`app.kubernetes.io/name: meho`) the chart sets `nameOverride: meho` in `values.yaml`; the rename is purely a publish-coordinate concern.
- `version` is the chart version (calver-bumped by `.github/workflows/chart.yml` to `0.1.YYYYMMDD-<short-sha>` on main pushes, plain semver on `v*` tag pushes); `appVersion` is the application version, overridden by the same workflow to the git sha being deployed. The values shipped in `Chart.yaml` are placeholders; they exist only so `helm lint` / `helm template` succeed on a fresh checkout.
- `kubeVersion: ">=1.28.0-0"` matches Goal #11's RKE2 target. The manifests use only API versions that have been stable since 1.19; the floor is set higher than strictly required to align with the test bed.
- `sources` + `maintainers` + `keywords` follow Artifact Hub norms for discoverability now that the OCI publish workflow has landed.
The Deployment renders {{ .Values.image.repository }}:{{ .Values.image.tag }}
with no .Chart.AppVersion fallback. values.schema.json rejects an
empty image.tag (minLength: 1), so every install pins an
operator-supplied tag at helm install / helm upgrade time. Goal #11's
deploy discipline forbids moving references (including a chart-appVersion
shadow), and the chart enforces that contract at the schema layer rather
than relying on consumers to remember to --set image.tag.
image.repository accepts the lowercase OCI grammar including an optional
:<port> segment after the host, so private registries like
registry.example.com:5000/team/meho are valid. The pattern enforces the
shape; the operator picks the value.
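As a rough illustration of what "enforces the shape" means, the check can be sketched in Python. The regular expression below is an assumption written for this example, not the pattern shipped in `values.schema.json`, and `is_valid_repository` is a hypothetical helper name.

```python
import re

# Hypothetical pattern (not copied from values.schema.json):
# lowercase host, optional :<port>, then one or more lowercase path segments.
REPOSITORY_RE = re.compile(
    r"[a-z0-9]+(?:[.-][a-z0-9]+)*"           # registry host
    r"(?::[0-9]{1,5})?"                      # optional :<port>
    r"(?:/[a-z0-9]+(?:[._-][a-z0-9]+)*)+"    # repository path segments
)


def is_valid_repository(repository: str) -> bool:
    """Return True when the whole string matches the sketch pattern."""
    return REPOSITORY_RE.fullmatch(repository) is not None


print(is_valid_repository("registry.example.com:5000/team/meho"))  # True
print(is_valid_repository("ghcr.io/evoila/meho:latest"))           # False
```

The second call fails because a tag is not part of the repository grammar; in the chart the tag travels separately in `image.tag`.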
imagePullSecrets is a values-configurable list, empty by default. The
backplane image is pushed to public GHCR (Goal #11's locked artefact-
distribution principle), so no pull secret is required in the default
deployment path. Consumers mirroring through a private registry override
image.repository to point at their mirror and (if needed) populate
imagePullSecrets.
The Deployment always renders livenessProbe and readinessProbe against
the backplane chassis endpoints from G2.1-T2
(backend/src/meho_backplane/health.py):
| Probe | Endpoint | Failure semantics | Default timings (operator-tunable) |
|---|---|---|---|
| `livenessProbe` | `/healthz` (always 200 if the process is up) | Pod restarts on failure | `initialDelaySeconds: 30`, `periodSeconds: 10`, `timeoutSeconds: 1`, `failureThreshold: 3` |
| `readinessProbe` | `/ready` (200 only when every registered probe in the readiness registry passes; 503 with an empty registry at the chassis stage) | Pod removed from Service endpoints, no restart | `initialDelaySeconds: 5`, `periodSeconds: 5`, `timeoutSeconds: 2`, `failureThreshold: 3` |
The 30-second liveness initialDelaySeconds gives the FastAPI app time to
import, build the JWKS cache, and bind structlog context before the first
check — under-provisioning it would restart-loop the Pod during slow image
pulls or cold-start library imports. The shorter readiness window (15s
total detection) makes the Pod fall out of rotation promptly when a
downstream dependency goes flaky, without triggering an unnecessary
restart of the backplane process itself.
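The "15s total detection" figure is `periodSeconds × failureThreshold`. A minimal Python sketch of that arithmetic, using the default timings from the table (the `ProbeTimings` class is a hypothetical name for this example, and the calculation ignores `timeoutSeconds`, so it is an approximation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTimings:
    initial_delay_seconds: int
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def detection_window(self) -> int:
        """Approximate seconds of consecutive failures before Kubernetes acts."""
        return self.period_seconds * self.failure_threshold


# Chart defaults as listed in the probe table.
LIVENESS = ProbeTimings(30, 10, 1, 3)
READINESS = ProbeTimings(5, 5, 2, 3)

print(READINESS.detection_window())  # 15
print(LIVENESS.detection_window())   # 30
```

With the defaults, a flaky dependency pulls the Pod out of rotation after roughly 15 seconds, while a restart needs roughly 30 seconds of failed liveness checks.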
Probes are always on — there is no enabled: false escape valve.
Disabling probes would mask startup deadlocks and let an unready Pod
accept traffic; that tradeoff is never the right call for a governance
backplane. Every field under probes.liveness.* and probes.readiness.*
in values.yaml is operator-tunable for environments that need different
timings.
The /ready endpoint returns 503 by design until G2.2 (Vault /
Keycloak probes) and G2.3 (Alembic migration probe) register concrete
probes — that's the fail-closed chassis state, not a bug. During chassis-
stage dev installs the readinessProbe will hold the Pod out of rotation
until those probes land; that's the intended signal.
The Pod runs with runAsNonRoot: true, runAsUser: 1001,
seccompProfile.type: RuntimeDefault; the container drops every
capability, disallows privilege escalation, and mounts the root filesystem
read-only. /tmp is mounted as an emptyDir for libraries that insist on
a writable tempdir. These defaults match cert-manager, ArgoCD, and Flux,
and they're the minimum surface that admits a Pod under the cluster's
restricted PodSecurity profile.
networkPolicy.enabled: true ships a default-deny policy with explicit
egress rules to:
- Postgres: `tcp/5432` to `networkPolicy.postgresCIDR`
- Vault: `tcp/8200` to `networkPolicy.vaultCIDR`
- Keycloak: `tcp/443` to `networkPolicy.keycloakCIDR`
- Broadcast subchart: `tcp/<broadcast.service.port>` (default 6379) to a `podSelector` matching the in-cluster broadcast subchart's selector labels (`app.kubernetes.io/name: broadcast`); the rule is conditional on `broadcast.enabled: true` and is omitted when the broadcast subchart is disabled
- DNS: `udp/53` to the `k8s-app: kube-dns` selector (matches CoreDNS)
Ingress is permitted only from the namespace whose
kubernetes.io/metadata.name label matches
networkPolicy.ingressControllerNamespace (default ingress-nginx,
RKE2's bundled controller).
The three egress CIDR fields ship empty in values.yaml and are
required-with-shape-validation in the schema when
networkPolicy.enabled: true. The chart will not render with the
default enabled: true without explicit per-environment CIDR overrides
— defense-in-depth against accidentally allowing a wide subnet because
a typo silently fell through.
Operators on clusters running an equivalent mesh-level policy (Istio
AuthorizationPolicy, Cilium CiliumNetworkPolicy, etc.) can set
networkPolicy.enabled: false to skip the chart's NetworkPolicy
entirely; the schema's conditional if/then relaxes the CIDR
requirements in that mode so the values overlay does not need to
populate them. Disabling without a replacement policy in place removes
the chart's least-privilege egress story — only do it when an
equivalent control is enforced upstream.
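The conditional requirement can be pictured with a small Python sketch. It imitates the schema's if/then behaviour using the standard `ipaddress` module; the real contract is the JSON Schema pattern in `values.schema.json`, and the helper names here are hypothetical.

```python
import ipaddress

CIDR_FIELDS = ("postgresCIDR", "vaultCIDR", "keycloakCIDR")


def is_ipv4_cidr(value: str) -> bool:
    """True for strings such as '10.0.1.0/24'; an explicit prefix is required."""
    if "/" not in value:
        return False
    try:
        ipaddress.IPv4Network(value, strict=False)
    except ValueError:
        return False
    return True


def invalid_cidr_fields(network_policy: dict) -> list:
    """List the CIDR fields that would block rendering in this sketch."""
    if not network_policy.get("enabled", True):
        return []  # policy disabled: the CIDR requirements are relaxed
    return [
        field
        for field in CIDR_FIELDS
        if not is_ipv4_cidr(network_policy.get(field) or "")
    ]


# Shipped defaults: enabled, all three CIDRs blank, so all three are rejected.
print(invalid_cidr_fields({"enabled": True}))
# ['postgresCIDR', 'vaultCIDR', 'keycloakCIDR']
```

Setting `enabled` to `False` returns an empty list, mirroring the relaxed mode for clusters that enforce an equivalent mesh-level policy.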
A Kubernetes Job runs as a Helm hook before the Deployment is created
(install) or rolled forward (upgrade). The container executes
python -m meho_backplane.db.migrate — the entrypoint shipped by Task
#29 — which invokes alembic upgrade head against the same
DATABASE_URL Secret the backplane Deployment consumes. The Job uses
the same image as the backplane ({{ .Values.image.repository }}:{{ .Values.image.tag }})
so the migrations applied match exactly the revision the rolling-out
Deployment expects — a separate migration image would drift.
Hook semantics:
| Annotation | Value | Meaning |
|---|---|---|
| `helm.sh/hook` | `pre-install,pre-upgrade` | Runs the Job both on a fresh `helm install` and every `helm upgrade` |
| `helm.sh/hook-weight` | `"-10"` | Runs ahead of any other hook resources (only documentary at the chassis stage; no other hooks ship yet) |
| `helm.sh/hook-delete-policy` | `before-hook-creation,hook-succeeded` | Overwrites the previous Job on retry; GCs the Job once it exits 0. `hook-failed` is intentionally absent, so failed Jobs stay in the namespace for `kubectl logs` forensics |
Pod spec:
- `restartPolicy: OnFailure` retries in-place on transient asyncpg errors without re-scheduling the whole Pod.
- `backoffLimit: 3` (operator-tunable via `.Values.migrationJob.backoffLimit`) catches transient network blips between the Job pod and PostgreSQL. Alembic migrations are idempotent so re-running a partially-applied step is safe.
- `ttlSecondsAfterFinished: 600` (operator-tunable via `.Values.migrationJob.ttlSecondsAfterFinished`) is a Kubernetes-side garbage-collection backstop: even if `helm uninstall` is delayed, the Job + Pod logs are reaped after the configured window (10 minutes by default).
- Same `serviceAccountName`, `imagePullSecrets`, and pod/container `securityContext` as the backplane Deployment (`runAsNonRoot`, `readOnlyRootFilesystem`, `drop: [ALL]`), with `/tmp` mounted as an `emptyDir` to keep the read-only root invariant.
- `envFrom` reuses the backplane's ConfigMap so any Alembic-relevant env vars (pool sizes, timeouts) stay in lock-step; `DATABASE_URL` is pulled from `Values.postgres.credentialsSecret` at the `url` key exactly like the Deployment does.
Failure semantics. When the Job exhausts backoffLimit, Helm fails
the release at the pre-install/pre-upgrade hook step. The Deployment is
never created against an unmigrated schema. The failed Job is left in
the namespace; kubectl logs -n <ns> job/<release>-meho-migrate shows
the Alembic error (rendered to stderr by the runner as
migration_failed: <ExcClass>: <msg>).
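To make the log shape concrete, here is a minimal Python sketch of a runner that reports failures in the `migration_failed: <ExcClass>: <msg>` form. It is not the code in `meho_backplane.db.migrate`; `run_migration` and `format_failure` are hypothetical names, and the `upgrade` callable stands in for whatever invokes `alembic upgrade head`.

```python
import sys
from typing import Callable


def format_failure(exc: BaseException) -> str:
    """Render an exception in the one-line shape described above."""
    return f"migration_failed: {type(exc).__name__}: {exc}"


def run_migration(upgrade: Callable[[], None]) -> int:
    """Run `upgrade`; return 0 on success or 1 after logging the failure."""
    try:
        upgrade()
    except Exception as exc:
        # stderr is what `kubectl logs job/...` surfaces for the failed Pod.
        print(format_failure(exc), file=sys.stderr)
        return 1
    return 0


print(format_failure(ConnectionError("db unreachable")))
# migration_failed: ConnectionError: db unreachable
```

A non-zero exit code is what marks the Job's container as failed, so each failed attempt counts toward `backoffLimit`.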
A custom in-tree Helm subchart deploys Valkey 9.x as the Redis-protocol-compatible activity-broadcast store. ADR 0005 locks the upstream choice and the workload shape; the subchart is the implementation of that decision.
| Aspect | Value | Rationale |
|---|---|---|
| Upstream image | `valkey/valkey:9.0-alpine` (Docker Hub) | BSD-3-Clause; Linux Foundation governance; carries Redis 7.2.4's last permissive license forward |
| Slug | `broadcast` (not `redis`) | The protocol contract matters more than the brand |
| Workload | Deployment (not StatefulSet), single replica | Streams are ephemeral in v0.1; HA via Sentinel/Cluster is v0.2+ |
| Persistence | None (no PVC, `save ""`, `appendonly no`) | Restart-loss of stream history is acceptable in v0.1 |
| Auth | None (no `requirepass`) | v0.1 single-tenant; gated at the network layer by the umbrella chart's NetworkPolicy |
| Update strategy | `Recreate` | Single-replica + port-bind constraint makes RollingUpdate worse |
| Probes | TCP connect on 6379 | Minimal; avoids coupling to `redis-cli` / `valkey-cli` binary naming variance |
| Service | ClusterIP `<release>-broadcast:6379` (port name `redis`) | In-cluster only; backplane consumes via the operator-facing `BROADCAST_REDIS_URL` env |
The subchart lives unpacked at deploy/charts/meho/charts/broadcast/.
The parent Chart.yaml declares it as a dependency with
repository: "" (the documented Helm shape for an unpacked local
subchart — helm dependency update is not required and would fail
trying to fetch from a remote registry). condition: broadcast.enabled
lets operators flip the entire subchart off with a single boolean — for
example, on clusters where an external managed Valkey/Redis (Azure
Cache, AWS ElastiCache, GCP Memorystore) is already available. The
v0.2+ broadcast.externalEndpoint opt-out lands when the broadcast
feature actually carries cross-deployment streams.
Schema interaction with the parent. The umbrella chart's
values.schema.json declares a broadcast property with
additionalProperties: false plus an explicit list of permitted keys,
plus a permissive global property because Helm injects
.Values.global into every subchart's values namespace. Without the
global allowance, helm lint reports
at '/broadcast': additional properties 'global' not allowed. The
subchart's own values.yaml shape is also enforced by Helm independently —
this parent block is the surface visible to the umbrella's --set flags.
Backplane wiring. When broadcast.enabled: true (the default), the
backplane Deployment renders a BROADCAST_REDIS_URL env var pointing at
redis://{{ .Release.Name }}-broadcast:{{ .Values.broadcast.service.port }}/0.
The env-var name is load-bearing: Settings.broadcast_redis_url
(backend/src/meho_backplane/settings.py) resolves from
BROADCAST_REDIS_URL and falls back to redis://localhost:6379 when it
is unset, so a chart that injects any other name (the v0.2 REDIS_URL
mismatch fixed in #583) leaves the readiness probe's broadcast leg
dialing localhost while the healthy Service is never contacted. ADR 0005
locked redis-py as the driver — it parses redis:// schemes against a
Valkey endpoint unchanged (wire-protocol compatibility carries from
Redis 7.2.4).
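The failure mode described above is easy to reproduce in a few lines of Python. This is a sketch of the documented behaviour only, not the actual `Settings` class; `resolve_broadcast_url` and `chart_broadcast_url` are hypothetical helper names.

```python
from collections.abc import Mapping

DEFAULT_BROADCAST_URL = "redis://localhost:6379"


def chart_broadcast_url(release_name: str, port: int = 6379) -> str:
    """URL shape the chart renders for the in-cluster broadcast Service."""
    return f"redis://{release_name}-broadcast:{port}/0"


def resolve_broadcast_url(env: Mapping[str, str]) -> str:
    """Read BROADCAST_REDIS_URL, falling back to localhost when it is unset."""
    return env.get("BROADCAST_REDIS_URL", DEFAULT_BROADCAST_URL)


url = chart_broadcast_url("meho")

# Correct env-var name: the backplane dials the Service.
print(resolve_broadcast_url({"BROADCAST_REDIS_URL": url}))
# redis://meho-broadcast:6379/0

# Wrong env-var name: the value is ignored and localhost is used instead.
print(resolve_broadcast_url({"REDIS_URL": url}))
# redis://localhost:6379
```

The second call shows why the variable name matters: the Service URL is present in the environment but never read.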
Operator-supplied secrets. The Job + the backplane both consume
DATABASE_URL from a Kubernetes Secret named by
postgres.credentialsSecret at key url. The chart references this
Secret by name only — provisioning it is the operator's job. Production
deployments use External Secrets Operator (ESO) to sync the value
from HashiCorp Vault (G2.5-T4 ships the example overlay; G2.6 wires the
ESO ExternalSecret resources). Dev installs may pre-provision the Secret
manually:
kubectl create secret generic meho-postgres \
--from-literal=url='postgresql+asyncpg://meho:<password>@<host>:5432/meho' \
--namespace meho

values.yaml deliberately ships blank for every field the backplane
cannot start without. Operators MUST override these via --set or a
values overlay:
| Field | Why blank |
|---|---|
| `image.tag` | Goal #11 deploy discipline: every install pins an immutable tag, never a moving reference |
| `ingress.host` | Per-environment; no generic placeholder is correct. Required only when `ingress.enabled: true` (the default); relaxed when ingress is disabled |
| `ingress.tls.secretName` | Per-environment Secret name (cert-manager-managed or pre-provisioned). Required only when both `ingress.enabled` and `ingress.tls.enabled` are true |
| `postgres.credentialsSecret` | Per-environment Secret holding `DATABASE_URL` (ESO-synced from Vault in production) |
| `vault.address` | Per-environment Vault endpoint |
| `keycloak.issuer` | Per-environment Keycloak issuer URL |
| `config.keycloakIssuerUrl` / `config.keycloakAudience` / `config.vaultAddr` | ConfigMap env-var mirrors of the above (`backend/src/meho_backplane/settings.py` contract) |
| `config.backplaneUrl` / `config.mcpResourceUri` | G0.8-T4 (#633). Blank by design: for the common ingress-fronted deploy the chart derives `BACKPLANE_URL=https://<ingress.host>` (scheme follows `ingress.tls.enabled`) and `MCP_RESOURCE_URI=${BACKPLANE_URL}/mcp` via the `meho.backplaneUrl` / `meho.mcpResourceUri` helpers, so the `/mcp` audience resolves without operator action. Set explicitly only when the public URL differs from the Ingress host, or for a non-default MCP mount. When neither resolves (no ingress, nothing set) the backend fails loudly at startup with the remediation rather than serving a dark `/mcp` (`_assert_mcp_resource_uri_configured` in `main.py`). The operator must still add a matching Keycloak `oidc-audience-mapper`; see `docs/cross-repo/mcp-client-setup.md` Step 1 |
| `networkPolicy.{postgres,vault,keycloak}CIDR` | Per-environment subnet for each upstream. Required only when `networkPolicy.enabled: true` (the default); relaxed when networkPolicy is disabled |
A blank field trips the typed-schema contract immediately: helm install fails before a single Kubernetes resource is created. The
operator sees the exact missing path (e.g. at '/vault/address': minLength: got 0, want 1) and a single targeted override fixes it.
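The derivation described in the `config.backplaneUrl` / `config.mcpResourceUri` row can be sketched in Python. The chart implements it as Helm template helpers; the functions below are a hypothetical re-statement of the documented rules, and the plain `http` scheme for the TLS-disabled case is an assumption read from "scheme follows `ingress.tls.enabled`".

```python
from typing import Optional


def derive_backplane_url(
    ingress_enabled: bool,
    ingress_host: str,
    tls_enabled: bool,
    explicit: str = "",
) -> Optional[str]:
    """Explicit value wins; otherwise derive from the Ingress host."""
    if explicit:
        return explicit
    if ingress_enabled and ingress_host:
        scheme = "https" if tls_enabled else "http"
        return f"{scheme}://{ingress_host}"
    return None  # nothing resolves: the backend is documented to fail at startup


def derive_mcp_resource_uri(
    backplane_url: Optional[str], explicit: str = ""
) -> Optional[str]:
    """Explicit value wins; otherwise append /mcp to the backplane URL."""
    if explicit:
        return explicit
    if backplane_url:
        return f"{backplane_url}/mcp"
    return None


base = derive_backplane_url(True, "meho.example.org", True)
print(base)                           # https://meho.example.org
print(derive_mcp_resource_uri(base))  # https://meho.example.org/mcp
```

With no Ingress and nothing set explicitly, both functions return `None`, which corresponds to the fail-loud startup path.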
Conservative resource defaults (requests: {cpu: 100m, memory: 256Mi},
limits: {cpu: 1000m, memory: 1Gi}) reflect observed steady-state usage
of the v0.1 chassis (authn/authz traffic + synchronous audit-write fanout);
tune limits up for higher-throughput deployments.
The complete operator-facing values surface. These two tables are the authoritative reference (the README links here rather than duplicating them).
Operator-required (MUST be set; the schema rejects empty defaults):
| Path | Type | Notes |
|---|---|---|
| `image.tag` | string | Immutable tag (`sha-<git-sha>` or `v<x.y.z>`); never `:latest`. |
| `ingress.host` | string (hostname) | External hostname the chart publishes. Required only when `ingress.enabled: true` (default); skipped when ingress is disabled. |
| `ingress.tls.secretName` | string | TLS Secret (cert-manager-managed or pre-provisioned). Required only when both `ingress.enabled` and `ingress.tls.enabled` are true. |
| `postgres.credentialsSecret` | string | Kubernetes Secret holding `DATABASE_URL` at key `url`. |
| `vault.address` | string (uri) | Vault endpoint, e.g. `https://vault.example.org`. |
| `keycloak.issuer` | string (uri) | Keycloak issuer URL (used for `iss` validation + JWKS discovery). |
| `config.keycloakIssuerUrl` | string | ConfigMap mirror of the above; consumed by the backplane env. |
| `config.keycloakAudience` | string | Keycloak client ID fronting the backplane. |
| `config.vaultAddr` | string (uri) | ConfigMap mirror of `vault.address`. |
| `networkPolicy.postgresCIDR` | CIDR (IPv4) | Egress CIDR; pattern-validated. Required only when `networkPolicy.enabled: true` (default). |
| `networkPolicy.vaultCIDR` | CIDR (IPv4) | Same. |
| `networkPolicy.keycloakCIDR` | CIDR (IPv4) | Same. |
Common operator overrides (safe defaults provided; tune as needed):
| Path | Default | Notes |
|---|---|---|
| `replicaCount` | `1` | Single-replica baseline. |
| `image.repository` | `ghcr.io/evoila/meho` | OCI repo from the image pipeline. |
| `image.pullPolicy` | `IfNotPresent` | `Always` \| `IfNotPresent` \| `Never`. |
| `service.type` / `service.port` | `ClusterIP` / `8000` | Service shape. |
| `ingress.className` | `""` | Cluster default IngressClass when empty. |
| `probes.liveness.*` / `probes.readiness.*` | `/healthz` / `/ready` httpGet + tuned timings | Operator-tunable; never disabled. |
| `resources.requests` / `resources.limits` | `100m/256Mi` / `1000m/1Gi` | Conservative chassis baselines. |
| `networkPolicy.ingressControllerNamespace` | `ingress-nginx` | RKE2 default; override per cluster. |
| `audit.postgresOnly` | `true` | Postgres-only audit sink baseline. |
| `broadcast.enabled` | `true` | Deploys the bundled Valkey broadcast subchart. |
| `connectors.enabled` | `[]` | Opt-in list; pick from the shipped connector catalog (see `docs/architecture/connectors.md`: VMware/VCF, NSX, Kubernetes, Vault, Harbor, Keycloak, ArgoCD, GCloud, BIND9, pfSense, and more). |
The chart ships a JSON Schema draft-07 contract for values.yaml
(Helm's supported dialect). Helm validates the merged .Values object
against this schema on:
- `helm lint`
- `helm template`
- `helm install` / `helm install --dry-run`
- `helm upgrade`
Three properties make this the right contract:
- `additionalProperties: false` at every object level. A typo (`postgress` for `postgres`) fails at `helm install` time with the exact path, not silently at first request when the backplane fails to resolve a Vault secret. Helm reports e.g. `at '': additional properties 'postgress' not allowed`.
- `minLength: 1` on every required-but-blank field plus `format: uri` / `format: hostname` / `pattern: …` shape validation on URLs / hostnames / CIDRs. The safe-by-default empty placeholders in `values.yaml` are intentionally rejected, surfacing the exact field the operator must override.
- Subchart compatibility. The umbrella's `properties` map declares a `broadcast` key for the in-tree subchart at `charts/broadcast/`, and the subchart's own `values.schema.json` (if shipped) is also enforced by Helm independently; the parent chart cannot circumvent subchart restrictions. A permissive `broadcast.global` allowance is required because Helm injects `.Values.global` into every subchart's values namespace; omitting it causes `at '/broadcast': additional properties 'global' not allowed`.
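The first two properties can be imitated in a few lines of Python. This is a stripped-down, top-level-only sketch of the two checks, not Helm's validator and not the shipped schema; `schema_violations` and the flat key names are hypothetical.

```python
def schema_violations(values: dict, allowed: set, required_non_empty: set) -> list:
    """Return (kind, key) pairs for unknown keys and blank required fields."""
    problems = []
    # additionalProperties: false -> any key outside the allow-list is rejected.
    for key in sorted(set(values) - allowed):
        problems.append(("additional_property", key))
    # minLength: 1 -> required fields must be present and non-empty.
    for key in sorted(required_non_empty):
        if not values.get(key):
            problems.append(("empty_required", key))
    return problems


allowed = {"image_tag", "postgres"}
values = {"image_tag": "", "postgress": {}}  # blank tag plus a typo'd key

print(schema_violations(values, allowed, {"image_tag"}))
# [('additional_property', 'postgress'), ('empty_required', 'image_tag')]
```

Both problems surface before anything is rendered, which is the property the real schema provides at `helm install` time.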
helm lint against the unmodified values.yaml deliberately fails
with the safe-by-default empty fields. The chart's validate job in
.github/workflows/chart.yml and
deploy/values-examples/values-rdc-example.yaml
both supply the required overrides; ad-hoc lint invocations pass them via
--set or -f.
A sanitized example values file lives at
deploy/values-examples/values-rdc-example.yaml.
It targets the RDC Hetzner dogfooding lab shape (cluster-internal
Postgres + Vault + Keycloak on *.evba.lab, rke2-infra ingress-nginx) and
is structured so that other Vault-+-Keycloak-+-Postgres-shaped labs can
copy it, substitute the <REPLACE: ...> placeholders, and apply it.
The placeholders are deliberate: every site-specific field
(image.tag, the Keycloak realm in config.keycloakIssuerUrl /
keycloak.issuer, the three NetworkPolicy CIDRs) is left as a
<REPLACE: ...> literal that fails the schema's format: uri /
format: hostname / IPv4-CIDR pattern at helm install time. A
forgotten substitution surfaces as at '/networkPolicy/postgresCIDR': '<REPLACE: ...>' does not match pattern … instead of silently rendering
a NetworkPolicy that allows everything (or nothing).
The actual values-rdc.yaml for the dogfooding consumer is environment-
private and lives in
evoila-bosnia/claude-rdc-hetzner-dc's
manifests/meho/values-rdc.yaml per Goal #11 cross-repo deps; the
example here is the public template.
External Secrets Operator (ESO) sync patterns. The chart references
operator-provisioned Kubernetes Secrets by name
(postgres.credentialsSecret and, in v0.2, a Keycloak client-secret
Secret). It does not ship a Secret template or accept secret values
via --set. The recommended sync mechanism is
ESO with the upstream store of the
operator's choice (the RDC lab uses
HashiCorp Vault).
Two resources combine to materialise a chart-consumable Secret:
- `ClusterSecretStore` (cluster-scoped pointer at Vault, carrying the auth credentials). Created once per cluster by the platform team. Owned by the consumer's GitOps repo, not this chart: it outlives any release and embeds cluster-level Vault credentials.
- `ExternalSecret` (namespaced resource that pulls keys out of the upstream store into a target k8s Secret). Two ownership options:
  - Default (consumer-managed): the consumer's GitOps repo applies the ExternalSecret alongside or before the chart. The chart references the resulting Secret by name. This is the RDC convention.
  - Opt-in (chart-managed): flip `eso.enabled: true` in values and the chart renders the ExternalSecret(s) itself via `templates/externalsecrets.yaml`. The schema requires `eso.secretStore.{name,kind}` when `eso.enabled: true`, so a misconfigured opt-in fails at install. With the default `eso.enabled: false`, `helm template ... | grep -c ExternalSecret` returns `0`.
The full ExternalSecret + ClusterSecretStore manifests, the Vault KV
path mapping (secret/meho/postgres → DATABASE_URL, etc.), and the
end-to-end install ordering (ESO → ClusterSecretStore → ExternalSecret →
wait-for-Secret → helm install) are in
deploy/values-examples/README.md.
The recommended flow uses the example values file rather than long
--set strings:
# Copy + substitute the example into your private deploy repo first
# (see deploy/values-examples/README.md).
helm upgrade --install meho ./deploy/charts/meho/ \
--namespace meho \
--create-namespace \
--set image.tag=sha-<git-sha> \
-f values-rdc.yaml

The bare `--set` equivalent (no values file), useful for CI smoke
tests:
helm install meho ./deploy/charts/meho/ \
--namespace meho \
--create-namespace \
--set image.tag=sha-<git-sha> \
--set ingress.host=meho.example.org \
--set ingress.tls.secretName=meho-tls \
--set postgres.credentialsSecret=meho-postgres \
--set vault.address=https://vault.example.org \
--set keycloak.issuer=https://keycloak.example.org/realms/meho \
--set config.keycloakIssuerUrl=https://keycloak.example.org/realms/meho \
--set config.keycloakAudience=meho-backplane \
--set config.vaultAddr=https://vault.example.org \
--set networkPolicy.postgresCIDR=10.0.1.0/24 \
--set networkPolicy.vaultCIDR=10.0.2.0/24 \
--set networkPolicy.keycloakCIDR=10.0.3.0/24

Missing any required override fails the schema validation at install
time with the exact field path — e.g. omitting --set vault.address=...
produces at '/vault/address': '' is not valid uri: relative url. The
backplane never starts against a misconfigured set of values.
# helm lint passes only with a values overlay or `--set` overrides for every
# required-but-blank field; the bare chart deliberately fails-loud:
helm lint deploy/charts/meho/ \
--set image.tag=test \
--set ingress.host=meho.test \
--set ingress.tls.secretName=meho-tls \
--set postgres.credentialsSecret=meho-postgres \
--set vault.address=https://vault.test \
--set keycloak.issuer=https://keycloak.test/realms/meho \
--set config.keycloakIssuerUrl=https://keycloak.test/realms/meho \
--set config.keycloakAudience=meho-backplane \
--set config.vaultAddr=https://vault.test \
--set networkPolicy.postgresCIDR=10.0.0.0/24 \
--set networkPolicy.vaultCIDR=10.0.0.0/24 \
--set networkPolicy.keycloakCIDR=10.0.0.0/24
# Same flags reproduce the render:
helm template test-release deploy/charts/meho/ \
--set image.tag=test \
--set ingress.host=meho.test \
--set ingress.tls.secretName=meho-tls \
--set postgres.credentialsSecret=meho-postgres \
--set vault.address=https://vault.test \
--set keycloak.issuer=https://keycloak.test/realms/meho \
--set config.keycloakIssuerUrl=https://keycloak.test/realms/meho \
--set config.keycloakAudience=meho-backplane \
--set config.vaultAddr=https://vault.test \
--set networkPolicy.postgresCIDR=10.0.1.0/24 \
--set networkPolicy.vaultCIDR=10.0.2.0/24 \
--set networkPolicy.keycloakCIDR=10.0.3.0/24 \
> /tmp/rendered.yaml
grep -c '^kind:' /tmp/rendered.yaml # expect >= 6
# Negative tests — the chart fails-loud on the misuse cases the schema covers:
helm template test deploy/charts/meho/ 2>&1 | grep -E "minLength|valid"
helm template test deploy/charts/meho/ --set bogus.field=x 2>&1 | grep "additional properties"

The chart is packaged, pushed to OCI, and cosign-signed by
.github/workflows/chart.yml. The workflow targets meho-runners-ci (the
project's self-hosted runner pool on the dedicated rke2-ci cluster,
introduced by #160 + #167 on rke2-meho and migrated to rke2-ci via
claude-rdc-hetzner-dc#610 / #715) and shares the
hardening conventions of image.yml (Task #33): job-level fork-PR guard,
SHA-pinned actions with # vX.Y.Z comments, minimum permissions: block
(contents: read, packages: write, id-token: write), per-job
timeout-minutes, and a concurrency: group that cancels stale runs per
ref.
| Trigger | Jobs run | Side effects |
|---|---|---|
| `pull_request` against `main` (same-repo PRs only) | `validate` | Lint + render + kubeconform; no push, no sign |
| `push` to `main` (chart paths) | `validate` -> `publish` -> `verify-anonymous-pull` | Push at calver `0.1.YYYYMMDD-<short-sha>`, cosign-sign, anonymous-pull check |
| `push` of a `v*` tag | `validate` -> `publish` -> `verify-anonymous-pull` | Push at plain semver `<x.y.z>` (leading `v` stripped), cosign-sign, anonymous-pull check |
| Fork PR | (skipped at job level) | None; `head.repo.full_name != github.repository` short-circuits |
The push: block intentionally has no paths: filter — path filtering
applies to both branch and tag pushes when set, which would silently skip
a v* tag annotating a non-chart commit. Releases always publish; the
cost of an occasional chart re-publish on a non-chart main push is
negligible.
Inline Python (with the standard-library + PyYAML on the runner) reads
Chart.yaml, rewrites version and appVersion in place, and re-dumps
the file before helm package runs. The chart's name field stays
meho-chart (set permanently in-tree); only version and appVersion
are workflow-stamped. The post-stamp cat of Chart.yaml lands in the
workflow log so operators can confirm the published metadata.
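The stamping rule can be sketched as a pure function. This is an illustration under stated assumptions, not the workflow's inline script: the 7-character short sha, the `refs/...` input format, and the names `chart_version` and `stamp` are all hypothetical, and the PyYAML load/dump around it is omitted to keep the example standard-library only.

```python
from datetime import date


def chart_version(git_ref: str, git_sha: str, today: date) -> str:
    """Plain semver for v* tags, calver 0.1.YYYYMMDD-<short-sha> otherwise."""
    tag_prefix = "refs/tags/v"
    if git_ref.startswith(tag_prefix):
        return git_ref[len(tag_prefix):]  # leading "v" stripped
    return f"0.1.{today:%Y%m%d}-{git_sha[:7]}"  # short-sha length is assumed


def stamp(chart: dict, git_ref: str, git_sha: str, today: date) -> dict:
    """Return a copy of the Chart.yaml mapping with version fields rewritten."""
    stamped = dict(chart)
    stamped["version"] = chart_version(git_ref, git_sha, today)
    stamped["appVersion"] = git_sha
    return stamped  # "name" is left untouched


sha = "0123456789abcdef0123456789abcdef01234567"
print(chart_version("refs/heads/main", sha, date(2026, 5, 20)))
# 0.1.20260520-0123456
print(chart_version("refs/tags/v1.2.3", sha, date(2026, 5, 20)))
# 1.2.3
```

Keeping the version logic separate from the file rewrite makes it straightforward to check both trigger paths without touching `Chart.yaml`.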
helm push <tgz> oci://ghcr.io/evoila lands the artefact at
ghcr.io/evoila/meho-chart:<version> because Helm derives the basename
from Chart.yaml's name field. The push step parses the Digest: sha256:... line from Helm's stdout and exposes it as a step output;
cosign sign --yes "ghcr.io/evoila/meho-chart@<digest>" then signs the
chart by digest under the same keyless OIDC identity as the image — the
operator-facing verification command shape is identical between the two
packages (one workflow file path differs from the other, matched by the
--certificate-identity-regexp).
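Extracting the digest from the push output is a small parsing step. The workflow's own implementation is not shown here; the Python below is a sketch of the same idea, with `parse_digest` and `cosign_target` as hypothetical helper names and a made-up digest in the sample output.

```python
import re
from typing import Optional

# Matches a line of the form "Digest: sha256:<64 hex chars>".
DIGEST_RE = re.compile(r"^Digest:\s*(sha256:[0-9a-f]{64})\s*$", re.MULTILINE)


def parse_digest(helm_push_output: str) -> Optional[str]:
    """Return the sha256 digest from `helm push` output, or None if absent."""
    match = DIGEST_RE.search(helm_push_output)
    return match.group(1) if match else None


def cosign_target(digest: str) -> str:
    """Digest-pinned reference to hand to `cosign sign`."""
    return f"ghcr.io/evoila/meho-chart@{digest}"


sample = (
    "Pushed: ghcr.io/evoila/meho-chart:0.1.20260520-0123456\n"
    "Digest: sha256:" + "ab" * 32 + "\n"  # placeholder digest for the example
)
digest = parse_digest(sample)
print(cosign_target(digest))
```

Signing by digest rather than by tag means the signature covers the exact artefact that was pushed, even if the tag is later moved.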
A dedicated verify-anonymous-pull job runs after publish on main / tag
pushes. It installs Helm in a fresh job context and intentionally does
not call helm registry login. The job also scrubs any stale
HELM_REGISTRY_CONFIG left over from a prior run on the same self-hosted
runner. helm pull oci://ghcr.io/evoila/meho-chart --version <ver> from
that scrubbed environment can only succeed if the GHCR package is public
— a successful pull is the DoD signal.
GHCR creates a new package PRIVATE by default. The first time
chart.yml pushes to ghcr.io/evoila/meho-chart, the
verify-anonymous-pull job will fail with unauthorized until a
maintainer flips visibility to public once:
gh api --method PATCH /orgs/evoila/packages/container/meho-chart \
-f visibility=public

(Or via the GHCR UI: org -> Packages -> meho-chart -> Package settings ->
Change visibility -> Public.) The workflow itself cannot do this safely
from CI — visibility is org-scoped and changing it from a workflow would
require a PAT with org-admin scope, which the GITHUB_TOKEN lacks. The
image package at ghcr.io/evoila/meho had the same one-time gate
documented in image.yml's header comment.
The published chart's verification commands live in
backend/README.md (sections "Verifying chart
signatures" and "Pulling the chart anonymously"), alongside the image's
equivalent commands so an operator learns one verification pattern for
both artefacts. The workflow itself also emits the verification block
into GITHUB_STEP_SUMMARY on every successful publish.
ci.yml is the central per-PR test harness. Every PR targeting main
runs four jobs in parallel and every push to main re-runs the same
matrix as a regression catch. Branch protection consumes each job's
status as a required check (per
branches/main/protection.required_status_checks.contexts —
re-verified after the 2026-05-20 #698 promotion of the integration
lane, the structural corrective to the v0.2 / G3.4 green-but-hollow
incidents #634 / #697).
ci.yml triggers on merge_group in addition to pull_request and
push. The merge_group event fires when a PR is admitted to the
GitHub merge queue and runs the full check matrix against the
synthesised merge commit — PR head + current main tip + any
PRs ahead in the queue. A merge that would break main fails in the
queue and never reaches main, ending the inherited-red episodes from
2026-05-20/21 where cancelled post-merge CI allowed broken combinations
to land silently.
Merge-queue setup (admin action, separate from this code change):
1. Enable "Require merge queue" in the repository's branch-protection ruleset for main (Settings → Rules → Branches → protect main → add "Require merge queue" rule, or via gh api -X PUT repos/evoila/meho/rulesets/14556458 ...).
2. Configure merge-queue required checks. The full set required by branch protection on main spans four workflows; mirror the same set in the merge-queue ruleset so the queue enforces the same bar against the actual merge result, not just the PR's own head:
   - From ci.yml: Python (ruff + mypy + pytest), Python (integration testcontainers), Go (golangci-lint + go test), Helm (lint + template + kubeconform).
   - From security-scan.yml: Semgrep SAST.
   - From secret-scan.yml: TruffleHog Secret Scan.
   - From dependency-license-check.yml: Python License Check, NPM License Check. Both jobs no-op via hashFiles() when the PR doesn't touch a manifest, so they report cheap green on unrelated PRs — but they MUST run on every queue admission so branch protection's required-context list stays satisfiable.
3. The merge_group triggers in ci.yml, security-scan.yml, secret-scan.yml, and dependency-license-check.yml are the code-side prerequisite for step 2 — without each sibling workflow subscribing to merge_group, its required context would never report on queue runs and the queue would hang on missing checks.
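The "queue would hang on missing checks" condition can be expressed as a small consistency check. This is an illustrative sketch only: the required-check names are copied from the list above, while the `triggers` mapping and the function name are hypothetical stand-ins for whatever a real audit would read out of the workflow files.

```python
# Required status contexts per workflow, as listed in this section.
REQUIRED_CHECKS = {
    "ci.yml": [
        "Python (ruff + mypy + pytest)",
        "Python (integration testcontainers)",
        "Go (golangci-lint + go test)",
        "Helm (lint + template + kubeconform)",
    ],
    "security-scan.yml": ["Semgrep SAST"],
    "secret-scan.yml": ["TruffleHog Secret Scan"],
    "dependency-license-check.yml": ["Python License Check", "NPM License Check"],
}


def checks_that_would_hang(required, triggers):
    """Return required contexts whose workflow never runs on merge_group.

    `triggers` maps a workflow file name to the set of events it subscribes to.
    """
    hanging = []
    for workflow, contexts in required.items():
        if "merge_group" not in triggers.get(workflow, set()):
            hanging.extend(contexts)
    return hanging
```

An empty result means every required context has a workflow that reports on queue runs; any returned name is a check the queue would wait on indefinitely.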
Concurrency note: cancel-in-progress is conditional on
github.event_name != 'merge_group'. A cancelled queue check causes
the merge attempt to fail and the PR falls out of the queue — so
merge-queue runs are never cancelled. PR force-pushes and rapid main
commits still cancel their own prior runs as before.
| Job | Surface | Steps |
|---|---|---|
| python-lint-test (Python (ruff + mypy + pytest)) | backend/ unit + acceptance subtree | uv sync --locked --all-groups -> ruff check -> ruff format --check -> mypy --strict -> pytest -n 6 --dist loadscope (excludes tests/integration/; COVERAGE_CORE=sysmon --cov=meho_backplane --cov-report=xml on both push and PR — the per-test re-embedding cost that made --cov prohibitive on PRs was eliminated in #799; sysmon PEP 669 backend — #739) -> upload python-coverage artefact |
| python-integration (Python (integration testcontainers)) | backend/tests/integration/ | uv sync --locked --all-groups -> pytest tests/integration/ against pgvector / valkey / k3d / vcsim / vault testcontainers via DinD. Required merge gate (#698) so the lane that exercises real connector dispatch can no longer ship red. |
| go-lint-test (Go (golangci-lint + go test)) | cli/ | golangci-lint (v6 action) -> go build ./... -> go test -race -cover ./... |
| helm-lint-template (Helm (lint + template + kubeconform)) | deploy/charts/meho/ | helm lint -> helm template -> kubeconform --strict --kubernetes-version 1.28.0 |
python-lint-test runs on meho-runners-ci-heavy (dedicated ARC scale
set, 6000m requests=limits, max 5 pods — #761 / rdc-gitops#55). The
other three jobs (python-integration, go-lint-test,
helm-lint-template) run on the dense meho-runners-ci pool (4-core).
python-lint-test carries a 20-minute timeout-minutes (retuned from
the legacy 50-min cap after #799 dropped the unit-job wall to ~9 min;
the hard cap stays well above the observed wall for hang detection while
the perf-budget-guard step enforces the 10-min Goal #11 budget at the
PR level). go-lint-test and helm-lint-template carry 10 minutes;
python-integration carries 60 minutes for the container-pull + DinD
spin-up + testcontainers sweep (xdist loadgroup parallelisation tracked
in #564). Wall-clock for a green PR is the slowest job's elapsed time
because the four jobs never block each other — python-integration
typically dominates and is the dispatch surface for the Goal #11 budget
conversation.
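Because the four jobs run in parallel, the PR's wall-clock is a maximum rather than a sum. A tiny sketch of that reasoning follows; the timeout values are the ones stated above, while the function names and any durations passed in are hypothetical.

```python
# timeout-minutes per job, as stated in this section.
TIMEOUT_MINUTES = {
    "python-lint-test": 20,
    "python-integration": 60,
    "go-lint-test": 10,
    "helm-lint-template": 10,
}


def wall_clock(durations):
    """Parallel jobs: the PR's wall-clock equals the slowest job's duration."""
    return max(durations.values())


def jobs_over_timeout(durations, timeouts=TIMEOUT_MINUTES):
    """Return the jobs whose observed duration exceeds their hard cap."""
    return sorted(job for job, mins in durations.items() if mins > timeouts[job])
```

With hypothetical durations of 9.3, 25, 3 and 1 minutes for the four jobs, `wall_clock` returns 25: the integration lane sets the PR's elapsed time even though the unit job has the tighter budget.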
No step in ci.yml carries continue-on-error: true except the Python
coverage artefact upload. Linters, formatters, type checkers, tests,
and the kubeconform schema validation are all allowed to fail the job.
The artefact upload is the only soft-fail: losing it degrades the
SonarCloud signal (no coverage for the run) but never invalidates the
test outcome, and quality-gate.yml already guards its own
actions/download-artifact with continue-on-error for the same
reason.
ci.yml deliberately does not build the backplane container image.
image.yml runs on PRs that touch
backend/** or .github/workflows/image.yml (path-filtered, see
image.yml's on.pull_request.paths) with push: false — the
Dockerfile + dep-resolution gate. PRs that don't touch the backend
(chart-only, CLI-only, docs-only) skip the image build by design,
because the gate's inputs haven't changed and rebuilding would add zero
signal. Repeating the build in ci.yml would double the cost for
backend PRs and pointlessly run the gate for the non-backend PRs that
image.yml already filters out. The same reasoning applies to the
chart publish (chart.yml runs validate on PRs as a path-scoped
gate) — ci.yml exercises a parallel helm lint/helm template/
kubeconform pass unconditionally so a chart-touching regression also
fails the central CI check, but it does not duplicate the publish
path. Migration backward-compat (migration-compat.yml), dependency
license scan (dependency-license-check.yml), secret scan
(secret-scan.yml), and the SAST stack (security-scan.yml) all stay
in their dedicated workflows.
On pushes to main and on PRs, the Python job runs
COVERAGE_CORE=sysmon uv run pytest ... --cov=meho_backplane --cov-report=xml tests/
and uploads backend/coverage.xml as the python-coverage artefact.
quality-gate.yml listens
on workflow_run: workflows: ["CI"], downloads that exact artefact
name via actions/download-artifact@v4, and feeds the XML into the
SonarCloud scan. The workflow name (CI) and the artefact name
(python-coverage) are the load-bearing contract between the two
workflows — changing either side without the other would silently lose
coverage reporting in SonarCloud.
--cov runs on both push and PR as of the post-#799 state. #726
originally gated --cov to push-only on the belief that pytest-cov
was the unit job's dominant cost; the #771 diagnostic (#793) disproved
that — the real cost was per-test descriptor re-embedding (fixed in
#799), not coverage instrumentation. With the embedding re-fetch
eliminated and sysmon's lower overhead, --cov adds only ~1 min (pytest
~8m33s with cov vs ~7m35s without; run 26245676016), so the unit job
stays ~9.3 min — under the Goal #11 10-min budget — while PRs gain
SonarCloud Clean-as-You-Code new-code-coverage decoration on every PR
instead of updating one merge late. The quality-gate.yml whole-job
continue-on-error means a missing or late artefact never blocks a
merge. COVERAGE_CORE=sysmon (#739) swaps coverage.py's default C
tracer for the PEP 669 sys.monitoring backend (Python 3.12+,
supported by coverage.py 7.4+; the lockfile pins 7.14). Sysmon's
event-driven model removes most of the per-line tracing tax; line
counts matched the C tracer exactly (2913/11832 in both), so the
SonarCloud signal is unaffected.
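The "~1 min" overhead figure follows directly from the two pytest timings quoted above. A short worked calculation, using only those two numbers (the helper name is illustrative):

```python
def to_seconds(minutes, seconds):
    """Convert a m:ss wall time to seconds."""
    return minutes * 60 + seconds


with_cov = to_seconds(8, 33)       # pytest with --cov: 513 s
without_cov = to_seconds(7, 35)    # pytest without --cov: 455 s

overhead_seconds = with_cov - without_cov             # 58 s, i.e. about 1 min
overhead_percent = 100 * overhead_seconds / without_cov  # about 12.7 %
```

So coverage under the sysmon backend costs roughly 58 seconds, or about 12.7 % of the uninstrumented pytest time in that run.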
Every job carries the same if: guard the publish workflows use:
if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository

This is defence in depth on top of branch protection — the
meho-runners-ci self-hosted runner pool is internal infrastructure, so
arbitrary code from a forked PR (go test, pytest, helm template
with custom values) is never allowed to execute on it. The OR
short-circuits on push events so main-branch CI is unaffected.
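The guard's truth table is easy to misread, so here is a minimal Python model of it. This is a sketch of the expression's logic, not code from the repository; the function name is hypothetical, and the lower-casing mirrors GitHub Actions' case-insensitive string comparison.

```python
def job_should_run(event_name, head_repo_full_name, repository):
    """Model of: event_name != 'pull_request' || head.repo.full_name == repository."""
    if event_name != "pull_request":
        # push / merge_group: the left operand is true, so the OR short-circuits
        # and the (absent) pull_request payload is never consulted.
        return True
    # pull_request: only same-repository heads may use the self-hosted pool.
    return (head_repo_full_name or "").lower() == repository.lower()
```

Only the combination "pull_request event from a different head repository" evaluates to false, which is exactly the fork-PR case the guard exists to skip.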
Every gate the workflow runs can be reproduced locally with the same commands.
Each toolchain command runs in its own subshell so cd never leaks
between sections: copy-paste the whole block from the repo root and every
command lands in the correct subdir on its own.
# Python
(cd backend && uv sync --locked --all-groups)
(cd backend && uv run ruff check src/ tests/)
(cd backend && uv run ruff format --check src/ tests/)
(cd backend && uv run mypy src/)
(cd backend && uv run pytest -x --cov=meho_backplane --cov-report=term tests/)
# To mirror the CI coverage run (both push and PR, Python 3.12+):
# (cd backend && COVERAGE_CORE=sysmon uv run pytest -n 6 --dist loadscope --maxfail=1 \
# --ignore=tests/integration --cov=meho_backplane --cov-report=xml tests/)
# Go
# CGO_ENABLED=1 is required for `go test -race` — same reason ci.yml
# sets it on the race step. The build/lint steps don't need cgo.
(cd cli && golangci-lint run)
(cd cli && go build ./...)
(cd cli && CGO_ENABLED=1 go test -race -cover ./...)
# Helm + kubeconform (run from repo root)
helm lint deploy/charts/meho/ <same --set overrides as ci.yml>
helm template test deploy/charts/meho/ <same --set overrides> > /tmp/rendered.yaml
kubeconform -strict -kubernetes-version 1.28.0 -ignore-missing-schemas -summary /tmp/rendered.yaml

The --set override block is the same one this doc's ## Verification
section above documents — ci.yml, chart.yml, and the operator
copy-paste in backend/README.md all keep
it in sync intentionally. Any drift means one of the three is wrong.
All third-party actions in ci.yml are pinned to immutable SHAs with
the human-readable tag in a trailing comment. The SHAs match the ones
the publish workflows use where the action overlaps
(actions/checkout, actions/setup-go, azure/setup-helm,
actions/upload-artifact), so a single supply-chain audit covers all
of CI. Two actions are unique to ci.yml:
- astral-sh/setup-uv@v8.1.0 — the uv installer, with enable-cache: true keyed by uv.lock.
- golangci/golangci-lint-action@v6.5.2 — pinned to the v6 major, not v7+/v8+, because the in-tree cli/.golangci.yml is written in the golangci-lint v1 config schema (disable-all + explicit enable: list, v1 linter names like gosimple / errcheck). golangci-lint-action v7.x+ defaults to the v2 binary which rejects v1 configs; pinning the binary to v1.64.8 keeps the existing config valid. A future migration to the v2 schema flips both the action major and the binary version together.
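The "immutable SHA plus trailing tag comment" convention lends itself to a mechanical audit. The sketch below is a hypothetical helper, not a tool that exists in the repository: it assumes each pinned reference has the shape `uses: owner/repo@<40-hex-sha>  # vX.Y.Z` on a single line and flags anything else.

```python
import re

# Expected shape: uses: owner/repo[/path]@<40 hex chars>  # vX.Y.Z
PINNED = re.compile(
    r"^\s*(?:-\s*)?uses:\s*[\w.\-]+/[\w.\-/]+@[0-9a-f]{40}\s+#\s*v\S+\s*$"
)


def unpinned_actions(workflow_text):
    """Return the `uses:` lines that are not pinned to a full commit SHA."""
    bad = []
    for line in workflow_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            stripped = stripped[2:].strip()
        if not stripped.startswith("uses:"):
            continue
        target = stripped[len("uses:"):].strip()
        if target.startswith("./"):
            continue  # local actions live in the repo and carry no SHA
        if not PINNED.match(line):
            bad.append(stripped)
    return bad
```

Running it over each workflow file and failing on a non-empty result would keep the single-audit property described above from eroding as actions are added.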
pr-smoke.yml is the per-PR ephemeral-cluster discipline Goal #11 DoD
bullet 4 hangs off: every PR against main builds a PR-tagged
backplane image, deploys the chart into a fresh meho-ci-<pr-number>
namespace on the consumer-operated rke2-infra Kubernetes cluster
(claude-rdc-hetzner-dc), runs scripts/ci/pr-smoke.sh, and tears the
namespace down regardless of smoke outcome. It is the inversion of
MEHO.X's failure mode — every code path that ships through G2.0–G2.6
closes the real-target feedback loop on a real Kubernetes API before
merge, not against mocks.
| Property | Value |
|---|---|
| Event | pull_request_target against main (opened, synchronize, reopened) |
| Runner | meho-runners-ci (self-hosted) |
| Concurrency group | pr-smoke-${{ github.event.pull_request.number }} with cancel-in-progress: true |
| Permissions | contents: read, packages: write, id-token: write, pull-requests: write |
| Job timeout | 12 min (8 min smoke budget + cold-cache headroom — Task #50 AC #5) |
pull_request_target (not pull_request) is the load-bearing choice:
GitHub Actions executes the workflow file from main, not from the PR
head ref. That trigger is documented for "needs org secrets / OIDC on
PRs" precisely because the workflow body the runner executes is the
trusted base-branch version, not whatever the PR author pushed. The
job-level fork-PR guard (same shape as ci.yml / image.yml /
chart.yml) then skips fork PRs entirely so no untrusted Dockerfile
or shell ever executes on the self-hosted runner pool with secret
access. Same-repo PRs run with full secret + OIDC access; the PR head
SHA is checked out by SHA (not by ref) so a force-push during an
in-flight run cannot inject newer code after the secret-access gate
fired. The pull_request_target hardening pattern is lifted from
securitylab.github.com/research/github-actions-preventing-pwn-requests/.
cancel-in-progress: true on the per-PR concurrency group means an
author push during an in-flight run cancels the prior run; the
always-teardown step on the cancelled run still fires (GitHub Actions
runs if: always() steps even on cancellation) so the namespace is
reclaimed. Two PRs share no concurrency group, so the runner pool's
smoke capacity scales linearly with PR throughput.
Cluster auth + RBAC for meho-ci-* namespaces is provisioned on
evoila-bosnia/claude-rdc-hetzner-dc,
not in this repo — see
docs/cross-repo/rke2-infra-coordination.md
for the full contract (Section 1: auth options; Section 2: RBAC verb
set; Verification: end-to-end check). The consumer-side tracker is
evoila-bosnia/meho-internal#53
(G2.7-T5).
The workflow ships now and is skipped at job level (skipped, not red) while the consumer side is still being provisioned. Once auth lands and a maintainer sets the repository variable, the gate turns on with no workflow edit required. The gate is:
if: |
(github.event.pull_request.head.repo.full_name == github.repository) &&
(vars.RKE2_SMOKE_ENABLED == 'true')

GitHub Actions does not expose the secrets context to job-level
if: expressions (only github, needs, vars, and inputs are
available — see the
GitHub Actions contexts reference).
A clause like secrets.RDC_KUBECONFIG != '' therefore collapses to
undefined at evaluation time and the gate silently breaks. The single
repository-scoped vars.RKE2_SMOKE_ENABLED is the documented gate;
maintainers flip it to 'true' after the consumer-side auth (Task
#53) lands. The "Build kubeconfig" step still inspects
env.RKE2_CA_CERT / env.RDC_KUBECONFIG to pick Option A vs Option B
at step level (where the env indirection makes secrets.* legal).
Two auth modes are supported, matching the cross-repo doc's Sections 1A and 1B:
| Mode | Required from consumer | Selected when |
|---|---|---|
| Option A (OIDC trust) | RKE2_CA_CERT secret + RKE2_API_SERVER var + apiserver --oidc-issuer-url=https://token.actions.githubusercontent.com (or AuthenticationConfiguration) | RKE2_CA_CERT is set AND RKE2_API_SERVER is set |
| Option B (kubeconfig) | RDC_KUBECONFIG secret (base64-encoded SA kubeconfig) | Only RDC_KUBECONFIG is set |
When both are set, Option A wins (the cross-repo doc's preference
order: short-lived OIDC tokens vs. a long-lived stored kubeconfig).
When neither is set, the "Build kubeconfig" step errors out, but
the job-level vars.RKE2_SMOKE_ENABLED gate is the upstream guard
that prevents that path from ever running while the consumer side
has not yet rolled out.
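The selection rule in the table reduces to a short precedence check. The following is a minimal sketch of that rule, not the step's actual shell; the function name is hypothetical, and treating "CA cert set without an API server" as a fallback to Option B is an assumption, since the table only covers the clean cases.

```python
def select_auth_mode(ca_cert, api_server, kubeconfig):
    """Pick the kubeconfig-building mode from the configured inputs.

    Each argument is the value of the corresponding secret or variable;
    an empty string or None counts as "not set".
    """
    if ca_cert and api_server:
        # Option A wins whenever both of its inputs are present,
        # even if a stored kubeconfig also exists.
        return "A"
    if kubeconfig:
        return "B"
    raise ValueError("neither OIDC inputs nor RDC_KUBECONFIG are configured")
```

The ordering encodes the stated preference for short-lived OIDC tokens over a long-lived stored kubeconfig, and the final `raise` corresponds to the step erroring out when nothing is provisioned.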
The repository-scoped vars.RKE2_SMOKE_ENABLED == 'true' clause is the
single enable knob for maintainers: once the consumer side rolls
out and the secrets are provisioned, set the variable to true and
the gate flips on for every subsequent PR. Useful for the 5 consecutive green smokes Goal #11 window — flip the var, queue 5 PRs,
count. While the variable is unset (or 'false'), the workflow is
skipped at job level (not failure), so PRs are never blocked on
a consumer side that has not rolled out yet.
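A small model of the job-level gate makes the "unset means skipped" behaviour explicit. This is a sketch under assumptions: the function name is hypothetical, an unset variable is modelled as an empty string, and the lower-casing mirrors GitHub Actions' case-insensitive string comparison.

```python
def smoke_job_runs(head_repo_full_name, repository, smoke_enabled_var):
    """Model of the two-clause job-level `if:` gate on the smoke job."""
    same_repo = (head_repo_full_name or "").lower() == repository.lower()
    enabled = (smoke_enabled_var or "").lower() == "true"
    # Both clauses must hold; otherwise the job is skipped, not failed.
    return same_repo and enabled
```

Any value other than 'true', including an unset variable, leaves the job skipped, which is why flipping the variable is the only action a maintainer needs to take.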
A single job (smoke) with sequential steps. Splitting into multiple
jobs (build / deploy / smoke / teardown) would force per-job runner
startup overhead onto every PR's 8-minute budget, and if: always()
would have to traverse needs: edges with explicit
if: ${{ always() && needs.deploy.result != 'skipped' }} plumbing —
a single-job layout keeps the teardown invariant trivially correct
("always() runs even on cancel").
Steps, in order:
1. Checkout PR head SHA — actions/checkout@v6.0.2 against ${{ github.event.pull_request.head.sha }} (not head.ref).
2. QEMU + buildx + GHCR login — same action SHAs as image.yml.
3. Build + push PR-tagged image — docker/build-push-action@v7.1.0 pushing ghcr.io/evoila/meho:pr-<n>-<sha>. The <sha> suffix makes the tag immutable per push: a force-push to the PR branch produces a NEW tag (new sha), so the deploy step always pulls the build the smoke is about to assert against. amd64-only — the per-PR feedback loop optimises for time, not for proving multi-arch (that invariant belongs to image.yml on main-push).
4. Install kubectl + Helm — azure/setup-kubectl@v5.1.0 (v1.28.15 tracking the chart's kubeVersion: ">=1.28.0-0" floor) + azure/setup-helm@v4.3.1 (same SHA as chart.yml).
5. Configure kubectl (OIDC mode) — actions/github-script@v9.0.0 mints an OIDC ID token via core.getIDToken(audience) against the consumer-chosen audience (default rke2-infra.evba.lab, override via vars.RKE2_OIDC_AUDIENCE). Skipped when Option A's inputs aren't set, so Option B's kubeconfig path takes over below.
6. Build kubeconfig — assembles $HOME/.kube/config from either the OIDC token + CA cert (Option A) or the base64-decoded kubeconfig secret (Option B). Pins the default namespace on the active context. Surfaces kubectl auth whoami in logs so any later RBAC denial is debuggable.
7. Create ephemeral namespace — idempotent apply --dry-run=client -o yaml | apply -f - so a leftover namespace from a cancelled run doesn't trip AlreadyExists. Labels the namespace with meho.io/managed-by=pr-smoke and meho.io/pr-number=<n> for consumer-side audit-log filtering.
8. Helm install — helm upgrade --install meho deploy/charts/meho/ -f deploy/values-examples/values-rdc-example.yaml --set image.tag=pr-<n>-<sha> --wait --timeout 5m. Uses the in-tree example overlay as the base (real-target fixture layout) with the PR-tagged image as the only override.
9. Run smoke — bash scripts/ci/pr-smoke.sh "$NS". See "Smoke contract" below.
10. Teardown — if: always(). helm uninstall + kubectl delete namespace --wait=false --ignore-not-found. || true on each so a partial-cleanup error doesn't block the namespace delete that follows. Final kubectl get namespace "$NS" echo for observability.
11. PR comment — if: always() && github.event.pull_request.number != ''. actions/github-script@v9.0.0 posts a one-paragraph pass/fail summary with the workflow-run link.
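The naming scheme that ties the build, namespace, and install steps together can be sketched as a few pure functions. This is illustrative only: the helper names are hypothetical, the workflow itself does this in YAML and shell, and the sketch does not assume whether `<sha>` is the full or an abbreviated commit SHA.

```python
IMAGE_REPO = "ghcr.io/evoila/meho"  # image repository named in this doc


def pr_image_tag(pr_number, head_sha):
    """Tag is unique per push: a new head SHA always yields a new tag."""
    return f"pr-{pr_number}-{head_sha}"


def smoke_namespace(pr_number):
    """Ephemeral namespace for one PR's smoke run."""
    return f"meho-ci-{pr_number}"


def helm_install_argv(pr_number, head_sha):
    """argv for the install step; the namespace comes from the kubeconfig context."""
    return [
        "helm", "upgrade", "--install", "meho", "deploy/charts/meho/",
        "-f", "deploy/values-examples/values-rdc-example.yaml",
        "--set", f"image.tag={pr_image_tag(pr_number, head_sha)}",
        "--wait", "--timeout", "5m",
    ]
```

Because the tag embeds the head SHA, two pushes to the same PR can never resolve to the same image reference, which is what lets the deploy step trust that it is testing the build it just produced.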
The smoke script is deliberately scoped to the unauthenticated operator surface — four assertions, no Keycloak access token:
| Endpoint | Assertion | Why |
|---|---|---|
| /healthz | HTTP 200 | Liveness probe contract; the in-cluster kubelet uses this exact path |
| /version | git_sha present and not "unknown" | Confirms the deployed image carries build metadata (image.yml / pr-smoke.yml pass GIT_SHA as a Docker build-arg, #631) and isn't a fallback build |
| /version | chart_version present, not "unknown", non-empty | Confirms the helm-installed chart injected CHART_VERSION from .Chart.Version (#631) — the deployed-chart provenance the governance backplane exists to answer |
| /api/v1/health | HTTP 401 unauthenticated | Negative auth test — a 200 here would mean auth middleware regressed open OR Keycloak realm is wired wrong; both are PR-blocking regressions Goal #11 considers non-negotiable |
The full authenticated federation-chain smoke (claude-rdc-hetzner-dc/manifests/meho/smoke.sh — operator-facing, real Keycloak + Vault credentials, against the persistent install) is out of scope for the per-PR ephemeral lane: every PR provisioning a Keycloak realm + Vault role would be both slow and a security liability. Goal #11 G2.8 covers the authenticated smoke against the production-style instance.
Script invariants:
- set -euo pipefail aborts on first failure (a half-ready backplane shouldn't be probed for more endpoints than the first one that failed).
- Inline literal compare ([ "$X" = "200" ]), not -eq family — -eq 200 also matches an empty string from a failed curl on some bash builds. Literal-string comparison fails-loud as intended.
- Background kubectl port-forward PID captured into PF_PID with an EXIT trap that kills it even on bash abort, so a CI runner doesn't leak the port-forward process for the next job on the same runner.
- curl --retry 5 --retry-delay 1 --retry-connrefused covers the port-forward warm-up gap — the socket is bound before the kubectl-proxy handshake fully settles. The retry budget (5s total) is shorter than helm's --wait budget (5m) so a failing rollout surfaces in helm, not in the smoke retry.
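For readers who want to probe a deployment by hand, the four assertions of the smoke contract can be sketched in Python. This is not the contents of scripts/ci/pr-smoke.sh: the function names are hypothetical, the sketch collects every failure instead of aborting on the first one, and it assumes /version returns a JSON object with git_sha and chart_version keys as the table implies.

```python
import json
import urllib.error
import urllib.request


def http_fetch(base_url):
    """Return fetch(path) -> (status, parsed_json_or_None) bound to base_url."""
    def fetch(path):
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                status, raw = resp.status, resp.read()
        except urllib.error.HTTPError as err:
            # 4xx/5xx surface as exceptions; the 401 check needs the code.
            status, raw = err.code, err.read()
        try:
            return status, json.loads(raw)
        except ValueError:
            return status, None
    return fetch


def run_smoke(fetch):
    """Run the four assertions; an empty list means the smoke passed."""
    failures = []
    status, _ = fetch("/healthz")
    if status != 200:
        failures.append(f"/healthz returned {status}, expected 200")
    _, version = fetch("/version")
    version = version if isinstance(version, dict) else {}
    for field in ("git_sha", "chart_version"):
        value = version.get(field)
        if not value or value == "unknown":
            failures.append(f"/version {field} is missing, empty, or unknown")
    status, _ = fetch("/api/v1/health")
    if status != 401:
        failures.append(f"/api/v1/health returned {status}, expected 401")
    return failures
```

With a port-forward open locally, `run_smoke(http_fetch("http://127.0.0.1:8000"))` would exercise the same unauthenticated surface; the local port here is an assumption, chosen to match the backplane's port 8000.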
| Criterion (Task #50 AC) | Verification path |
|---|---|
| Workflow exists; runs after PR open/update | gh workflow list --repo evoila/meho shows pr-smoke |
| Pushes image to ghcr.io/evoila/meho:pr-<n> | Verifiable on a live PR run once consumer-side auth is provisioned |
| Deploys chart to meho-ci-<n> on rke2-infra | Same — deferred-AC pending consumer side |
| Tears down namespace regardless of smoke result | if: always() on teardown step; cancel-in-progress: true invariant covered in concurrency block above |
| Smoke result posted as PR comment | actions/github-script step with if: always() gate |
| Concurrency: PR update cancels prior smoke | concurrency.cancel-in-progress: true |
| Wall-clock < 8min for green smoke | 12-min job timeout with 8-min headline budget; cold buildx cache is the typical worst case; main-push image.yml's cache fills shared layers |
Deferred-AC: every criterion that requires a live run against
rke2-infra (image-push verification, namespace-create-then-teardown
verifiability) is gated on the consumer-side OIDC trust OR
RDC_KUBECONFIG secret landing. Until then the workflow ships in
"skipped at job level" mode and the AC bullets stay open in the
Task body's tracking checklist. The cross-repo doc's "Status" table
is the source of truth for when these bullets close.
pr-smoke.yml is the per-PR ephemeral-cluster layer above the
toolchain matrix; the layers below it are unchanged and run in
parallel on every PR:
| Layer | Workflow | Surface | Cost |
|---|---|---|---|
| Toolchain matrix | ci.yml | Python + Go + Helm lint/template/test | ~5 min |
| Image gate | image.yml | Dockerfile + deps build (path-filtered) | ~3 min on backend PRs, skipped otherwise |
| Chart validation | chart.yml | helm lint + template + kubeconform | ~1 min |
| Per-PR ephemeral | pr-smoke.yml | Build → deploy → smoke → teardown | ~6-8 min on green |
| Migration backward-compat | migration-compat.yml | Path-scoped to backend/alembic/versions/** | ~30s when triggered |
pr-smoke.yml does not duplicate the image build with
image.yml — it pushes a transient PR-tag, while image.yml on PRs
builds without pushing (gate only). The two workflows share the GHA
buildx cache scope, so the smoke's build typically hits a warm cache
when application code is the only delta.
- Image — ghcr.io/evoila/meho:<tag> from G2.4 (#31). Multi-arch (amd64 + arm64), cosign-signed, SBOM-attested.
- Migration runner — invoked by the pre-install,pre-upgrade Job hook defined in templates/migration-job.yaml; shells out to the entrypoint added in G2.3-T3 (#29) — python -m meho_backplane.db.migrate.
- Broadcast subchart — in-tree at charts/broadcast/, Valkey 9.x per ADR 0005. Declared in Chart.yaml's dependencies: block with repository: "" (local unpacked subchart; no helm dependency update needed).
- External Secrets Operator (ESO) — owns the Kubernetes Secrets the chart references (postgres.credentialsSecret, future Keycloak client secret, future Vault role bindings). ESO is consumer-owned by default; the chart references the synced Secrets by name only. The chart can optionally render ExternalSecret resources itself when eso.enabled: true — see the ESO patterns section above.
- HPA / PDB / topologySpreadConstraints / ServiceMonitor / PrometheusRule — deferred to v0.2. v0.1 is single-replica per Goal #11 scope.
- Broadcast subchart HA (Sentinel/Cluster), persistence, auth — deferred to v0.2 per ADR 0005.
- broadcast.externalEndpoint opt-out for operators with a managed Redis/Valkey already running — deferred to v0.2 (the broadcast.enabled: false knob lands in v0.1 as the disable path).
The v0.1 deploy contract crosses one repo boundary: evoila/meho
produces the chart + image; the dogfooding consumer
evoila-bosnia/claude-rdc-hetzner-dc
operates the rke2-infra cluster the per-PR ephemeral smoke and
post-merge deploy run against. The handshake spec — cluster auth
options (OIDC trust preferred over a long-lived kubeconfig secret),
namespace-scoped RBAC for meho-ci-*, the meho-image-pushed
repository_dispatch event shape, and the verification commands
either side can run to prove the contract holds — lives in
docs/cross-repo/rke2-infra-coordination.md.
The companion script
scripts/cross-repo/verify-rke2-access.sh
automates the kubectl portion of the check.
Goal #11's Definition of Done is a deploy contract — it is satisfied
when a claude-rdc-hetzner-dc operator can install, smoke, roll back,
and observe a MEHO instance against the real lab in bounded wall-clock
time. The producer-side acceptance contracts that codify what
"passing" looks like for each DoD bullet live in
docs/acceptance/, each pointing at the
companion verifier shell under
scripts/acceptance/.
| DoD bullet | Contract | Verifier |
|---|---|---|
| 1 — install.sh cold-deploy → working MEHO in <5 min | docs/acceptance/install.md | scripts/acceptance/install-verify.sh |
| 2 — smoke.sh passes (login + status + audit-row + Vault + DB-migration state) — federation chain end-to-end | docs/acceptance/smoke.md | scripts/acceptance/smoke.sh |
| 3 — helm rollback meho end-to-end with a non-trivial schema diff (cluster-level forward-compat proof) | docs/acceptance/rollback.md | scripts/acceptance/rollback-verify.sh (sample N+1 migration at scripts/acceptance/synthetic-n-plus-1.sql) |
| 4 + 5 — 5-consecutive-merged-PR green-smoke counter + targets.yaml rdc-meho entry (deploy-stability proof + chassis registration) | docs/acceptance/green-counter.md | producer-side contract only — counter implementation and the targets.yaml entry land on the consumer side per the cross-repo split. Schema for the targets.yaml entry lives at docs/cross-repo/targets-yaml.md; the draft consumer-side issue body the maintainer files is at docs/cross-repo/issue-58-consumer-ticket-body.md; the README ships a badge placeholder the maintainer swaps for the live Shields endpoint URL once the consumer-side counter is up |
The split between producer-owned contracts + verifiers and
consumer-owned wrappers (install.sh, smoke.sh,
rollback-drill.sh, …) is the same shape as the cross-repo
handshake above: the chart producer owns "what passing means"; the
consumer owns "how to drive the install on this environment". The
verifier is invoked as the last step of the consumer's wrapper, and
the verifier's exit code becomes the wrapper's exit code.
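The "verifier's exit code becomes the wrapper's exit code" rule can be sketched as follows. This is a hypothetical Python rendering of the pattern, not the consumer's actual install.sh or smoke.sh; the function name and the choice to stop early when the install itself fails are assumptions.

```python
import subprocess
import sys


def run_wrapper(install_argv, verifier_argv):
    """Drive the install, then let the producer-owned verifier decide the result."""
    install = subprocess.run(install_argv)
    if install.returncode != 0:
        # Assumption: a failed install is reported as-is and the verifier is skipped.
        return install.returncode
    # The verifier runs last, and its exit code is the wrapper's exit code.
    return subprocess.run(verifier_argv).returncode
```

A consumer script would end with something like `sys.exit(run_wrapper([...], ["bash", "scripts/acceptance/install-verify.sh"]))`, so that CI and operators see exactly the verdict the producer's contract defines.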
The rollback contract is the cluster-level half of the
forward-compat assurance Goal #11 DoD bullet 3 promises; the
unit-level half lives at
backend/tests/test_migration_rollback.py
(Task #30, Initiative #26) and runs on every PR via
.github/workflows/migration-compat.yml.
Together they assert "the N image runs cleanly against the N+1
schema" at two layers: testcontainers in CI (fast, every PR) and
real helm rollback against the lab (slow, Goal-closing milestone).
- Parent Goal: #11 — Deployable v0.1
- Parent Initiative: #36 — G2.5 Helm chart
- Parent Initiative: #48 — G2.7 CI/CD + per-PR ephemeral smoke
- Task #50 (G2.7-T2) — Per-PR ephemeral cluster deploy + smoke + teardown
- Task #53 (G2.7-T5) — Cross-repo coordination tracker (consumer-side kubeconfig + RBAC)
- Task #55 (G2.8-T1) — install.sh cold-deploy acceptance contract + verifier (docs/acceptance/install.md, scripts/acceptance/install-verify.sh)
- Task #56 (G2.8-T2) — smoke.sh federation-chain acceptance contract + verifier (docs/acceptance/smoke.md, scripts/acceptance/smoke.sh)
- Task #57 (G2.8-T3) — helm rollback end-to-end acceptance contract + verifier (docs/acceptance/rollback.md, scripts/acceptance/rollback-verify.sh, scripts/acceptance/synthetic-n-plus-1.sql)
- Task #58 (G2.8-T4) — 5-consecutive-merged-PR green-smoke counter contract + targets.yaml rdc-meho schema (docs/acceptance/green-counter.md, docs/cross-repo/targets-yaml.md, docs/cross-repo/issue-58-consumer-ticket-body.md, README badge placeholder); consumer-side counter implementation + targets.yaml entry tracked on claude-rdc-hetzner-dc
- Task #30 (G2.3-T4) — unit-level forward-compat regression test (backend/tests/test_migration_rollback.py)
- helm rollback reference: https://helm.sh/docs/helm/helm_rollback/
- Helm chart hooks reference: https://helm.sh/docs/topics/charts_hooks/
- GitHub Actions OIDC: https://docs.github.com/en/actions/concepts/security/openid-connect
- pull_request_target hardening guide: https://securitylab.github.com/research/github-actions-preventing-pwn-requests/
- Helm chart structure: https://helm.sh/docs/topics/charts/
- Kubernetes NetworkPolicy: https://kubernetes.io/docs/concepts/services-networking/network-policies/
- cert-manager Ingress annotations: https://cert-manager.io/docs/usage/ingress/
- External Secrets Operator: https://external-secrets.io/
- ESO Vault provider: https://external-secrets.io/latest/provider/hashicorp-vault/
- ESO ExternalSecret API: https://external-secrets.io/latest/api/externalsecret/
- Cross-repo handshake spec: docs/cross-repo/rke2-infra-coordination.md