Skip to content

chore: integrate prometheus in tilt#1140

Open
andyatmiami wants to merge 1 commit into
kubeflow:notebooks-v2from
andyatmiami:chore/tilt-prometheus-scaffold
Open

chore: integrate prometheus in tilt#1140
andyatmiami wants to merge 1 commit into
kubeflow:notebooks-v2from
andyatmiami:chore/tilt-prometheus-scaffold

Conversation

@andyatmiami
Copy link
Copy Markdown
Contributor

@andyatmiami andyatmiami commented Jun 2, 2026

ℹ️ NO GH ISSUE

Add opt-in Prometheus support to the KinD/Tilt development environment so developers can verify that metrics are properly exposed, ServiceMonitor manifests are valid, and the full scrape pipeline works end-to-end.

Development environment (developing/)

  • Add setup-prometheus.sh script that installs the Prometheus Operator via upstream bundle.yaml and deploys a minimal Prometheus instance in prometheus-system namespace. Follows the same idempotent pattern as setup-cert-manager.sh and setup-istio.sh.

  • Add manifests/prometheus/prometheus.yaml with the Prometheus CR, ServiceAccount, ClusterRole, and ClusterRoleBinding. Configured with minimal resources (128Mi/256Mi), 2h retention, and wildcard ServiceMonitor discovery across all namespaces.

  • Add manifests/overlays/controller-prometheus/kustomization.yaml, a dev-only kustomize overlay that layers the existing prometheus component on top of the istio overlay. This avoids creating a new overlay in the controller's published manifests — enabling prometheus is a dev-environment concern, not a vendoring decision.

  • Update Makefile with setup-prometheus target, conditionally invoked by tilt-up when ENABLE_PROMETHEUS=true.

  • Update Tiltfile with ENABLE_PROMETHEUS env var, conditional overlay selection, and a prometheus-ui local_resource that port-forwards to localhost:9090.

Production manifests (workspaces/)

NetworkPolicy for metrics scraping

Add network-policy.yaml to the controller's prometheus kustomize component. Without this, the existing default-deny-ingress policy blocks Prometheus from reaching the controller's metrics port. Uses from: [] (allow from any source) scoped to the metrics port, matching the same pattern used by the istio component's webhook NetworkPolicy.

Remove managed-by from CRD selectors

The common kustomize component previously used a single labels block with includeSelectors: true for all three standard labels, including app.kubernetes.io/managed-by: kustomize. This injected managed-by into every selector field — including CRD-specific selectors like ServiceMonitor.spec.selector and AuthorizationPolicy.spec.selector.

Deployment tools (Tilt, ArgoCD, Flux) override managed-by on resource metadata labels but cannot follow CRD-specific selector fields, causing selectors to not match their target resources. For example, Tilt sets managed-by: tilt on the metrics Service labels, but the ServiceMonitor selector still expects managed-by: kustomize.

Fix by splitting the common component's labels block in all three modules (controller, backend, frontend):

  • name and part-of: includeSelectors: true (stable identifiers)
  • managed-by: includeSelectors: false, includeTemplates: true (deployment method label — belongs on metadata, not in selectors)

Also remove hardcoded managed-by: kustomize from CRD selectors in:

  • controller/components/prometheus/monitor.yaml (ServiceMonitor)
  • backend/components/istio/authorization-policy.yaml
  • frontend/components/istio/authorization-policy.yaml

Usage

Default — no Prometheus (unchanged behavior)
make tilt-up

With Prometheus enabled
ENABLE_PROMETHEUS=true make tilt-up

@github-project-automation github-project-automation Bot moved this to Needs Triage in Kubeflow Notebooks Jun 2, 2026
@google-oss-prow google-oss-prow Bot added the area/backend area - related to backend components label Jun 2, 2026
@google-oss-prow google-oss-prow Bot requested review from caponetto and thaorell June 2, 2026 14:22
@google-oss-prow google-oss-prow Bot added area/ci area - related to ci area/controller area - related to controller components area/frontend area - related to frontend components area/v2 area - version - kubeflow notebooks v2 size/L labels Jun 2, 2026
@andyatmiami
Copy link
Copy Markdown
Contributor Author

BEHOLD! (controller) METRICS!

image

Add opt-in Prometheus support to the KinD/Tilt development environment
so developers can verify that metrics are properly exposed, ServiceMonitor
manifests are valid, and the full scrape pipeline works end-to-end.

## Development environment (developing/)

- Add `setup-prometheus.sh` script that installs the Prometheus Operator
  via upstream `bundle.yaml` and deploys a minimal Prometheus instance
  in `prometheus-system` namespace. Follows the same idempotent pattern
  as `setup-cert-manager.sh` and `setup-istio.sh`.

- Add `manifests/prometheus/prometheus.yaml` with the Prometheus CR,
  ServiceAccount, ClusterRole, and ClusterRoleBinding. Configured with
  minimal resources (128Mi/256Mi), 2h retention, and wildcard
  ServiceMonitor discovery across all namespaces.

- Add `manifests/overlays/controller-prometheus/kustomization.yaml`, a
  dev-only kustomize overlay that layers the existing prometheus
  component on top of the istio overlay. This avoids creating a new
  overlay in the controller's published manifests — enabling prometheus
  is a dev-environment concern, not a vendoring decision.

- Update `Makefile` with `setup-prometheus` target, conditionally
  invoked by `tilt-up` when `ENABLE_PROMETHEUS=true`.

- Update `Tiltfile` with `ENABLE_PROMETHEUS` env var, conditional
  overlay selection, and a `prometheus-ui` local_resource that
  port-forwards to localhost:9090.

## Production manifests (workspaces/)

### NetworkPolicy for metrics scraping

Add `network-policy.yaml` to the controller's prometheus kustomize
component. Without this, the existing `default-deny-ingress` policy
blocks Prometheus from reaching the controller's metrics port. Uses
`from: []` (allow from any source) scoped to the metrics port, matching
the same pattern used by the istio component's webhook NetworkPolicy.

### Remove `managed-by` from CRD selectors

The `common` kustomize component previously used a single `labels` block
with `includeSelectors: true` for all three standard labels, including
`app.kubernetes.io/managed-by: kustomize`. This injected `managed-by`
into every selector field — including CRD-specific selectors like
`ServiceMonitor.spec.selector` and `AuthorizationPolicy.spec.selector`.

Deployment tools (Tilt, ArgoCD, Flux) override `managed-by` on resource
metadata labels but cannot follow CRD-specific selector fields, causing
selectors to not match their target resources. For example, Tilt sets
`managed-by: tilt` on the metrics Service labels, but the ServiceMonitor
selector still expects `managed-by: kustomize`.

Fix by splitting the `common` component's `labels` block in all three
modules (controller, backend, frontend):
- `name` and `part-of`: `includeSelectors: true` (stable identifiers)
- `managed-by`: `includeSelectors: false, includeTemplates: true`
  (deployment method label — belongs on metadata, not in selectors)

Also remove hardcoded `managed-by: kustomize` from CRD selectors in:
- `controller/components/prometheus/monitor.yaml` (ServiceMonitor)
- `backend/components/istio/authorization-policy.yaml`
- `frontend/components/istio/authorization-policy.yaml`

## Usage

  # Default — no Prometheus (unchanged behavior)
  make tilt-up

  # With Prometheus enabled
  ENABLE_PROMETHEUS=true make tilt-up

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Andy Stoneberg <astonebe@redhat.com>
@andyatmiami andyatmiami force-pushed the chore/tilt-prometheus-scaffold branch from 5672dd6 to 87840dc Compare June 2, 2026 14:42
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thesuperzapper for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@christian-heusel
Copy link
Copy Markdown
Contributor

Current manifests diff:

Backend
diff -up manifests-diff/backend/*
--- /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/backend/base	2026-06-04 16:53:05.529052312 +0200
+++ /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/backend/head	2026-06-04 16:53:05.941054007 +0200
@@ -133,7 +133,6 @@ spec:
   selector:
     app: workspaces-backend
     app.kubernetes.io/component: api
-    app.kubernetes.io/managed-by: kustomize
     app.kubernetes.io/name: workspaces-backend
     app.kubernetes.io/part-of: kubeflow-workspaces
   type: ClusterIP
@@ -155,7 +154,6 @@ spec:
     matchLabels:
       app: workspaces-backend
       app.kubernetes.io/component: api
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-backend
       app.kubernetes.io/part-of: kubeflow-workspaces
   strategy:
@@ -291,7 +289,6 @@ spec:
     matchLabels:
       app: workspaces-backend
       app.kubernetes.io/component: api
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-backend
       app.kubernetes.io/part-of: kubeflow-workspaces
   policyTypes:
@@ -318,6 +315,5 @@ spec:
     matchLabels:
       app: workspaces-backend
       app.kubernetes.io/component: api
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-backend
       app.kubernetes.io/part-of: kubeflow-workspaces
Controller
diff -up manifests-diff/controller/*
--- /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/controller/base	2026-06-04 16:53:05.760053262 +0200
+++ /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/controller/head	2026-06-04 16:53:06.174054965 +0200
@@ -5457,7 +5457,6 @@ spec:
   selector:
     app: workspaces-controller
     app.kubernetes.io/component: controller-manager
-    app.kubernetes.io/managed-by: kustomize
     app.kubernetes.io/name: workspaces-controller
     app.kubernetes.io/part-of: kubeflow-workspaces
 ---
@@ -5478,7 +5477,6 @@ spec:
     matchLabels:
       app: workspaces-controller
       app.kubernetes.io/component: controller-manager
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-controller
       app.kubernetes.io/part-of: kubeflow-workspaces
   template:
@@ -5635,7 +5633,6 @@ spec:
     matchLabels:
       app: workspaces-controller
       app.kubernetes.io/component: controller-manager
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-controller
       app.kubernetes.io/part-of: kubeflow-workspaces
   policyTypes:
Frontend
diff -up manifests-diff/frontend/*
--- /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/frontend/base	2026-06-04 16:53:05.819053505 +0200
+++ /home/chris/Documents/shared_projects/kion/kubeflow-notebooks/manifests-diff/frontend/head	2026-06-04 16:53:06.235055216 +0200
@@ -17,7 +17,6 @@ spec:
   selector:
     app: workspaces-frontend
     app.kubernetes.io/component: ui
-    app.kubernetes.io/managed-by: kustomize
     app.kubernetes.io/name: workspaces-frontend
     app.kubernetes.io/part-of: kubeflow-workspaces
   type: ClusterIP
@@ -39,7 +38,6 @@ spec:
     matchLabels:
       app: workspaces-frontend
       app.kubernetes.io/component: ui
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-frontend
       app.kubernetes.io/part-of: kubeflow-workspaces
   strategy:
@@ -170,7 +168,6 @@ spec:
     matchLabels:
       app: workspaces-frontend
       app.kubernetes.io/component: ui
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-frontend
       app.kubernetes.io/part-of: kubeflow-workspaces
   policyTypes:
@@ -197,6 +194,5 @@ spec:
     matchLabels:
       app: workspaces-frontend
       app.kubernetes.io/component: ui
-      app.kubernetes.io/managed-by: kustomize
       app.kubernetes.io/name: workspaces-frontend
       app.kubernetes.io/part-of: kubeflow-workspaces

Copy link
Copy Markdown
Contributor

@christian-heusel christian-heusel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @andyatmiami awesome to get a good starting point to be able to easily develop metrics! 🔥

Overall it seems to be working, I have left a review below 🤗

Comment on lines +31 to +37
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=prometheus-operator \
-n default \
--timeout=120s

kubectl wait --for=condition=established crd/prometheuses.monitoring.coreos.com --timeout=60s
kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there is a race-condition/fallthrough here:

Waiting for Prometheus to be ready...
error: no matching resources found
make[1]: *** [Makefile:55: setup-prometheus] Error 1
make[1]: Leaving directory '/home/chris/Documents/shared_projects/kion/kubeflow-notebooks/developing'
make: *** [Makefile:60: tilt-up] Error 2

echo "Installing Prometheus Operator ${PROMETHEUS_OPERATOR_VERSION}..."
# NOTE: must use 'kubectl create' not 'kubectl apply' — bundle.yaml is too large
# for client-side apply annotations.
curl -sL "${BUNDLE_URL}" | kubectl create -f - 2>/dev/null || \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can omit curl here, as kubectl create -f already supports URLs:

-f, --filename=[]      Filename, directory, or URL to files to use to create the resource


# ---- Step 2: Deploy Prometheus instance ----

kubectl create namespace "${PROMETHEUS_NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a short comment that details that the reason for this kubectl create [...] --dry-run=client -o yaml | kubectl apply -f - construct is that it doesn't fail if the namespace already exists 🤗

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also couldn't we just add the manifest of the namespace to developing/manifests/prometheus/prometheus.yaml to omit this statement completely? 🤔

Comment thread developing/Makefile
tilt-up: check-tilt setup-istio kustomize
ifeq ($(ENABLE_PROMETHEUS),true)
@$(MAKE) setup-prometheus
endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also document this new variable (and maybe the existance/purpose of the whole setup in DEVELOPMENT_GUIDE.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/backend area - related to backend components area/ci area - related to ci area/controller area - related to controller components area/frontend area - related to frontend components area/v2 area - version - kubeflow notebooks v2 size/L

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

2 participants