Merged
21 changes: 21 additions & 0 deletions .codex/skills/qa-cluster-bringup/SKILL.md
@@ -113,6 +113,27 @@ If `cluster doctor` reports admin/UI/ingest key mismatches, roll the API,
UI, ingest, and gateway deployments after patching `mcp-sentinel-secrets`
(see `CLAUDE.md` → API keys). Do not paper over a `Degraded` reading.

### Kind image architecture gotcha

Before manually refreshing API/UI/operator/gateway images in Kind, match the
image platform to the Kind node:

```bash
NODE_ARCH="$(docker version --format '{{.Server.Arch}}')"
IMAGE="registry.registry.svc.cluster.local:5000/<repo>:qa-$(date +%s)"
docker build --platform="linux/${NODE_ARCH}" -t "$IMAGE" -f "$DOCKERFILE" .
docker image inspect "$IMAGE" --format '{{.Os}}/{{.Architecture}}'
kind load docker-image "$IMAGE" --name mcp-runtime
kubectl set image -n "$NAMESPACE" deploy/"$DEPLOYMENT" "$CONTAINER=$IMAGE"
kubectl rollout status -n "$NAMESPACE" deploy/"$DEPLOYMENT" --timeout=180s
```

Do not treat a successful `kind load docker-image` as proof that the image is
usable. Containerd can load a `linux/amd64` image into a `linux/arm64` Kind
node, but the pod will not run correctly, often failing with an
`exec format error`. If a service image was built for the wrong architecture,
rebuild it for the node architecture, load that exact image reference, set the
deployment image to the same ref, and wait for the rollout.
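
The platform comparison can be scripted so a mismatch fails before the image is
loaded. A minimal sketch, assuming `IMAGE` is already built as in the block
above; the helper name `platform_matches` is hypothetical and not part of this
repository:

```bash
# Hypothetical helper: succeed only when the image platform (for example
# "linux/arm64") equals "linux/<node arch>".
platform_matches() {
  local image_platform="$1" node_arch="$2"
  [ "$image_platform" = "linux/${node_arch}" ]
}

# Usage against a live Docker daemon (assumes IMAGE is set and built):
#   NODE_ARCH="$(docker version --format '{{.Server.Arch}}')"
#   IMAGE_PLATFORM="$(docker image inspect "$IMAGE" --format '{{.Os}}/{{.Architecture}}')"
#   platform_matches "$IMAGE_PLATFORM" "$NODE_ARCH" \
#     || { echo "rebuild $IMAGE for linux/${NODE_ARCH}" >&2; exit 1; }
```

Running the check before `kind load docker-image` keeps a wrong-architecture
image out of the node's containerd store in the first place.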

## Step 6 — Expose the gateway

```bash
43 changes: 42 additions & 1 deletion .codex/skills/qa-e2e-ui/SKILL.md
@@ -1,6 +1,6 @@
---
name: qa-e2e-ui
description: Browser-first real-cluster Sentinel UI/dashboard QA - role-based navigation, auth flows, every tab, forms, filters, destructive actions, rendered data, network/API evidence, console evidence, responsive/accessibility checks, cleanup, static assets, and public-host defenses. Use when Codex is asked to QA UI changes, dashboard regressions, login/admin/tenant flows, browser-visible API behavior, or copyable MCP connect config. Complements qa-e2e-security with feature-correctness and browser interaction checks. Assumes qa-cluster-bringup has run.
description: Browser-first real-cluster Sentinel UI/dashboard QA - role-based navigation, auth flows, every tab, forms, filters, destructive actions, rendered data, network/API evidence, console evidence, responsive/accessibility checks, cleanup, static assets, and public-host defenses. Use when Codex is asked to QA UI changes, dashboard regressions, login/admin/tenant flows, browser-visible API behavior, copyable MCP connect config, or backend analytics/observability changes that must be validated through UI controls. Complements qa-e2e-security with feature-correctness and browser interaction checks. Assumes qa-cluster-bringup has run.
---

# QA - E2E UI (live cluster)
@@ -57,6 +57,23 @@ trap 'rm -rf "$QA_TMP"' EXIT
If the live cluster has unrelated user changes, work with them. Do not retire
or mutate non-temporary objects unless the user explicitly approves.

If you rebuild the API or UI image before a browser pass, build for the Kind
node's actual architecture, not the one you remember from the last machine:

```bash
NODE_ARCH="$(docker version --format '{{.Server.Arch}}')"
IMAGE="registry.registry.svc.cluster.local:5000/<repo>:ui-qa-$(date +%s)"
docker build --platform="linux/${NODE_ARCH}" -t "$IMAGE" -f "$DOCKERFILE" .
docker image inspect "$IMAGE" --format '{{.Os}}/{{.Architecture}}'
kind load docker-image "$IMAGE" --name mcp-runtime
kubectl set image -n mcp-sentinel deploy/"$DEPLOYMENT" "$CONTAINER=$IMAGE"
kubectl rollout status -n mcp-sentinel deploy/"$DEPLOYMENT" --timeout=180s
```

`kind load docker-image` can load the wrong architecture into containerd. A UI
QA pass is invalid if the pod is still running an old image or an image whose
architecture does not match the Kind node.
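
To confirm the pods are running the image you just built, compare the running
container images with `$IMAGE`. A minimal sketch; the helper name
`all_images_match` is hypothetical, and the `app=$DEPLOYMENT` pod label in the
usage comment is an assumption that may not match your manifests:

```bash
# Hypothetical helper: succeed only when the space-separated image list is
# non-empty and every entry equals the expected image reference.
all_images_match() {
  local expected="$1" images="$2" image
  [ -n "$images" ] || return 1
  for image in $images; do
    [ "$image" = "$expected" ] || return 1
  done
}

# Usage (assumed label selector; adjust to the deployment's real selector):
#   RUNNING="$(kubectl get pods -n mcp-sentinel -l "app=$DEPLOYMENT" \
#     -o jsonpath="{.items[*].spec.containers[?(@.name=='$CONTAINER')].image}")"
#   all_images_match "$IMAGE" "$RUNNING" \
#     || echo "pods are not all running $IMAGE; the QA pass is invalid" >&2
```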

## Step 2 - Choose audit mode

- **full-ui**: every visible capability and every role. Default for broad UI
@@ -73,10 +90,17 @@ Diff guidance:
| `services/ui/main.go` | Dashboard Load, Auth, UI->API Proxy, affected tabs |
| `services/ui/static/**` | Browser Matrix, Static Assets, Responsive/A11y |
| `services/api/internal/runtimeapi/**` | Catalog, Governance, Operations, API Contract |
| `services/api/internal/runtimeapi/*observability*`, `services/mcp-gateway/**`, `k8s/*prometheus*` | My Activity scoped observability, Prometheus and Grafana actions, scoped query evidence, tenant negative checks |
| `services/api/internal/platformstore/**` or auth code | Auth, API Keys, Role Gating |
| `config/ingress/**`, `k8s/**` UI ingress | Public-host Defense |
| `docs/**`, `website/**` only | Dashboard Load smoke, then report docs-only scope |

Before selecting checks, map every touched UI-visible change to a browser
workflow. If the diff affects data rendered by a tab but not `services/ui/**`
directly, still validate the owning tab and controls through the browser. Do
not report a backend-only pass for a change whose success or failure is visible
in the dashboard.
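
The mapping from touched paths to browser checks can be sketched as a small
lookup. This is illustrative only: the function name `checks_for_path` is
hypothetical, the patterns cover only the table rows shown above, and the
labels are abbreviated from that table.

```bash
# Hypothetical lookup: print the check area for one changed path.
# More specific patterns come first because `case` stops at the first match.
checks_for_path() {
  case "$1" in
    services/api/internal/runtimeapi/*observability*|services/mcp-gateway/*|k8s/*prometheus*)
      echo "My Activity scoped observability" ;;
    services/ui/main.go)
      echo "Dashboard Load, Auth, UI->API Proxy" ;;
    services/ui/static/*)
      echo "Browser Matrix, Static Assets, Responsive/A11y" ;;
    services/api/internal/runtimeapi/*)
      echo "Catalog, Governance, Operations, API Contract" ;;
    services/api/internal/platformstore/*)
      echo "Auth, API Keys, Role Gating" ;;
    docs/*|website/*)
      echo "Dashboard Load smoke" ;;
    *)
      echo "unmapped: review manually" ;;
  esac
}

# Usage (the base branch name "main" is an assumption):
#   git diff --name-only main...HEAD | while read -r path; do
#     printf '%s -> %s\n' "$path" "$(checks_for_path "$path")"
#   done
```

A backend-only path such as `k8s/11-prometheus.yaml` still maps to a browser
workflow here, which is the point of the rule above.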

## Step 3 - Browser instrumentation is required

Prefer MCP browser tools in Codex sessions:
@@ -142,6 +166,23 @@ Create or mutate only temporary `qa-audit-*` objects for destructive actions.
Record every skipped page/control with the reason and the fixture or cluster
mode needed to cover it later.

For changed UI-visible behavior, include at least one positive and one negative
browser/API assertion for the exact changed control. Examples:

- Metrics/observability: tenant-owned `qa-audit-*` server shows per-server
`Prometheus` and `Grafana` actions by default. The clicked URL or fetched
link is scoped to its namespace/server, and a shared or foreign namespace
returns 403/404 for the same tenant session.
- Tenant observability buttons: tenant users should not see the raw header
`Prometheus` or `Grafana` links; those are admin-only cluster-wide surfaces.
Tenant users should see per-server `Prometheus` and `Grafana` actions in My
Activity when scoped observability is available.
- Analytics: changing the server selector triggers a network request with the
selected namespace/server filters and excludes shared catalog servers from
personal totals.
- Role-gated actions: the allowed role sees and can run the action; signed-out
or disallowed roles either cannot see it or receive an intentional auth error
when they attempt it.
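
A paired positive and negative API assertion might look like the sketch below.
The helper name `status_in` and the variables `BASE_URL`, `TENANT_API_KEY`,
`OWN_NS`, `OWN_SERVER`, `FOREIGN_NS`, and `FOREIGN_SERVER` are hypothetical, and
the `x-api-key` header on the API is an assumption; reuse whatever credential
your browser session actually sends.

```bash
# Hypothetical helper: succeed when the actual HTTP status appears in a
# slash-separated expected set such as "403/404".
status_in() {
  local actual="$1" expected="$2"
  case "/${expected}/" in
    */"${actual}"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with curl (all variables are placeholders for your QA fixture):
#   links="$BASE_URL/api/runtime/observability/links"
#   own="$(curl -s -o /dev/null -w '%{http_code}' -H "x-api-key: $TENANT_API_KEY" \
#     "$links?namespace=$OWN_NS&server=$OWN_SERVER")"
#   foreign="$(curl -s -o /dev/null -w '%{http_code}' -H "x-api-key: $TENANT_API_KEY" \
#     "$links?namespace=$FOREIGN_NS&server=$FOREIGN_SERVER")"
#   status_in "$own" "200"         || echo "positive check failed: $own" >&2
#   status_in "$foreign" "403/404" || echo "negative check failed: $foreign" >&2
```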

## Step 6 - Curl smoke and API contract evidence

Use curl to support browser findings, not replace them.
18 changes: 17 additions & 1 deletion .codex/skills/qa-e2e-ui/references/ui-coverage.md
@@ -42,6 +42,7 @@ Admin (`admin@mcpruntime.org` / `admin@123`):
- Hidden tab fallback works after logout.
- Auto-refresh starts only for admin dashboard and stops on logout.
- Grafana and Prometheus header links appear only for admin in dev path mode.
- Normal tenant users never see the raw header Grafana/Prometheus links.

API-key login:

@@ -114,13 +115,28 @@ Tenant user only:

- Metrics: My Servers, Ready, Requests, Denied, Deny Rate.
- Deployed Servers table: identity, namespace, status, inventory, endpoint,
Analytics action, Copy URL, tenant Retire.
Analytics action, Prometheus action, Grafana action, Copy URL, tenant Retire.
- Usage controls: All servers, per-server selector, 24h/7d/30d/90d, Refresh.
- Usage tables: per-server usage, top tools, recent activity.
- Empty state: no personal servers.
- Shared catalog servers are excluded from personal count.
- Analytics empty response and API error render clear states.
- If a selected server disappears, selector and tables recover cleanly.
- Scoped observability: clicking Prometheus for a temporary tenant server opens
an allowlisted `/api/runtime/observability/prometheus/query` URL with
`namespace` and `server` filters for that exact server.
- Scoped observability links: `/api/runtime/observability/links` returns only
queries for the authorized tenant server and does not expose arbitrary PromQL.
- Grafana tenant action: clicking Grafana opens the default scoped platform
Grafana dashboard endpoint for that namespace/server. It must not open the raw
cluster-wide `/grafana` UI for tenant users.
- Tenant negative checks: the same user receives an intentional 403/404 when
requesting observability links or Prometheus queries for a shared or foreign
namespace/server.
- Backend metrics regressions: when the diff touches gateway metrics,
Prometheus scrape config, or runtime observability APIs, the browser matrix
must include the Metrics button plus network evidence from the scoped
Prometheus query endpoint, even when no UI file changed.
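
When recording network evidence, the captured Prometheus URL can be checked
mechanically. A minimal sketch; the helper name `is_scoped_prometheus_url` is
hypothetical, and it assumes namespace and server names are plain
Kubernetes-style names that need no URL decoding:

```bash
# Hypothetical helper: succeed only when the URL targets the scoped query
# endpoint and carries namespace/server filters for the expected server.
is_scoped_prometheus_url() {
  local url="$1" namespace="$2" server="$3" query
  case "$url" in
    */api/runtime/observability/prometheus/query\?*) ;;
    *) return 1 ;;
  esac
  # Wrap the query string in "&" so each parameter can be matched whole.
  query="&${url#*\?}&"
  case "$query" in
    *"&namespace=${namespace}&"*) ;;
    *) return 1 ;;
  esac
  case "$query" in
    *"&server=${server}&"*) ;;
    *) return 1 ;;
  esac
}

# Usage (URL copied from the browser network panel):
#   is_scoped_prometheus_url "$CLICKED_URL" "$QA_NAMESPACE" "$QA_SERVER" \
#     || echo "Prometheus action is not scoped to $QA_NAMESPACE/$QA_SERVER" >&2
```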

## API Keys

1 change: 1 addition & 0 deletions .gitignore
@@ -27,6 +27,7 @@ r.go
tags

# Local/generated artifacts
/.playwright-mcp/
/mcp-sentinel/
/services/mcp-server/mcp-sentinel-mcp-server
/services/ui/mcp-sentinel-ui
4 changes: 3 additions & 1 deletion docs/cluster-readiness.md
@@ -245,7 +245,9 @@ When `MCP_PLATFORM_DOMAIN=example.com` is set, setup derives these public names:

- `registry.example.com` for registry ingress.
- `mcp.example.com` for MCP server traffic.
- `platform.example.com` for the dashboard, API, and Grafana paths. Prometheus remains an internal metrics backend.
- `platform.example.com` for the dashboard, API, and admin-gated Grafana paths.
Prometheus remains an internal metrics backend; tenant users reach it only
through server-scoped API queries.

All configured public names must resolve to the cluster ingress address before
certificate issuance. For Let's Encrypt HTTP-01, port 80 must reach the ingress
6 changes: 6 additions & 0 deletions docs/getting-started.md
@@ -182,6 +182,12 @@ When building setup images from a machine with a different CPU architecture
than the cluster, set `MCP_IMAGE_PLATFORM` to the target node platform, for
example `MCP_IMAGE_PLATFORM=linux/amd64` for standard VPS/k3s nodes.

Tenant users open server-scoped Prometheus and Grafana views from Activity
server rows. The platform API verifies access to the exact `MCPServer` and
expands only allowlisted Prometheus queries; raw Prometheus remains internal.
The bundled Prometheus discovers operator-managed MCPServer Services and
scrapes metrics from their gateway sidecars.

You can also skip the saved provision step and pass
`--external-registry-url registry.example.com` directly to `setup`.

2 changes: 2 additions & 0 deletions docs/internals/go-package-reference.md
@@ -2451,6 +2451,8 @@ const (
	DefaultPort = 8088
	// DefaultGatewayPort is the default container port for the MCP proxy sidecar.
	DefaultGatewayPort = 8091
	// DefaultGatewayMetricsPort is the default Prometheus scrape port for the MCP gateway sidecar.
	DefaultGatewayMetricsPort = 9103
	// DefaultServicePort is the default service port.
	DefaultServicePort = 80
)
3 changes: 3 additions & 0 deletions docs/security/authz-matrix.md
@@ -63,6 +63,9 @@ Expected codes:
| `/api/user/api-keys` | GET, POST | 401 | 200 | 200 | 200 | 401/403 | Lifecycle for user-owned keys. |
| `/api/user/api-keys/{id}` | GET, DEL | 401 | 200 | 200 | 200 | 401/403 | |
| `/api/runtime/servers` | GET, POST | 401 | 200 | 200 | 200 | 401/403 | List/create MCP servers. |
| `/api/runtime/observability/links` | GET | 401 | 200/403 | 200/403 | 200 | 401/403 | Normal users are limited to team namespaces or caller-owned catalog servers. |
| `/api/runtime/observability/grafana/dashboard` | GET | 401 | 200/403 | 200/403 | 200 | 401/403 | Renders a server-scoped dashboard through the API. |
| `/api/runtime/observability/prometheus/query` | GET | 401 | 200/403 | 200/403 | 200 | 401/403 | PromQL is allowlisted and server-scoped by the API. |
| `/api/runtime/teams` | GET | 401 | 200 | 200 | 200 | 401/403 | |
| `/api/runtime/teams` | POST | 401 | 403 | 403 | 200 | 401/403 | Admin-only team + namespace provisioning. |
| `/api/runtime/teams/{id}` | GET | 401 | 200 | 200 | 200 | 401/403 | Team members can read only their teams; admins can read all teams. |
33 changes: 31 additions & 2 deletions docs/sentinel.md
@@ -9,7 +9,7 @@
| **mcp-gateway** | Transparent sidecar. Extracts identity, evaluates tool-level policy, emits allow/deny audit events, forwards traffic upstream. |
| **ingest** | Receives `POST /events`, validates ingest-scoped API keys or optional JWTs, writes to Kafka. |
| **processor** | Consumes Kafka, batches, writes into ClickHouse with indexed audit fields. |
| **api** | Analytics endpoints, dashboard summaries, user/team-scoped analytics, runtime governance APIs (grants/sessions), platform audit, MCP server catalog, component operations. |
| **api** | Analytics endpoints, dashboard summaries, user/team-scoped analytics, scoped observability links and Prometheus queries, runtime governance APIs (grants/sessions), platform audit, MCP server catalog, component operations. |
| **ui** | Control-plane dashboard: user MCP server dashboard, MCP server catalog and connect config, user API keys, analytics dashboard, governance, MCP operations, and platform management. |
| **gateway** | Kubernetes deployment fronting the sentinel API, ingest, and UI surfaces. |
| **workspace assistant sample** | Sample MCP server in `examples/workspace-assistant-mcp` for end-to-end smoke tests. |
@@ -106,6 +106,32 @@ For local `setup --test-mode` clusters, setup seeds two email/password logins:
| **Prometheus** | Not exposed | `prometheus:9090` | Internal metrics backend and Grafana datasource. Use a temporary `kubectl port-forward` only for backend debugging. |
| **MCP gateway sidecar** | per-server route, for example `/workspace-assistant-mcp/mcp` | pod-local sidecar port | Enforces policy and forwards to the MCP server container. |

### Scoped user observability

The Activity server list exposes Prometheus and Grafana actions only for an
`MCPServer` the authenticated principal can observe. The API checks the live
server before returning links or querying Prometheus, and normal users are
limited to their team namespaces or explicitly caller-owned catalog servers.

Prometheus requests use
`/api/runtime/observability/prometheus/query?namespace=<namespace>&server=<server>&query_id=<id>`.
The `query_id` is allowlisted (`up`, `request_rate`, `deny_rate`,
`latency_p95`); arbitrary PromQL is never accepted. `PROMETHEUS_API_URL`
defaults to `http://prometheus:9090/prometheus`.
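
A client-side sketch of building that request path is shown below. The API
remains the authority on the allowlist; the helper names `is_allowed_query_id`
and `prometheus_query_path` are hypothetical and only mirror the four
`query_id` values listed above.

```bash
# Mirror of the documented allowlist; the API enforces the real one.
is_allowed_query_id() {
  case "$1" in
    up|request_rate|deny_rate|latency_p95) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the scoped query path, or fail for an unknown
# query_id so arbitrary PromQL is never sent.
prometheus_query_path() {
  local namespace="$1" server="$2" query_id="$3"
  is_allowed_query_id "$query_id" || return 1
  printf '/api/runtime/observability/prometheus/query?namespace=%s&server=%s&query_id=%s\n' \
    "$namespace" "$server" "$query_id"
}

# Usage (BASE_URL and the auth header are placeholders for your environment):
#   path="$(prometheus_query_path team-a qa-audit-1 request_rate)" \
#     && curl -s -H "x-api-key: $TENANT_API_KEY" "$BASE_URL$path"
```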

Without an external Grafana dashboard template, the API renders a scoped
dashboard from the same allowlisted queries. Set `GRAFANA_SERVER_DASHBOARD_URL`
to a template containing `{namespace}` and `{server}` only when that Grafana
deployment enforces tenant-aware access. Normal-user external links also
require `GRAFANA_SCOPED_USER_ACCESS=true`.
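
To illustrate what the `{namespace}` and `{server}` placeholders mean, the
sketch below expands a template by plain text substitution. This is an
assumption about the placeholder semantics, not the API's actual
implementation; the helper name `expand_dashboard_url` and the example Grafana
URL are hypothetical.

```bash
# Hypothetical helper: substitute {namespace} and {server} in a dashboard
# URL template. Assumes both values are plain names without "/" or "&".
expand_dashboard_url() {
  local template="$1" namespace="$2" server="$3"
  printf '%s\n' "$template" \
    | sed -e "s/{namespace}/${namespace}/g" -e "s/{server}/${server}/g"
}

# Usage (the template value is an example, not a shipped default):
#   GRAFANA_SERVER_DASHBOARD_URL='https://grafana.example.com/d/mcp?var-namespace={namespace}&var-server={server}'
#   expand_dashboard_url "$GRAFANA_SERVER_DASHBOARD_URL" team-a qa-audit-1
```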

The bundled Prometheus uses read-only Kubernetes discovery for annotated
MCPServer Services and scrapes the gateway sidecar `/metrics` endpoint.
Gateway metrics include request totals, policy decisions, latency, request and
response bytes, in-flight requests, and policy reload state. HTTP and MCP method
labels are normalized to bounded sets to prevent attacker-controlled label
cardinality.
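
The idea behind bounded label sets can be sketched as follows. The gateway's
real normalization lives in its own code; the function name
`normalize_http_method`, the accepted method list, and the `other` fallback
value here are assumptions chosen for illustration.

```bash
# Illustrative only: map any client-supplied HTTP method onto a fixed set of
# label values so a caller cannot create unbounded metric series.
normalize_http_method() {
  case "$1" in
    GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS) printf '%s\n' "$1" ;;
    *) printf 'other\n' ;;
  esac
}

# Usage:
#   normalize_http_method "POST"
#   normalize_http_method "X-ATTACKER-$(date +%s)"
```

Whatever the attacker sends, the label can take at most eight distinct values
in this sketch.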

### Auth model

| Service | Auth behavior |
@@ -114,7 +140,7 @@ For local `setup --test-mode` clusters, setup seeds two email/password logins:
| **ui** | `/auth/login` creates an HttpOnly UI session from `api_key`, `id_token`, or `email`/`password`. The UI then proxies `/api/*` with an upstream API key or bearer token. `/auth/admin-check` accepts admin UI sessions or keys from `ADMIN_API_KEYS`; it falls back to `API_KEYS` only when the explicit legacy dev/test fallback is enabled. |
| **ingest** | `/live`, `/ready`, and `/health` are open. `/events` accepts `x-api-key` from `INGEST_API_KEYS`, legacy `API_KEYS`, or a configured OIDC bearer token. If no API keys and no JWKS are configured, intake auth is bypassed. |
| **processor** | No data API. It exposes metrics and a simple health check on the metrics port. |
| **mcp-gateway** | No admin API. It authenticates MCP requests according to the rendered server policy: header identity or OAuth bearer tokens, depending on `spec.auth`. |
| **mcp-gateway** | No admin API. It authenticates MCP requests according to the rendered server policy and exposes Prometheus metrics at `/metrics`. |

### API service

@@ -140,6 +166,9 @@ metrics on `METRICS_PORT` (default `9090`).
| `GET`, `POST` | `/api/runtime/servers` | List or apply `MCPServer` resources through runtime authz scope. `tenant` mode defaults signed-in users to team namespaces they belong to; `org` mode includes the org catalog plus team namespaces; `public` mode allows anonymous catalog reads and signed-in publishes in the public catalog plus team namespaces. Responses include `publish_policy` for active-server quota/cooldown visibility. |
| `DELETE` | `/api/runtime/servers/{namespace}/{name}` | Retire an owned MCPServer. Retiring deletes the MCPServer from Kubernetes and frees one active-server quota slot. |
| `GET` | `/api/runtime/server-events?namespace=&server=` | Recent analytics events for one administered MCPServer; full identity/session/payload details are not exposed to regular namespace readers. |
| `GET` | `/api/runtime/observability/links?namespace=&server=` | Return scoped observability actions for one authorized MCPServer. |
| `GET` | `/api/runtime/observability/grafana/dashboard?namespace=&server=` | Render an API-scoped dashboard backed by allowlisted Prometheus queries. |
| `GET` | `/api/runtime/observability/prometheus/query?namespace=&server=&query_id=` | Run one allowlisted, server-scoped Prometheus query. |
| `GET`, `POST` | `/api/runtime/grants` | List or apply `MCPAccessGrant` resources. Lists are scoped to administered servers; apply requires admin, server owner, or team owner. |
| `GET`, `DELETE` | `/api/runtime/grants/{namespace}/{name}` | Read or delete one grant for an administered server. |
| `PATCH` | `/api/runtime/grants/{namespace}/{name}` | Set `spec.disabled` with `{"disabled":true|false}`; requires admin, server owner, or team owner. |
23 changes: 23 additions & 0 deletions internal/cli/setup/platform/helpers_test.go
@@ -2224,6 +2224,29 @@ func TestPrometheusScrapesClickHouseMetricsPort(t *testing.T) {
}
}

func TestPrometheusDiscoversGatewaySidecarMetrics(t *testing.T) {
	content, err := os.ReadFile("../../../../k8s/11-prometheus.yaml")
	if err != nil {
		t.Fatalf("failed to read prometheus manifest: %v", err)
	}
	text := string(content)
	for _, want := range []string{
		"kind: ServiceAccount",
		"name: prometheus",
		"name: mcp-sentinel-prometheus-discovery",
		"resources: [\"endpoints\", \"pods\", \"services\"]",
		"job_name: mcp-gateway-sidecars",
		"role: endpoints",
		"__meta_kubernetes_service_annotation_prometheus_io_scrape",
		"__meta_kubernetes_service_label_app_kubernetes_io_managed_by",
		"__meta_kubernetes_service_annotation_prometheus_io_port",
	} {
		if !strings.Contains(text, want) {
			t.Fatalf("expected Prometheus manifest to contain %q, got:\n%s", want, text)
		}
	}
}

func TestClickHouseExposesPrometheusMetrics(t *testing.T) {
for _, path := range []string{"../../../../k8s/03-clickhouse.yaml", "../../../../k8s/03-clickhouse-hostpath.yaml"} {
content, err := os.ReadFile(path)
2 changes: 2 additions & 0 deletions internal/operator/constants.go
@@ -21,6 +21,8 @@ const (
	DefaultPort = 8088
	// DefaultGatewayPort is the default container port for the MCP proxy sidecar.
	DefaultGatewayPort = 8091
	// DefaultGatewayMetricsPort is the default Prometheus scrape port for the MCP gateway sidecar.
	DefaultGatewayMetricsPort = 9103
	// DefaultServicePort is the default service port.
	DefaultServicePort = 80
)