
feat: Metrics for cardano nodes#25

Merged
gonzalezzfelipe merged 6 commits into main from feat/metrics-for-cnodes
Apr 16, 2026

Conversation

@gonzalezzfelipe (Contributor) commented Apr 15, 2026


Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced the dashboard workload details page with expanded metrics sections (Node, Mempool, Connections, Block Propagation, Resources, Producer), formatted values, and role-specific display
    • Added information tooltips throughout the dashboard UI
    • Implemented metrics collection from containerized pods with comprehensive formatting
  • Infrastructure

    • Updated Apex Fusion metrics port and Helm configuration
    • Added metrics script execution capability within pod containers
    • Improved dashboard query management for workload data

@gonzalezzfelipe gonzalezzfelipe requested a review from mduthey April 15, 2026 14:06
@coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

This PR introduces metrics collection from Cardano node pods into the dashboard and updates the Helm charts to support this capability. It adds new Kubernetes exec utilities, configures metrics scripts in ConfigMaps for both cardano-node and apex-fusion charts, refactors the dashboard workload details page to use TanStack Query and display rich metrics, and adds comprehensive metric formatting helpers.

Changes

  • Apex Fusion Helm Chart — extensions/apex-fusion/values.yaml, extensions/apex-fusion/templates/_helpers.tpl, extensions/apex-fusion/templates/configmap-metrics.yaml, extensions/apex-fusion/templates/statefulset.yaml
    Updated the metrics port from 12798 to 12789, added a metricsConfigMapName helper, and created a ConfigMap containing a Bash script that fetches Prometheus metrics from the node, parses genesis parameters, and outputs a JSON metrics object with error tracking. Updated the StatefulSet to mount the metrics script and set the METIS_SHELLEY_GENESIS_PATH environment variable.

  • Cardano Node Helm Chart — extensions/cardano-node/templates/_helpers.tpl, extensions/cardano-node/templates/configmap-metrics.yaml, extensions/cardano-node/templates/statefulset.yaml
    Added a metricsConfigMapName helper and created a ConfigMap with a Bash metrics-collection script that extracts Prometheus metrics, parses genesis data, computes derived fields (density, epoch progress, KES expiration, sync percent), and outputs JSON with error handling. Updated the StatefulSet to mount the metrics script and configure the genesis-path environment variable.

  • Dashboard Type Definitions — frontends/dashboard/@types/workload.d.ts
    Extended SimplifiedPod with optional chartName and metrics properties. Added new type aliases and a CardanoNodeMetrics interface containing the node role, 40+ metric fields (all nullable), and an errors array.

  • Dashboard UI Components — frontends/dashboard/src/components/ui/InfoTooltip.tsx
    Added a new InfoTooltip React component that renders an info icon with a toggleable tooltip, supporting keyboard interaction (Escape to close) and accessibility attributes (aria-label, aria-describedby, role="tooltip").

  • Dashboard Route & Query — frontends/dashboard/src/routes/$namespace/$name/index.tsx
    Refactored to use TanStack Query with a 30s refetch interval, replaced the metrics display with a multi-section layout ("Node", "Mempool", "Connections", "Block propagation", "Resources", "Producer"), integrated InfoCard with optional description tooltips, moved log streaming from callback-based to effect-based per pod identity, and added metric-value formatting.

  • Dashboard Utilities — frontends/dashboard/src/utils/helm.ts, frontends/dashboard/src/utils/home/calls.ts, frontends/dashboard/src/utils/k8s.ts, frontends/dashboard/src/utils/metricsFormat.ts
    Added getMergedHelmValues and getNodeRoleFromHelmRelease for Helm data extraction; added execPodCommand to execute commands in pod containers via Kubernetes Exec with stdout/stderr capture and a 15s timeout; added the metrics-collection helpers supportsCardanoNodeMetrics and getCardanoNodeMetricsForPod, which execute /opt/metis/bin/metrics.sh inside pods; created a metrics-formatting module with 17 helpers for numeric, temporal, percentage, and role-based formatting.

  • GitHub Actions & Configuration — .github/workflows/check_extensions.yml, extensions/dolos/ci/values-default.yaml, frontends/dashboard/package.json
    Updated the workflow to add the HashiCorp Helm repo and supply an additional values file to the Dolos chart test, expanded the kubeconform skip list to include VaultStaticSecret, modified the test script to pass when no tests are found, and added upstream relay configuration to the Dolos values.
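The cardano-node chart's metrics script derives fields such as density, epoch progress, KES expiration, and sync percent in Bash. The sketch below translates the commonly used formulas into TypeScript for illustration only; the function names and exact formulas are assumptions, and the authoritative versions are the ones in the ConfigMap script.

```typescript
// Hedged sketch of the usual derived-metric arithmetic; not the chart's code.

// Chain density: blocks observed per slot elapsed.
export function chainDensity(blocks: number, slots: number): number {
  return slots > 0 ? blocks / slots : 0;
}

// Fraction of the current epoch elapsed, from the slot within the epoch and
// the epoch length taken from the Shelley genesis file.
export function epochProgress(slotInEpoch: number, epochLength: number): number {
  if (epochLength <= 0) return 0;
  return Math.min(1, slotInEpoch / epochLength);
}

// Rough sync percentage: current slot versus the slot expected from
// wall-clock time since the genesis system start (slotLength in seconds).
export function syncPercent(
  currentSlot: number,
  systemStartMs: number,
  nowMs: number,
  slotLengthSec: number,
): number {
  const expectedSlot = Math.floor((nowMs - systemStartMs) / (slotLengthSec * 1000));
  if (expectedSlot <= 0) return 100;
  return Math.min(100, (currentSlot / expectedSlot) * 100);
}

// Approximate KES key expiration timestamp: remaining KES periods converted
// to milliseconds and added to "now".
export function kesExpirationMs(
  currentKesPeriod: number,
  maxKesEvolutions: number,
  startKesPeriod: number,
  slotsPerKesPeriod: number,
  slotLengthSec: number,
  nowMs: number,
): number {
  const remaining = startKesPeriod + maxKesEvolutions - currentKesPeriod;
  return nowMs + remaining * slotsPerKesPeriod * slotLengthSec * 1000;
}
```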

Sequence Diagram(s)

sequenceDiagram
    participant Dashboard as Dashboard Component
    participant QueryClient as TanStack Query
    participant KubeAPI as Kubernetes API
    participant PodExec as Pod Exec Stream
    participant MetricsScript as metrics.sh Script
    participant Prometheus as Prometheus Endpoint

    Dashboard->>QueryClient: ensureQueryData(workloadDetailsQueryOptions)
    QueryClient->>KubeAPI: getServerWorkloadPods(namespace, name)
    KubeAPI->>KubeAPI: fetch pods & helm release
    
    KubeAPI->>Dashboard: return pod list + helm metadata
    Dashboard->>Dashboard: getNodeRoleFromHelmRelease(helmRelease)
    
    loop For each pod (parallel)
        Dashboard->>KubeAPI: execPodCommand(pod, 'metrics.sh')
        KubeAPI->>PodExec: establish exec websocket
        PodExec->>MetricsScript: execute /opt/metis/bin/metrics.sh
        MetricsScript->>Prometheus: curl http://127.0.0.1:metricsPort/metrics
        Prometheus-->>MetricsScript: Prometheus metrics text
        MetricsScript->>MetricsScript: parse metrics, read genesis, compute derived
        MetricsScript-->>PodExec: output JSON metrics object
        PodExec-->>KubeAPI: stdout/stderr streams close
        KubeAPI-->>Dashboard: { stdout: JSON metrics, stderr: errors }
    end
    
    Dashboard->>Dashboard: render metrics sections with formatMetricValue helpers
    Dashboard-->>Dashboard: display in InfoCard components with tooltips
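The final formatting step in the diagram relies on helpers from metricsFormat.ts. The names below (formatPercent, formatBytes, formatDuration) are illustrative stand-ins for that module's API, not its actual exports:

```typescript
// Hedged sketch of dashboard metric formatters; names are hypothetical.

export function formatPercent(value: number | null): string {
  if (value === null || !Number.isFinite(value)) return '-';
  return `${value.toFixed(1)}%`;
}

export function formatBytes(bytes: number | null): string {
  if (bytes === null || !Number.isFinite(bytes)) return '-';
  const units = ['B', 'KiB', 'MiB', 'GiB', 'TiB'];
  let i = 0;
  let v = bytes;
  while (v >= 1024 && i < units.length - 1) {
    v /= 1024;
    i++;
  }
  return `${v.toFixed(1)} ${units[i]}`;
}

export function formatDuration(seconds: number | null): string {
  if (seconds === null || !Number.isFinite(seconds)) return '-';
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  return h > 0 ? `${h}h ${m}m` : m > 0 ? `${m}m ${s}s` : `${s}s`;
}
```

Each helper renders `-` for null or non-finite inputs, matching the all-nullable CardanoNodeMetrics fields described above.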

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • scarmuega

Poem

🐰 With metrics fresh from pods so keen,
The dashboard now shows what they glean—
Through scripts and shells, data flows fast,
From genesis times to metrics so vast!
A hop, skip, and merge brings it all to light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'feat: Metrics for cardano nodes' directly and accurately summarizes the main objective of this pull request, which adds comprehensive metrics collection and display capabilities for Cardano nodes across the dashboard and Helm charts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
frontends/dashboard/src/utils/metrics.ts (1)

140-149: Consider parallelizing metric variant queries for performance.

getFirstMatchingMetricValue queries metric variants sequentially. Since these are independent lookups for the same conceptual metric (just different naming conventions), you could query them in parallel and return the first non-null result.

♻️ Parallel query approach
 async function getFirstMatchingMetricValue(queries: string[]): Promise<number | null> {
-  for (const query of queries) {
-    const value = await getInstantMetricValue(query);
-    if (value !== null) {
-      return value;
-    }
-  }
-
-  return null;
+  const results = await Promise.all(queries.map(getInstantMetricValue));
+  return results.find(v => v !== null) ?? null;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontends/dashboard/src/utils/metrics.ts` around lines 140 - 149,
getFirstMatchingMetricValue currently awaits getInstantMetricValue sequentially;
change it to start all getInstantMetricValue calls concurrently (map queries ->
promises) and then await their results (e.g., Promise.allSettled or Promise.all)
and return the first non-null fulfilled value found, otherwise return null.
Update getFirstMatchingMetricValue to fire promises for each query, handle
failures/settled results safely, and preserve the Promise<number | null> return
type; reference getFirstMatchingMetricValue and getInstantMetricValue when
making the change.
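The Promise.allSettled variant the prompt describes can be sketched as below: all metric-name variants are queried concurrently, a single failing query cannot reject the batch, and the original priority order still decides which non-null value wins. The fetcher is injected so the example stays self-contained; the real getInstantMetricValue (which queries Prometheus) is not reproduced here.

```typescript
type MetricFetcher = (query: string) => Promise<number | null>;

export async function getFirstMatchingMetricValue(
  queries: string[],
  fetchValue: MetricFetcher,
): Promise<number | null> {
  // Fire all variant queries concurrently; allSettled tolerates rejections.
  const results = await Promise.allSettled(queries.map((q) => fetchValue(q)));
  // Preserve priority: first fulfilled, non-null value in query order wins.
  for (const result of results) {
    if (result.status === 'fulfilled' && result.value !== null) {
      return result.value;
    }
  }
  return null;
}
```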
frontends/dashboard/src/components/ui/InfoTooltip.tsx (1)

37-39: Potential interaction conflict between click and focus/blur handlers.

When clicking an already-focused button, onClick toggles the state, but the subsequent blur/focus cycle may immediately override it. On touch devices, onBlur firing after a tap can close the tooltip unexpectedly. Consider simplifying to use only hover/focus for desktop and click-to-toggle for touch, or use a single interaction model.

♻️ Suggested simplification using pointer events
       <button
         ref={buttonRef}
         type="button"
         className="inline-flex cursor-help items-center text-[#969FAB] outline-none transition-colors hover:text-[#64748B] focus-visible:text-[#64748B]"
         aria-label="Show metric description"
         aria-describedby={open ? tooltipId : undefined}
         onFocus={() => setOpen(true)}
-        onBlur={() => setOpen(false)}
-        onClick={() => setOpen(prev => !prev)}
+        onBlur={(e) => {
+          // Don't close if clicking within the tooltip trigger
+          if (!e.currentTarget.contains(e.relatedTarget as Node)) {
+            setOpen(false);
+          }
+        }}
         onKeyDown={handleKeyDown}
       >
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontends/dashboard/src/components/ui/InfoTooltip.tsx` around lines 37 - 39,
The handlers conflict because onClick toggles while onFocus/onBlur also set open
and blur after a tap can immediately close the tooltip; replace the click
handler with a unified pointer-based handler: remove onClick and add an
onPointerDown/onPointerUp handler on the InfoTooltip trigger that inspects
event.pointerType (touch vs mouse) and for pointerType "touch" toggles the
tooltip via setOpen(prev=>!prev) while keeping onFocus/onBlur to open/close for
keyboard/hover; ensure the pointer handler does not trigger a subsequent blur
override (use event.preventDefault()/event.stopPropagation() as needed) so
setOpen is the single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extensions/apex-fusion/values.yaml`:
- Line 44: Revert the metrics port value for the apex-fusion extension by
changing the metricsPort setting (metricsPort: 12789) back to the Cardano
standard 12798 so it matches the cardano-node extension and allows Prometheus
scraping to work as expected; update the metricsPort entry in the apex-fusion
values.yaml to 12798.

In `@frontends/dashboard/src/routes/$namespace/$name/index.tsx`:
- Around line 274-289: The type error arises because activePod.name (string |
undefined) isn't narrowed inside the async streamLogs closure; capture the
validated values before defining/awaiting streamLogs by assigning const podName
= activePod.name, const namespace = activePodNamespace, and const containerName
= activePodContainerName after the initial if-check, then use those captured
locals in the async function and when calling streamWorkloadPodLogs so
TypeScript sees them as definite strings; update references to
activePod.name/activePodNamespace/activePodContainerName inside streamLogs to
use podName/namespace/containerName.
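The narrowing issue described above can be shown in miniature; the names below are illustrative, not the route's actual code. A check on a mutable property does not carry into an async closure (the property could change before the closure runs), while a captured const does:

```typescript
interface Pod {
  name?: string;
}

export function startLogStream(
  pod: Pod,
  stream: (podName: string) => Promise<void>,
): Promise<void> | undefined {
  if (!pod.name) return undefined;

  // Capture after the guard; inside the closure, `pod.name` alone would be
  // widened back to `string | undefined` and fail to typecheck.
  const podName = pod.name;

  return (async () => {
    await stream(podName); // podName is a definite string here
  })();
}
```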


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 939826f5-5419-4386-8fab-62ac39415039

📥 Commits

Reviewing files that changed from the base of the PR and between 192dcaf and 9a6bfc2.

📒 Files selected for processing (8)
  • extensions/apex-fusion/values.yaml
  • frontends/dashboard/@types/workload.d.ts
  • frontends/dashboard/package.json
  • frontends/dashboard/src/components/ui/InfoTooltip.tsx
  • frontends/dashboard/src/routes/$namespace/$name/index.tsx
  • frontends/dashboard/src/utils/helm.ts
  • frontends/dashboard/src/utils/home/calls.ts
  • frontends/dashboard/src/utils/metrics.ts

Comment thread extensions/apex-fusion/values.yaml
Comment thread frontends/dashboard/src/routes/$namespace/$name/index.tsx

@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
frontends/dashboard/src/utils/k8s.ts (1)

48-49: Consider reusing KubeConfig/Exec instead of recreating per call.

On Line 48, loading config per execPodCommand call can become expensive when polling many pods. Prefer injecting or reusing a shared Exec/KubeConfig instance from existing client setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontends/dashboard/src/utils/k8s.ts` around lines 48 - 49, The code
currently calls loadK8sConfig() and new Exec(kc) inside execPodCommand for every
invocation (expensive when polling many pods); instead reuse a shared
KubeConfig/Exec by either accepting an injected Exec or KubeConfig parameter on
execPodCommand (e.g., execPodCommand(..., exec?: Exec | kc?: KubeConfig)) or
create a lazy module-level singleton that calls loadK8sConfig() and new Exec(kc)
once and reuses it; update execPodCommand to use the provided/shared Exec and
adjust callers to pass the shared instance when available.
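One way to realize this suggestion is a lazy module-level singleton. The helper below shows the pattern generically so it does not depend on `@kubernetes/client-node`; the commented usage names (getExec, loadK8sConfig) are hypothetical:

```typescript
// Lazy singleton: build an expensive object once on first use, reuse after.
export function lazySingleton<T>(create: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) {
      instance = create(); // first call pays the construction cost
    }
    return instance;
  };
}

// Hypothetical application in k8s.ts:
//   const getExec = lazySingleton(() => new Exec(loadK8sConfig()));
//   ...inside execPodCommand: const exec = getExec();
```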

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9625ff0b-8125-4be3-855b-ce4e46110952

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6bfc2 and 4980779.

📒 Files selected for processing (8)
  • .github/workflows/check_extensions.yml
  • extensions/dolos/ci/values-default.yaml
  • frontends/dashboard/@types/workload.d.ts
  • frontends/dashboard/src/routes/$namespace/$name/index.tsx
  • frontends/dashboard/src/utils/home/calls.ts
  • frontends/dashboard/src/utils/k8s.ts
  • frontends/dashboard/src/utils/metrics.ts
  • frontends/dashboard/src/utils/metricsFormat.ts
✅ Files skipped from review due to trivial changes (2)
  • extensions/dolos/ci/values-default.yaml
  • frontends/dashboard/@types/workload.d.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • frontends/dashboard/src/utils/home/calls.ts

Comment on lines +80 to +103
await new Promise<void>((resolve, reject) => {
  const timeout = setTimeout(() => {
    reject(new Error(`Timed out waiting for exec command in pod ${namespace}/${podName}`));
  }, 15000);

  const cleanup = () => {
    clearTimeout(timeout);
    ws.off('close', handleClose);
    ws.off('error', handleError);
  };

  const handleClose = () => {
    cleanup();
    resolve();
  };

  const handleError = (error: Error) => {
    cleanup();
    reject(error);
  };

  ws.on('close', handleClose);
  ws.on('error', handleError);
});


⚠️ Potential issue | 🟠 Major

Timeout path does not clean up websocket/session resources.

On Line 82, timeout rejects without invoking cleanup(), and the websocket is never closed. This can leave in-pod exec sessions running and listeners attached after timeout.

Proposed fix
-  await new Promise<void>((resolve, reject) => {
-    const timeout = setTimeout(() => {
-      reject(new Error(`Timed out waiting for exec command in pod ${namespace}/${podName}`));
-    }, 15000);
+  await new Promise<void>((resolve, reject) => {
+    let settled = false;
+    const finish = (fn: () => void) => {
+      if (settled) return;
+      settled = true;
+      cleanup();
+      fn();
+    };
+    const timeout = setTimeout(() => {
+      try {
+        ws.close();
+      } catch {
+        // no-op
+      }
+      finish(() => reject(new Error(`Timed out waiting for exec command in pod ${namespace}/${podName}`)));
+    }, 15000);
@@
-    const handleClose = () => {
-      cleanup();
-      resolve();
-    };
+    const handleClose = () => finish(resolve);
@@
-    const handleError = (error: Error) => {
-      cleanup();
-      reject(error);
-    };
+    const handleError = (error: Error) => finish(() => reject(error));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontends/dashboard/src/utils/k8s.ts` around lines 80 - 103, The timeout
handler currently rejects without releasing resources; modify the setTimeout
callback to call cleanup() before rejecting and also explicitly close the
websocket (ws.close()) to terminate the exec session; ensure cleanup still
clears the timeout and removes listeners (keeps clearTimeout(timeout) in
cleanup) so the timeout branch uses cleanup(), ws.close(), then reject(new
Error(`Timed out waiting for exec command in pod ${namespace}/${podName}`));
keep existing handleClose and handleError logic unchanged.


@coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@frontends/dashboard/src/routes/$namespace/$name/index.tsx`:
- Around line 225-230: The component currently keeps a cumulative raw log buffer
in logState and calls AnsiToHtml.toHtml on the whole buffer with stream: true,
causing style bleed, semantic misuse of streaming mode, and truncation
corruption; fix by (a) when switching pods (detect via activePodKey) reset or
recreate the AnsiToHtml instance so streaming state is cleared, and stop
reconverting the entire raw buffer on every render, or (b) change AnsiToHtml
instantiation to stream: false for render-time conversion, or better yet convert
incoming chunks incrementally before storing (store converted HTML chunks or
keep per-pod parser state) so only new data is passed to AnsiToHtml.toHtml;
update usages around logState, setLogState, activePodKey and the toHtml call
accordingly to ensure escapes are never sliced mid-sequence.
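Option (c) above — incremental conversion with per-pod parser state — can be sketched as below. The converter interface is injected so the example does not hard-depend on a specific ANSI library; in the dashboard you would pass a factory producing the real converter (e.g. ansi-to-html's `new Convert({ stream: true })`).

```typescript
export interface AnsiConverter {
  toHtml(chunk: string): string;
}

export class PodLogBuffer {
  private converter: AnsiConverter;
  private html = '';

  constructor(private makeConverter: () => AnsiConverter) {
    this.converter = makeConverter();
  }

  // Convert only the newly received chunk and append the resulting HTML,
  // instead of re-converting the whole raw buffer on every render.
  append(chunk: string): void {
    this.html += this.converter.toHtml(chunk);
  }

  render(): string {
    return this.html;
  }

  // On pod switch: recreate the converter so streaming state (e.g. open
  // color spans) from the previous pod cannot bleed into the new one.
  reset(): void {
    this.converter = this.makeConverter();
    this.html = '';
  }
}
```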

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: beecf8fc-1db8-419c-9d56-4319d34fd56b

📥 Commits

Reviewing files that changed from the base of the PR and between 4980779 and bb6211c.

📒 Files selected for processing (10)
  • extensions/apex-fusion/templates/_helpers.tpl
  • extensions/apex-fusion/templates/configmap-metrics.yaml
  • extensions/apex-fusion/templates/statefulset.yaml
  • extensions/cardano-node/templates/_helpers.tpl
  • extensions/cardano-node/templates/configmap-metrics.yaml
  • extensions/cardano-node/templates/statefulset.yaml
  • frontends/dashboard/@types/workload.d.ts
  • frontends/dashboard/src/routes/$namespace/$name/index.tsx
  • frontends/dashboard/src/utils/home/calls.ts
  • frontends/dashboard/src/utils/metricsFormat.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • frontends/dashboard/src/utils/home/calls.ts
  • frontends/dashboard/src/utils/metricsFormat.ts

Comment on lines +130 to +133
- name: metrics-scripts
mountPath: /opt/metis/bin/metrics.sh
subPath: metrics.sh
readOnly: true


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the apex-fusion statefulset.yaml file exists and examine the relevant lines
find . -path "*/extensions/apex-fusion/templates/statefulset.yaml" -type f 2>/dev/null | head -20

Repository: txpipe/metis

Length of output: 107


🏁 Script executed:

# Read the entire file to verify the specific lines and structure
cat -n ./extensions/apex-fusion/templates/statefulset.yaml

Repository: txpipe/metis

Length of output: 11800


🏁 Script executed:

# Check if the cardano-node statefulset.yaml exists and examine relevant sections
find . -path "*/extensions/cardano-node/templates/statefulset.yaml" -type f 2>/dev/null

Repository: txpipe/metis

Length of output: 108


🏁 Script executed:

# Read the cardano-node statefulset.yaml file
cat -n ./extensions/cardano-node/templates/statefulset.yaml

Repository: txpipe/metis

Length of output: 11649


🏁 Script executed:

# Let's verify the Kubernetes behavior regarding subPath ConfigMap mounts and live updates
# by searching web documentation

Repository: txpipe/metis

Length of output: 38


🌐 Web query:

Kubernetes ConfigMap subPath mount live updates behavior

💡 Result:

Kubernetes ConfigMap volumes mounted with subPath do not receive live updates when the ConfigMap changes; the mounted file retains its original content until the Pod (or container) is restarted. This is a documented limitation. Regular ConfigMap volume mounts (without subPath) use symlinks that the kubelet updates during periodic syncs (typically within ~1-2 minutes), enabling live propagation of changes to the mounted directory, though applications must watch for file changes or poll to detect updates. subPath mounts, however, use bind mounts for security isolation, which bind to the original inode and do not update automatically. This behavior is consistent across ConfigMaps, Secrets, and projected volumes using subPath. Official Kubernetes documentation explicitly states: "A container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates." This note appears in the ConfigMaps concepts page.

Workarounds include:

  • Avoid subPath for dynamic configs; mount the full ConfigMap volume to a separate directory and symlink.
  • Use tools like Reloader to trigger Pod restarts on ConfigMap changes.
  • Use immutable ConfigMaps with versioned names and update Deployments.

No changes to this behavior appear in recent Kubernetes versions (up to 2026); it remains a known limitation since at least v1.10.

Citations:


Add metrics checksum to roll ConfigMap changes into pod updates.

metrics.sh is mounted via subPath at lines 130-133. Kubernetes does not refresh subPath mounts when ConfigMaps change; the mounted file retains its original content until the pod restarts. This StatefulSet only includes checksums for proxy and configuration (lines 37-42), so edits to configmap-metrics.yaml won't trigger a rollout. Add a metrics checksum annotation to match the existing pattern, and apply the same fix to extensions/cardano-node/templates/statefulset.yaml (lines 130-133 and 220-223).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/apex-fusion/templates/statefulset.yaml` around lines 130 - 133,
The StatefulSet mounts metrics.sh via subPath so changes to the ConfigMap don't
trigger pod updates; add a metrics checksum annotation to the pod template
annotations (same pattern as the existing proxy/config checksums) that computes
a checksum of the metrics ConfigMap (the one providing metrics.sh) so edits to
configmap-metrics.yaml roll the pods. Update the StatefulSet template that
contains the metrics-scripts mount (reference metrics.sh and the pod template
annotations block where checksum/proxy and checksum/config are defined) and
apply the identical change to the other StatefulSet that mounts metrics.sh.

Comment on lines +36 to +39
metrics="$(curl -s --fail "http://127.0.0.1:{{ .Values.service.metricsPort }}/metrics" || wget -qO- "http://127.0.0.1:{{ .Values.service.metricsPort }}/metrics" || true)"
if [ -z "$metrics" ]; then
append_error "failed to read node metrics"
fi


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Locate the configmap-metrics.yaml files mentioned
find . -name "configmap-metrics.yaml" -type f

Repository: txpipe/metis

Length of output: 172


🏁 Script executed:

# Check the exact content of the cardano-node file at lines 36-39
if [ -f "extensions/cardano-node/templates/configmap-metrics.yaml" ]; then
  echo "=== cardano-node configmap-metrics.yaml (lines 30-45) ==="
  sed -n '30,45p' "extensions/cardano-node/templates/configmap-metrics.yaml"
fi

Repository: txpipe/metis

Length of output: 624


🏁 Script executed:

# Check if apex-fusion file exists and its metrics handling
if [ -f "extensions/apex-fusion/templates/configmap-metrics.yaml" ]; then
  echo "=== apex-fusion configmap-metrics.yaml (lines 30-45) ==="
  sed -n '30,45p' "extensions/apex-fusion/templates/configmap-metrics.yaml"
fi

Repository: txpipe/metis

Length of output: 623


🏁 Script executed:

# Search for curl/wget timeout configurations in the repo
echo "=== Searching for timeout flags in curl/wget commands ==="
rg "(curl|wget).*--connect-timeout|--max-time|--timeout|-m|--read-timeout" --type yaml

Repository: txpipe/metis

Length of output: 50368


🏁 Script executed:

# Search for dashboard-related code that might execute metrics.sh
echo "=== Searching for dashboard execution or metrics.sh references ==="
rg "metrics\.sh|exec.*metrics|workload.*details" --type yaml --type js --type ts -A 3

Repository: txpipe/metis

Length of output: 1872


🏁 Script executed:

# Read the full metrics.sh section from cardano-node to check for any timeout wrappers
sed -n '1,50p' "extensions/cardano-node/templates/configmap-metrics.yaml" | tail -30

Repository: txpipe/metis

Length of output: 1230


🏁 Script executed:

# Check if there's a timeout wrapper function or any timeout mechanism in the script
rg "timeout|TIMEOUT" extensions/cardano-node/templates/configmap-metrics.yaml

Repository: txpipe/metis

Length of output: 38


🏁 Script executed:

# Verify the complete metrics fetch section with more context
sed -n '30,50p' "extensions/cardano-node/templates/configmap-metrics.yaml"

Repository: txpipe/metis

Length of output: 948


🏁 Script executed:

# Get the full metrics.sh script to check for any outer timeout wrapper
wc -l "extensions/cardano-node/templates/configmap-metrics.yaml"

Repository: txpipe/metis

Length of output: 116


🏁 Script executed:

# Check for timeout at script/function level
sed -n '1,100p' "extensions/cardano-node/templates/configmap-metrics.yaml"

Repository: txpipe/metis

Length of output: 7677


Add timeouts to the local metrics fetch.

The curl/wget fallback at lines 36-39 lacks connect and read timeouts. If the node metrics endpoint stalls, the script will block indefinitely. Since the dashboard executes this script synchronously via bash -lc, a hung read will stall the workload details page. Add explicit timeout flags to both commands and apply the same fix to extensions/apex-fusion/templates/configmap-metrics.yaml.

Current code
metrics="$(curl -s --fail "http://127.0.0.1:{{ .Values.service.metricsPort }}/metrics" || wget -qO- "http://127.0.0.1:{{ .Values.service.metricsPort }}/metrics" || true)"
if [ -z "$metrics" ]; then
  append_error "failed to read node metrics"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/cardano-node/templates/configmap-metrics.yaml` around lines 36-39, the metrics fetch can hang because curl/wget lack timeouts; update the
command substitution that sets the metrics variable (metrics="$(...") to include
explicit connect and total/read timeouts for curl (e.g. --connect-timeout and
--max-time or -m) and equivalent timeout flags for wget (e.g. --timeout and
--tries=1) so the pipeline fails fast and triggers append_error "failed to read
node metrics" instead of blocking; apply the same change to the corresponding
metrics command in the apex-fusion config as well.

Comment on lines +225 to 230
const [logState, setLogState] = useState(() => ({
podKey: activePodKey,
value: '',
}));
const logs = logState.podKey === activePodKey ? logState.value : '';


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "index.tsx" -path "*/routes/*" | grep -E "namespace.*name"

Repository: txpipe/metis

Length of output: 181


🏁 Script executed:

cat -n frontends/dashboard/src/routes/\$namespace/\$name/index.tsx | head -100

Repository: txpipe/metis

Length of output: 5640


🏁 Script executed:

cat -n frontends/dashboard/src/routes/\$namespace/\$name/index.tsx | sed -n '150,180p'

Repository: txpipe/metis

Length of output: 1193


🏁 Script executed:

cat -n frontends/dashboard/src/routes/\$namespace/\$name/index.tsx | sed -n '220,240p'

Repository: txpipe/metis

Length of output: 1045


🏁 Script executed:

cat -n frontends/dashboard/src/routes/\$namespace/\$name/index.tsx | sed -n '270,290p'

Repository: txpipe/metis

Length of output: 692


🏁 Script executed:

cat -n frontends/dashboard/src/routes/\$namespace/\$name/index.tsx | sed -n '580,600p'

Repository: txpipe/metis

Length of output: 1008


🏁 Script executed:

# Check if ansi-to-html is used and how
rg "ansi-to-html|AnsiToHtml|toHtml" frontends/dashboard/src/routes/\$namespace/\$name/index.tsx -B2 -A2

Repository: txpipe/metis

Length of output: 588


🏁 Script executed:

# Check the package.json to see ansi-to-html version
cat frontends/dashboard/package.json | grep -A 2 -B 2 "ansi-to-html"

Repository: txpipe/metis

Length of output: 216


🏁 Script executed:

# Check if there's any reset/close method being called
rg "converter\.(reset|close|clear)" frontends/dashboard/src/routes/\$namespace/\$name/index.tsx

Repository: txpipe/metis

Length of output: 38


🏁 Script executed:

# Check ansi-to-html library documentation in node_modules or docs
find . -name "ansi-to-html" -type d | head -5

Repository: txpipe/metis

Length of output: 38


🏁 Script executed:

# Search for any prior converter state management
rg "stream.*true|converter" frontends/dashboard/src/routes/\$namespace/\$name/index.tsx -B2 -A2

Repository: txpipe/metis

Length of output: 407


🌐 Web query:

ansi-to-html stream mode state behavior reset method

💡 Result:

In the ansi-to-html library (rburns/ansi-to-html), stream mode is enabled by setting the stream: true option when creating a Convert instance. This saves style state (current colors, attributes like bold/underline) across multiple calls to toHtml, enabling proper handling of partial ANSI streams where chunks may split escape sequences or text. When stream: false (the default), each toHtml call processes input independently: it starts from default/reset state, applies ANSI codes within the input, and closes all open spans at the end.

Reset handling (ANSI \x1b[0m or similar reset codes) sets colors/attributes to defaults (fg: '#FFF', bg: '#000' by default, customizable). In stream mode this updates the persistent state for subsequent calls; in non-stream mode it only affects that call's output. CLI usage (the ansi-to-html command) automatically sets stream: true for streaming input like git log | ansi-to-html.

No explicit public "reset" method exists; state is reset via ANSI reset codes or by reinstantiating the Convert object. Issues note challenges with incomplete escape sequences in streams, but the core behavior maintains state continuity in stream mode. The README/docs on GitHub and npm confirm that the stream option explicitly "Save[s] style state across invocations of toHtml".



🏁 Script executed:

# Check if converter is used elsewhere in codebase
rg "converter\." frontends/dashboard/

Repository: txpipe/metis

Length of output: 218


Don't reparse the whole log buffer with a streaming ANSI parser on every render.

AnsiToHtml is configured with stream: true (lines 161–165), which preserves parser state across toHtml() calls. The component stores the full raw log buffer and reconverts that entire string on every render (line 586), causing:

  1. Style bleed across pod switches: When switching pods (line 229), the converter's state isn't reset, so colors and formatting from the previous pod apply to the new pod's logs.
  2. Semantic mismatch: Stream mode expects incremental chunk processing, not reprocessing the same cumulative buffer repeatedly.
  3. Truncation corruption: The 10k-character trim (line 280) uses slice(-10000) on raw ANSI text, which can bisect multi-character escape sequences like \x1b[38;5;123m, causing malformed output.

Either disable streaming for render-time conversion, convert only incremental chunks before storing them, or reset the converter instance when switching pods.
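The truncation half of this finding (point 3) can be addressed independently of the streaming question. Below is a hedged sketch of a cut-safe trim; the helper name trimAnsiSafe is hypothetical, and it only handles CSI sequences (ESC [ ... final byte), ignoring OSC and other escape families:

```typescript
// Hypothetical helper (not from the repo): trim a raw ANSI buffer to at most
// maxLen characters without cutting a CSI escape sequence in half. If the cut
// point lands inside a sequence, we advance past that sequence's final byte.
function trimAnsiSafe(buffer: string, maxLen: number): string {
  if (buffer.length <= maxLen) return buffer;
  let start = buffer.length - maxLen;
  // Find the last ESC at or before the cut point and check whether its
  // sequence is still open where we want to cut.
  const esc = buffer.lastIndexOf("\x1b", start);
  if (esc !== -1) {
    let end = esc + 1;
    if (buffer[end] === "[") {
      // A CSI sequence is ESC '[' then parameter/intermediate bytes,
      // terminated by a final byte in the range 0x40-0x7e.
      end++;
      while (
        end < buffer.length &&
        !(buffer.charCodeAt(end) >= 0x40 && buffer.charCodeAt(end) <= 0x7e)
      ) {
        end++;
      }
      end++; // include the final byte
    }
    if (end > start) start = end; // cut landed mid-sequence: skip past it
  }
  return buffer.slice(start);
}
```

For the state-bleed half, recreating the stream: true Convert instance whenever activePodKey changes (for example via a useMemo keyed on the pod) would clear the persisted style state; that wiring is omitted here.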

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontends/dashboard/src/routes/$namespace/$name/index.tsx` around lines 225-230, the component currently keeps a cumulative raw log buffer in logState and
calls AnsiToHtml.toHtml on the whole buffer with stream: true, causing style
bleed, semantic misuse of streaming mode, and truncation corruption; fix by (a)
when switching pods (detect via activePodKey) reset or recreate the AnsiToHtml
instance so streaming state is cleared, and stop reconverting the entire raw
buffer on every render, or (b) change AnsiToHtml instantiation to stream: false
for render-time conversion, or better yet convert incoming chunks incrementally
before storing (store converted HTML chunks or keep per-pod parser state) so
only new data is passed to AnsiToHtml.toHtml; update usages around logState,
setLogState, activePodKey and the toHtml call accordingly to ensure escapes are
never sliced mid-sequence.

@gonzalezzfelipe gonzalezzfelipe merged commit b230e64 into main Apr 16, 2026
15 checks passed
@gonzalezzfelipe gonzalezzfelipe deleted the feat/metrics-for-cnodes branch April 16, 2026 15:23