Merged
21 changes: 18 additions & 3 deletions extensions/apex-fusion/README.md
@@ -110,13 +110,28 @@ node:

Write the artifacts to Vault before installing or upgrading the chart:

If Vault is only reachable inside the cluster, use the local `vault` CLI with a
port-forward first:

```shell
kubectl get svc -n control-plane
kubectl -n control-plane port-forward service/control-plane-vault 8200:8200
export VAULT_ADDR=http://localhost:8200
```

Then log in locally using your normal operator auth flow and upload the runtime
artifacts from local files:

```shell
vault kv put kv/apex-fusion/prime-mainnet-bp/block-producer \
  kes.skey=@/absolute/path/to/kes.skey \
  vrf.skey=@/absolute/path/to/vrf.skey \
  op.cert=@/absolute/path/to/op.cert
```

Only `kes.skey`, `vrf.skey`, and `op.cert` belong in Vault for this workflow.
Keep cold keys and the operational certificate counter outside the cluster.
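One hedged way to confirm the upload without echoing key material is to list only the stored field names (this assumes a KV v2 mount and `jq` on the operator machine):

```shell
# List the field names at the secret path; the values stay hidden.
# The jq filter assumes the standard KV v2 JSON response shape.
vault kv get -format=json kv/apex-fusion/prime-mainnet-bp/block-producer \
  | jq -r '.data.data | keys[]'
```

The output should contain exactly `kes.skey`, `op.cert`, and `vrf.skey`; any extra field at that path is a sign that material which should stay outside the cluster was uploaded by mistake.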

When `node.blockProducer.enabled=true` the chart creates a namespace-local
`vault-auth` service account and a `VaultStaticSecret` that references the
shared `control-plane/default` Vault auth. Vault Secrets Operator syncs the
5 changes: 4 additions & 1 deletion extensions/apex-fusion/templates/configmap-metrics.yaml
@@ -156,6 +156,9 @@ data:

node_role="${METIS_NODE_ROLE:-relay}"
if [ "$node_role" = "block-producer" ]; then
leader_count="${leader_count:-0}"
adopted_count="${adopted_count:-0}"
forged_count="${forged_count:-0}"
pool_id="${METIS_POOL_ID:-}"
vrf_skey_path="${METIS_VRF_SKEY_PATH:-}"
op_cert_path="${METIS_OP_CERT_PATH:-}"
@@ -172,7 +175,7 @@
append_error "vrf signing key is not readable: $vrf_skey_path"
fi

if [ "${CARDANO_NETWORK:-}" = "mainnet" ] || [ "${CARDANO_NETWORK:-}" = "prime-mainnet" ] || [ "${CARDANO_NETWORK:-}" = "vector-mainnet" ]; then
network_arg_name="--mainnet"
elif [ -n "${METIS_CARDANO_NETWORK_MAGIC:-}" ]; then
network_arg_name="--testnet-magic"
3 changes: 3 additions & 0 deletions extensions/cardano-node/templates/configmap-metrics.yaml
@@ -156,6 +156,9 @@ data:

node_role="${METIS_NODE_ROLE:-relay}"
if [ "$node_role" = "block-producer" ]; then
leader_count="${leader_count:-0}"
adopted_count="${adopted_count:-0}"
forged_count="${forged_count:-0}"
pool_id="${METIS_POOL_ID:-}"
vrf_skey_path="${METIS_VRF_SKEY_PATH:-}"
op_cert_path="${METIS_OP_CERT_PATH:-}"
10 changes: 10 additions & 0 deletions frontends/dashboard/src/routes/$namespace/$name/index.tsx
@@ -100,6 +100,8 @@ const metricDescriptions = {
'Current epoch leader schedule luck, computed as scheduled leadership slots divided by the ideal expected slot count.',
nextBlock:
'Time remaining until the next scheduled leadership slot in the current epoch.',
forgedAdoptedLocal:
'Local node counters for blocks forged and adopted since startup. Useful for debugging, but not proof of canonical chain inclusion.',
kesSummary:
'Current KES period and how many periods remain before rotation is required.',
opCertSummary:
@@ -656,6 +658,14 @@ function WorkloadIdInfo() {
)}
description={metricDescriptions.nextBlock}
/>
<InfoCard
label="Forged / adopted (local)"
value={formatCountPair(
cardanoNodeMetrics.forgedCount,
cardanoNodeMetrics.adoptedCount,
)}
description={metricDescriptions.forgedAdoptedLocal}
/>
<InfoCard
label="KES current / remaining"
value={formatKesSummary(
24 changes: 24 additions & 0 deletions skills/README.md
@@ -0,0 +1,24 @@
# Skills

These files are intended for agents guiding users through Metis cluster operations.

All skills assume the repository in this workspace is the source of truth for the Helm charts and expected behavior.

Use these skills as follows:

- `kubernetes-extension-discovery.md`: discover which Metis extensions are already installed and whether prerequisites like `control-plane` are missing.
- `kubernetes-storage-and-prereqs.md`: validate storage classes, PVC behavior, node readiness, and scheduling prerequisites before installs or upgrades.
- `cardano-relay-setup.md`: install and validate a Cardano relay workload first.
- `cardano-block-producer-upgrade.md`: upgrade an existing relay to block-producer mode from an existing pool, using debug mode first, with explicit producer topology guidance.
- `cardano-block-producer-verification.md`: explain what can be verified today from the dashboard and what still requires external confirmation.
- `cardano-block-producer-troubleshooting.md`: diagnose cases where a producer looks healthy locally but recent pool blocks are missing from the canonical external chain view.
- `cardano-node-metrics-access.md`: read raw node metrics and the derived Metis metrics payload directly from a running pod via `kubectl exec`.
- `supernode-dashboard-port-forward.md`: expose the user-facing `supernode-dashboard` locally with `kubectl port-forward`, with Grafana and Prometheus as supporting debug paths.

Operational assumptions:

- `control-plane` is expected to exist for normal workflows.
- Agents should still verify cluster state rather than assume it blindly.
- Cold keys stay offline.
- Only runtime block-producer material belongs in-cluster.
- Relay topology can start on `image-default`, but producer topology should be explicit, relay-only, and private.
217 changes: 217 additions & 0 deletions skills/cardano-block-producer-troubleshooting.md
@@ -0,0 +1,217 @@
# Cardano Block Producer Troubleshooting

## Goal

Help an agent debug cases where a block producer looks healthy locally but recent pool blocks are not appearing on an external canonical chain view.

## Common Symptom Pattern

This skill applies when most or all of the following are true:

- producer pod is running
- `forging enabled` is true
- KES and op-cert metrics look healthy
- schedule metrics are present
- peer counts are non-zero
- the relay is healthy
- but an external canonical chain source shows no recent blocks for the pool

This should be interpreted as a contradiction between:

- local producer signals
- external canonical-chain observation

## What Local Metrics Mean

The current dashboard can already show local node counters:

- `forgedCount`
- `adoptedCount`

These come from raw node metrics:

- `cardano_node_metrics_Forge_forged_*`
- `cardano_node_metrics_Forge_adopted_*`

Important limitation:

- these are local node counters
- they do not prove that a block stayed on the canonical chain

They are useful for debugging, not for final production confirmation.
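To look at the raw counters directly, a quick sketch (the pod and namespace names are placeholders, and 12798 is the common cardano-node Prometheus port; adjust to your deployment):

```shell
NS=my-namespace   # placeholder
POD=producer-0    # placeholder

# Dump only the forge-related counters from the node's metrics endpoint.
kubectl -n "$NS" exec "$POD" -- sh -lc \
  'wget -qO- http://127.0.0.1:12798/metrics' | grep -E 'Forge_(forged|adopted)'
```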

## Initial Hypotheses To Challenge

When operators suspect this issue, they often blame:

- no peers
- broken producer topology
- epoch boundary problems

Those are reasonable hypotheses, but they should be checked rather than assumed.

## Debugging Flow

### 1. Check Producer Runtime

Confirm the producer is actually running in producer mode:

- `CARDANO_BLOCK_PRODUCER=true`
- runtime forging args are present
- `forging enabled` is true

Useful checks:

```bash
kubectl -n <namespace> exec <pod> -c <container> -- sh -lc 'env | grep -E "^(CARDANO|METIS)_"'
kubectl -n <namespace> exec <pod> -c <container> -- sh -lc 'curl -s --fail http://127.0.0.1:<metricsPort>/metrics || wget -qO- http://127.0.0.1:<metricsPort>/metrics'
```

### 2. Check KES And OP Cert State

Confirm:

- `KES current / remaining`
- `KES expiration`
- `OP Cert disk | chain`

If these are unhealthy, fix them before chasing peers or topology.
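As a sanity check on the dashboard numbers, the underlying arithmetic is simple. This sketch uses mainnet parameters (129600 slots per KES period, 62 max evolutions); the slot number and op-cert start period are made-up example inputs:

```shell
# Compute the current KES period and periods remaining before rotation.
kes_summary() {
  slot=$1; slots_per_period=$2; opcert_start=$3; max_evolutions=$4
  current=$((slot / slots_per_period))
  remaining=$((opcert_start + max_evolutions - current))
  echo "current=$current remaining=$remaining"
}

kes_summary 120960000 129600 900 62   # -> current=933 remaining=29
```

If the CLI is present in the image, `cardano-cli query kes-period-info --op-cert-file <path>` inside the pod gives the authoritative disk-versus-chain comparison.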

### 3. Check Current Epoch Schedule

Confirm:

- `Leader`
- `Ideal`
- `Luck`
- `Next Block in`

If these exist and look reasonable, the epoch boundary is probably not the main issue.
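The luck figure is just the ratio described by the dashboard metric: scheduled leadership slots divided by the ideal expected slot count. A minimal sketch with example numbers:

```shell
# Luck = scheduled leadership slots / ideal expected slots, as a percentage.
luck_pct() {
  awk -v leader="$1" -v ideal="$2" 'BEGIN { printf "%.1f\n", leader / ideal * 100 }'
}

luck_pct 3 2.5   # -> 120.0
```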

### 4. Check Local Forging Counters

Inspect:

- `Forge_node_is_leader_*`
- `Forge_forged_*`
- `Forge_adopted_*`

Interpretation:

- if these are all zero, the producer has not yet converted any scheduled slots into actual leadership wins
- if they are increasing, the producer is doing local work even if external sources show no recent pool blocks
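Counters only mean something over time. This sketch extracts the forged counter from raw metrics text so two samples taken a few minutes apart can be compared; the metric name comes from the raw metrics listed earlier, and the example input is a fabricated snapshot:

```shell
# Pull the forged counter out of raw metrics text.
forged_from_metrics() {
  awk '/cardano_node_metrics_Forge_forged/ { print $2; exit }'
}

# Example with a captured snapshot; in practice, pipe in
#   kubectl exec ... -- sh -lc 'wget -qO- http://127.0.0.1:12798/metrics'
# twice, a few minutes apart, and compare the two values.
printf 'cardano_node_metrics_Forge_forged_int 7\n' | forged_from_metrics   # -> 7
```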

### 5. Check Producer Topology

Inspect the active `topology.json` inside the producer pod.

Recommended producer pattern:

- explicit topology
- relay-only `localRoots`
- empty `publicRoots`
- `useLedgerAfterSlot = -1`

Avoid:

- image-default topology on producers
- mixed local roots where one target is not operator-controlled
- public relay roots on producers
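A quick structural summary of the active topology catches most of the anti-patterns above. The mount path is an assumption, and the jq filter assumes the standard P2P topology shape:

```shell
NS=my-namespace   # placeholder
POD=producer-0    # placeholder

# Summarise root counts and the ledger-peers setting from the live file.
kubectl -n "$NS" exec "$POD" -- sh -lc 'cat /config/topology.json' \
  | jq -c '{localRoots: (.localRoots | length), publicRoots: (.publicRoots | length), useLedgerAfterSlot}'
```

For a tight producer topology this should show a single operator-controlled local root, zero public roots, and `useLedgerAfterSlot` of `-1`.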

### 6. Check Actual Connections, Not Just Counts

Peer counts alone are not enough. Inspect the actual active sockets.

Useful checks:

```bash
kubectl -n <namespace> exec <pod> -c <container> -- sh -lc 'ss -tnp 2>/dev/null || netstat -tnp 2>/dev/null || true'
```

Confirm the producer is connected to the intended relay path.
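To tie the socket list back to the intended relay, resolve the relay service's cluster IP first and look for it among the established connections (all names are placeholders):

```shell
NS=my-namespace          # placeholder
POD=producer-0           # placeholder
RELAY_SVC=relay-service  # placeholder

relay_ip=$(kubectl -n "$NS" get svc "$RELAY_SVC" -o jsonpath='{.spec.clusterIP}')
if [ -n "$relay_ip" ]; then
  # Look for an established TCP socket from the producer to the relay IP.
  kubectl -n "$NS" exec "$POD" -- sh -lc \
    "ss -tn state established 2>/dev/null | grep -F $relay_ip || echo 'no established socket to relay'"
else
  echo "could not resolve relay service cluster IP"
fi
```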

### 7. Check Relay Health

Confirm the relay:

- is in sync
- has healthy external peers
- is actually receiving the producer connection

If the relay is unhealthy, the producer may still look locally healthy while blocks fail to propagate well.
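One way to check relay sync without a dashboard is to compare slot numbers between relay and producer; `cardano_node_metrics_slotNum_int` is the standard node metric, and the port and pod names are assumptions:

```shell
# Extract the current slot from raw metrics text.
slot_from_metrics() {
  awk '/cardano_node_metrics_slotNum_int/ { print $2; exit }'
}

NS=my-namespace          # placeholder
RELAY_POD=relay-0        # placeholder
PRODUCER_POD=producer-0  # placeholder

relay_slot=$(kubectl -n "$NS" exec "$RELAY_POD" -- sh -lc 'wget -qO- http://127.0.0.1:12798/metrics' | slot_from_metrics)
producer_slot=$(kubectl -n "$NS" exec "$PRODUCER_POD" -- sh -lc 'wget -qO- http://127.0.0.1:12798/metrics' | slot_from_metrics)
echo "relay=$relay_slot producer=$producer_slot"
```

The two values should be within a few slots of each other; a relay lagging far behind explains poor propagation even when the producer looks healthy in isolation.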

### 8. Compare Against External Canonical Chain View

Use an external source that is trusted for canonical-chain observation.

If an external Blockfrost-compatible source is available, query it directly:

```bash
curl "$EXTERNAL_BLOCKFROST/pools/<pool-id>"
curl "$EXTERNAL_BLOCKFROST/pools/<pool-id>/blocks?order=desc&count=5"
```

Important limitation:

- this debugging flow still depends on an external source to confirm canonical pool blocks
- until native block outcome tracking exists, external confirmation is required

## How To Interpret Contradictions

### Case A. No peers, no forge activity

Likely causes:

- broken topology
- wrong relay target
- network/firewall issue

### Case B. Peers exist, schedule exists, forging enabled is true, but local forge counters stay zero

Likely causes:

- no winning slots in the observed window
- wrong pool identity or pool registration mismatch

### Case C. Peers exist, schedule exists, local forged/adopted counters increase, but external chain shows no recent pool blocks

This is the most subtle and important case.

Likely interpretation:

- the producer is forging or locally adopting blocks
- but those blocks are not surviving onto the canonical public chain

Probable causes to investigate next:

- propagation path / topology design
- producer favoring the wrong local root
- relay path not being the intended one
- non-canonical outcomes such as ghosted blocks or lost slot/height battles

Without native block outcome tracking, this cannot be classified exactly from local metrics alone.

## Recommended Fix Pattern

When topology is suspicious, simplify it aggressively.

Preferred producer topology:

- connect only to operator-controlled relay services
- remove extra custom local roots unless they are truly part of the intended architecture
- keep `publicRoots` empty
- keep `useLedgerAfterSlot = -1`

If using Metis managed topology, prefer `relay-service` mode.
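In P2P topology terms, the preferred shape above looks roughly like this; the service name, port, and valency are placeholders rather than chart output:

```json
{
  "localRoots": [
    {
      "accessPoints": [
        { "address": "relay-service.my-namespace.svc.cluster.local", "port": 3001 }
      ],
      "advertise": false,
      "valency": 1
    }
  ],
  "publicRoots": [],
  "useLedgerAfterSlot": -1
}
```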

## What This Skill Does Not Replace

This skill does not replace true local block outcome verification.

It helps agents distinguish between:

- basic runtime failure
- topology/connectivity issues
- local forge/adopt signals without canonical-chain confirmation

Until native block outcome tracking exists, final production confirmation still depends on an external source.