OCPEDGE-2705: Add TNF Pacemaker metrics controller#1634
Conversation
Implement a metrics controller that projects PacemakerCluster CR conditions into Prometheus gauges for the TNF monitoring pipeline. The controller shares the existing pacemakerInformer with HealthCheck, registers 20 metric families (~53 series on a 2-node cluster), and uses table-driven sync logic with GaugeVec.Reset() to prevent stale data from removed nodes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@fonta-rh: This pull request references OCPEDGE-2705 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughA new ChangesPacemaker Metrics Controller
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/tnf/pkg/metriccontroller/controller_test.go (1)
21-68: 💤 Low valueConsider extracting fakeInformer to a shared test utility.
The comment on line 22 notes this is copied from
pacemaker/healthcheck_test.go. To reduce duplication, consider extracting this to a shared test helper package (e.g.,pkg/tnf/pkg/testutil) so both test files can reuse the same implementation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/metriccontroller/controller_test.go` around lines 21 - 68, The fakeInformer struct and createFakeInformer function are duplicated between controller_test.go and pacemaker/healthcheck_test.go as noted in the comment. Extract both the createFakeInformer function and the fakeInformer type definition (along with all its method implementations) to a shared test utility package such as pkg/tnf/pkg/testutil. Update both test files to import and use the shared implementation instead of maintaining duplicate code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/starter.go`:
- Around line 320-328: The type assertion on legacyregistry.DefaultGatherer to
metrics.KubeRegistry is unsafe and will cause a panic if the assertion fails at
runtime. Replace the bare type assertion with the comma-ok idiom by capturing
both the converted value and a boolean success indicator, then check the boolean
before proceeding. If the assertion fails, handle the error appropriately (e.g.,
log an error and return or exit) before creating the metricsController with
NewPacemakerMetricsController.
---
Nitpick comments:
In `@pkg/tnf/pkg/metriccontroller/controller_test.go`:
- Around line 21-68: The fakeInformer struct and createFakeInformer function are
duplicated between controller_test.go and pacemaker/healthcheck_test.go as noted
in the comment. Extract both the createFakeInformer function and the
fakeInformer type definition (along with all its method implementations) to a
shared test utility package such as pkg/tnf/pkg/testutil. Update both test files
to import and use the shared implementation instead of maintaining duplicate
code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6cfb780f-0e07-4d69-909e-d7f8d653442a
📒 Files selected for processing (4)
pkg/tnf/operator/starter.gopkg/tnf/pkg/metriccontroller/controller.gopkg/tnf/pkg/metriccontroller/controller_test.gopkg/tnf/pkg/metriccontroller/metrics.go
| // Create and start the metrics controller, sharing the same informer | ||
| klog.Infof("creating Pacemaker metrics controller") | ||
| metricsController := metriccontroller.NewPacemakerMetricsController( | ||
| pacemakerInformer, | ||
| controllerContext.EventRecorder, | ||
| legacyregistry.DefaultGatherer.(metrics.KubeRegistry), | ||
| ) | ||
| klog.Infof("starting Pacemaker metrics controller") | ||
| go metricsController.Run(ctx, 1) |
There was a problem hiding this comment.
Bare type assertion can panic the operator.
Line 325 uses an unsafe type assertion without checking success. If legacyregistry.DefaultGatherer does not implement metrics.KubeRegistry at runtime, the operator will panic and crash.
🔒 Proposed fix using comma-ok idiom
// Create and start the metrics controller, sharing the same informer
klog.Infof("creating Pacemaker metrics controller")
+ registry, ok := legacyregistry.DefaultGatherer.(metrics.KubeRegistry)
+ if !ok {
+ klog.Errorf("failed to cast DefaultGatherer to KubeRegistry, metrics controller will not start")
+ return
+ }
metricsController := metriccontroller.NewPacemakerMetricsController(
pacemakerInformer,
controllerContext.EventRecorder,
- legacyregistry.DefaultGatherer.(metrics.KubeRegistry),
+ registry,
)
klog.Infof("starting Pacemaker metrics controller")
go metricsController.Run(ctx, 1)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Create and start the metrics controller, sharing the same informer | |
| klog.Infof("creating Pacemaker metrics controller") | |
| metricsController := metriccontroller.NewPacemakerMetricsController( | |
| pacemakerInformer, | |
| controllerContext.EventRecorder, | |
| legacyregistry.DefaultGatherer.(metrics.KubeRegistry), | |
| ) | |
| klog.Infof("starting Pacemaker metrics controller") | |
| go metricsController.Run(ctx, 1) | |
| // Create and start the metrics controller, sharing the same informer | |
| klog.Infof("creating Pacemaker metrics controller") | |
| registry, ok := legacyregistry.DefaultGatherer.(metrics.KubeRegistry) | |
| if !ok { | |
| klog.Errorf("failed to cast DefaultGatherer to KubeRegistry, metrics controller will not start") | |
| return | |
| } | |
| metricsController := metriccontroller.NewPacemakerMetricsController( | |
| pacemakerInformer, | |
| controllerContext.EventRecorder, | |
| registry, | |
| ) | |
| klog.Infof("starting Pacemaker metrics controller") | |
| go metricsController.Run(ctx, 1) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/starter.go` around lines 320 - 328, The type assertion on
legacyregistry.DefaultGatherer to metrics.KubeRegistry is unsafe and will cause
a panic if the assertion fails at runtime. Replace the bare type assertion with
the comma-ok idiom by capturing both the converted value and a boolean success
indicator, then check the boolean before proceeding. If the assertion fails,
handle the error appropriately (e.g., log an error and return or exit) before
creating the metricsController with NewPacemakerMetricsController.
Both metriccontroller and pacemaker tests had identical 47-line fakeInformer implementations. Move to pkg/tnf/internal/testutil/ so future interface changes only need one update. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/internal/testutil/fakeinformer.go`:
- Around line 19-21: The `store.Add(obj)` call in the function containing the if
statement is suppressing the error with the blank identifier (_), which can hide
test setup failures. Modify the function signature to return an error type, and
instead of discarding the error, return it if `store.Add(obj)` fails. Then
update all five call sites of this function across controller_test.go and
healthcheck_test.go to handle the returned error appropriately by checking for
nil and failing the test if an error is returned.
In `@pkg/tnf/pkg/pacemaker/healthcheck_test.go`:
- Around line 53-59: The test setup in createTestHealthCheckWithMockStatus has
an implicit coupling between how the fake informer indexes objects (by
PacemakerCluster.Name) and how getPacemakerStatus fetches them (by
PacemakerClusterResourceName constant). Make this contract explicit by setting
mockStatus.Name to match the expected lookup key before passing it to
testutil.CreateFakeInformer(mockStatus), ensuring the mock object can be found
when getPacemakerStatus attempts to retrieve it by its fixed resource name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9c1e7295-27cc-4c14-9264-76fe10a05aa9
📒 Files selected for processing (3)
pkg/tnf/internal/testutil/fakeinformer.gopkg/tnf/pkg/metriccontroller/controller_test.gopkg/tnf/pkg/pacemaker/healthcheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/pkg/metriccontroller/controller_test.go
| if obj != nil { | ||
| _ = store.Add(obj) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify all call sites are updated if CreateFakeInformer signature changes.
rg -n -C2 '\bCreateFakeInformer\s*\(' --type goRepository: openshift/cluster-etcd-operator
Length of output: 3025
🏁 Script executed:
cat -n pkg/tnf/internal/testutil/fakeinformer.goRepository: openshift/cluster-etcd-operator
Length of output: 2666
🏁 Script executed:
# Check if the function signature has already been changed to return error
rg -A 15 'func CreateFakeInformer' pkg/tnf/internal/testutil/fakeinformer.goRepository: openshift/cluster-etcd-operator
Length of output: 489
Handle store.Add errors instead of discarding them
Line 20 suppresses the error from store.Add(obj) with _, which can silently hide test setup failures. Per coding guidelines, "Never ignore error returns."
The fix requires changing the function signature to return an error and updating all 5 call sites in controller_test.go and healthcheck_test.go to handle the new error return.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/internal/testutil/fakeinformer.go` around lines 19 - 21, The
`store.Add(obj)` call in the function containing the if statement is suppressing
the error with the blank identifier (_), which can hide test setup failures.
Modify the function signature to return an error type, and instead of discarding
the error, return it if `store.Add(obj)` fails. Then update all five call sites
of this function across controller_test.go and healthcheck_test.go to handle the
returned error appropriately by checking for nil and failing the test if an
error is returned.
Source: Coding guidelines
| func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *pacmkrv1.PacemakerCluster) *HealthCheck { | ||
| return &HealthCheck{ | ||
| operatorClient: createFakeOperatorClient(), | ||
| kubeClient: fake.NewSimpleClientset(), | ||
| eventRecorder: events.NewInMemoryRecorder("test", clock.RealClock{}), | ||
| pacemakerInformer: createFakeInformer(mockStatus), | ||
| pacemakerInformer: testutil.CreateFakeInformer(mockStatus), | ||
| recordedEvents: make(map[string]time.Time), |
There was a problem hiding this comment.
Make the informer key contract explicit in test setup
CreateFakeInformer indexes by PacemakerCluster.Name, while getPacemakerStatus fetches with PacemakerClusterResourceName (cluster). Set the fixture name here to avoid implicit coupling and brittle tests.
Suggested hardening
func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *pacmkrv1.PacemakerCluster) *HealthCheck {
+ status := mockStatus.DeepCopy()
+ if status.Name == "" {
+ status.Name = PacemakerClusterResourceName
+ }
return &HealthCheck{
operatorClient: createFakeOperatorClient(),
kubeClient: fake.NewSimpleClientset(),
eventRecorder: events.NewInMemoryRecorder("test", clock.RealClock{}),
- pacemakerInformer: testutil.CreateFakeInformer(mockStatus),
+ pacemakerInformer: testutil.CreateFakeInformer(status),
recordedEvents: make(map[string]time.Time),
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *pacmkrv1.PacemakerCluster) *HealthCheck { | |
| return &HealthCheck{ | |
| operatorClient: createFakeOperatorClient(), | |
| kubeClient: fake.NewSimpleClientset(), | |
| eventRecorder: events.NewInMemoryRecorder("test", clock.RealClock{}), | |
| pacemakerInformer: createFakeInformer(mockStatus), | |
| pacemakerInformer: testutil.CreateFakeInformer(mockStatus), | |
| recordedEvents: make(map[string]time.Time), | |
| func createTestHealthCheckWithMockStatus(t *testing.T, mockStatus *pacmkrv1.PacemakerCluster) *HealthCheck { | |
| status := mockStatus.DeepCopy() | |
| if status.Name == "" { | |
| status.Name = PacemakerClusterResourceName | |
| } | |
| return &HealthCheck{ | |
| operatorClient: createFakeOperatorClient(), | |
| kubeClient: fake.NewSimpleClientset(), | |
| eventRecorder: events.NewInMemoryRecorder("test", clock.RealClock{}), | |
| pacemakerInformer: testutil.CreateFakeInformer(status), | |
| recordedEvents: make(map[string]time.Time), | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/pkg/pacemaker/healthcheck_test.go` around lines 53 - 59, The test
setup in createTestHealthCheckWithMockStatus has an implicit coupling between
how the fake informer indexes objects (by PacemakerCluster.Name) and how
getPacemakerStatus fetches them (by PacemakerClusterResourceName constant). Make
this contract explicit by setting mockStatus.Name to match the expected lookup
key before passing it to testutil.CreateFakeInformer(mockStatus), ensuring the
mock object can be found when getPacemakerStatus attempts to retrieve it by its
fixed resource name.
Summary
pacemakerInformerwith HealthCheck — zero additional API server loadGaugeVec.Reset()to prevent stale data from removed nodesDetails
New package
pkg/tnf/pkg/metriccontroller/with three files:metrics.goconditionToFloat64helpercontroller.gometrics.KubeRegistryinjection, sync logiccontroller_test.goWiring in
pkg/tnf/operator/starter.go— controller created after HealthCheck, sharingpacemakerInformer.Key design decisions
EtcdCertSignerControllerpattern)== 0vsabsent())StabilityLevel: metrics.ALPHA— explicit on all metricsMetrics exposed
GaugeGaugeVecnodeGaugeVecnode,resourceTest plan
go build ./...— all binaries compilego test ./pkg/tnf/pkg/metriccontroller/ -v -count=5— 55/55 pass, no registration pollutiongo test ./pkg/tnf/... -count=1— 454 existing TNF tests pass (no regressions)make build— cleanmake test-unit— all tests pass with-race:8443/metricsRelated
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
PacemakerCluster, including cluster conditions, per-node online status, and per-resource state.Tests