Conversation

@schmikei
Contributor

@schmikei schmikei commented Oct 29, 2025

This modernizes the CouchDB Mixin to use newer libraries.

Overview: [dashboard screenshots]

Nodes: [dashboard screenshots]

Logs: [dashboard screenshot]

@schmikei schmikei marked this pull request as ready for review October 29, 2025 21:19
@schmikei schmikei requested a review from a team as a code owner October 29, 2025 21:19
Member

@Dasomeone Dasomeone left a comment

Going to head out in a second, so I'll publish my comments so far! I'll pick it back up in the morning.
@aalhour I'd also appreciate a second pair of eyes on this one :)

Contributor

@aalhour aalhour left a comment

I caught some issues, thanks for pushing this.

expr: |||
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
Contributor

Isn't the 4.* regex more general than the 4..?

Contributor Author

https://github.com/grafana/jsonnet-libs/actions/runs/19470908585/job/55717683328?pr=1522

Unfortunately it's considered a messy selector. For some reason the lint passes locally for me, but it fails in CI.

Member

This fails because of the Pint linter rules, which take a bit more setup to install locally. I've no strong feelings about this either way. On one hand it's good to be specific about the three-digit codes; on the other, it could be an issue if the values are mixed with text, e.g. 404NotFound as a label.

I doubt we will run into the latter, so for now I think 4.. is perfectly fine.
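
For reference, PromQL label regexes are fully anchored, so the two matchers only diverge on values longer than three characters; a minimal sketch (the 404NotFound value is hypothetical):

local strict = 'couchdb_httpd_status_codes{code=~"4.."}';  // only three-character values: "404", "429", ...
local loose = 'couchdb_httpd_status_codes{code=~"4.*"}';  // also matches longer values, e.g. a hypothetical "404NotFound"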

cluster+: {
allValue: '.*',
},
couchb_cluster+: {
Contributor

couchb --> couchdb, missing d character.

Contributor Author

Ah yep thank you for the catch!

{
alert: 'CouchDBUnhealthyCluster',
expr: |||
min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s
Contributor

Should the alerts use the filteringSelector from the config? Just wondering.

expr: |||
  min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
||| % this.config,

Contributor Author

Went ahead and implemented these!

alertsCriticalClusterIsUnstable5m: 1, //1 is stable
alertsWarning4xxResponseCodes5m: 5,
alertsCritical5xxResponseCodes5m: 0,
alertsWarningRequestLatency5m: 500, //ms
Contributor

This is in milliseconds, which means that the alert threshold is broken.

{
alert: 'CouchDBHighRequestLatency',
expr: |||
sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s
Contributor

I know you didn't change this, but I think there is a bug in the calculation: sum / count is in seconds, while the config declares the threshold in milliseconds (500), so we should multiply the value by 1000 before the compare operator. I think this is the second time I've caught this bug; it's not on you, it's there in the original implementations.

expr: |||
  sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) * 1000 > %(alertsWarningRequestLatency5m)s
||| % this.config,

Alternatively, we can change the alert threshold in the config. I am fine with either solution.

Contributor Author

Ah yep, I didn't really audit the content of the queries too much. I'll go ahead and take a stab at a fix.
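
For reference, a sketch of what the corrected alert could look like, combining the * 1000 conversion above with the filteringSelector threading discussed earlier (assuming the same config fields):

{
  alert: 'CouchDBHighRequestLatency',
  expr: |||
    sum by(job, instance) (couchdb_request_time_seconds_sum{%(filteringSelector)s} / couchdb_request_time_seconds_count{%(filteringSelector)s}) * 1000 > %(alertsWarningRequestLatency5m)s
  ||| % this.config,
}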

unit: 'requests',
sources: {
prometheus: {
expr: 'sum by(' + groupLabelAggTerm + ') (couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"})',
Contributor

Why are we not using increase(...) like the goodResponseStatuses signal above it? There's also no interval or offset. Reading them, I get the feeling that they are complementary, but their expressions are not really consistent.
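
If consistency is the goal, one option is a sketch along these lines, using Grafana's built-in $__rate_interval and the three-digit regex agreed on above (whether the sibling signal also applies an offset isn't visible in this excerpt):

expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].."}[$__rate_interval]))',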

+ g.query.prometheus.withInstant(true)
+ g.query.prometheus.withFormat('timeseries'),
])
+ g.panel.gauge.queryOptions.withDatasource('prometheus', '${' + this.grafana.variables.datasources.prometheus.name + '}')
Contributor

Do we really want to specify the datasource here? I think it should be derived from the board config and selectors.

expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
},
},
},
Contributor

I see that these signals don't specify a rate(...)[$__rate_interval] like other counterparts in overview.libsonnet. Does the function also break the panels for you? At least for me, with the Databricks data, it can be a gauge type and that might break it. But here I see that these are counters 🤔

Contributor Author

These are counters, so the rate gets auto-added in my experience!
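
For context, a minimal sketch of a counter-typed signal, mirroring the structure quoted above (the signal name and unit here are illustrative); declaring type: 'counter' is what lets the library apply the rate function when rendering:

viewReads: {
  name: 'View reads',
  type: 'counter',  // counter signals get the rate function applied automatically
  unit: 'rps',
  sources: {
    prometheus: {
      expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
    },
  },
},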

Member

@Dasomeone Dasomeone left a comment

I gave this another pass while checking with the sample-app, and I realised the node dashboard is pretty cluttered and the panels hard to read.
I've left a couple of comments, but I think it's a more overarching issue for the node views, so happy to set up a call to discuss between us :)

Member

@Dasomeone Dasomeone left a comment

Couple more minor things, but it's almost there from my perspective!

sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s,code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % this.config,
sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % (this.config { filteringSelector: if this.config.filteringSelector != '' then this.config.filteringSelector + ',code=~"4.."' else 'code=~"4.."' }),
Member

It's only here twice, so I don't feel strongly about it, but if we need to do something similar in other mixins then let's extract that to a variable.
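
A sketch of what that extraction could look like (the helper name is hypothetical):

// Append an extra matcher to the configured filteringSelector,
// handling the empty-selector case.
local withExtraMatcher(config, matcher) =
  config {
    filteringSelector:
      if config.filteringSelector != ''
      then config.filteringSelector + ',' + matcher
      else matcher,
  };

// usage: ||| % withExtraMatcher(this.config, 'code=~"4.."')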

Member

@Dasomeone Dasomeone left a comment

Looks good to me, thanks for all the iteration on this one Keith!

Contributor

@aalhour aalhour left a comment

LGTM, thanks a lot!

"label": "Level",
"multi": true,
"name": "level",
"query": "label_values({,job=~\"$job\",couchdb_cluster=~\"$couchdb_cluster\",cluster=~\"$cluster\"}, level)",
Contributor Author
The reason will be displayed to describe this comment to others. Learn more.

@Dasomeone @aalhour I'm noticing that with the empty filteringSelector we get this invalid label matcher. Are we okay with expecting that people will probably override the filteringSelector? Otherwise, setting it to something like job=~".+" could also be appropriate.

This only affects logs dashboards as far as I can tell.
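
For reference, a sketch of that fallback idea (the default value is just the one floated above, not a settled choice):

{
  // A match-all default keeps the generated selector syntactically valid,
  // e.g. label_values({job=~".+", job=~"$job", ...}, level)
  filteringSelector: 'job=~".+"',
}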
