chore: Modernize the Apache CouchDB mixin #1522
Conversation
Dasomeone left a comment:
Going to head out in a second, so I'll publish my comments so far! Going to pick it back up in the morning.
@aalhour would also appreciate a second pair of eyes on this one :)
…ouchdb-modernization
…nnet-libs into chore/couchdb-modernization
aalhour left a comment:
I caught some issues, thanks for pushing this.
```diff
- expr: |||
-   sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
- ||| % $._config,
+ sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
```
Isn't the 4.* regex more general than 4..?
https://github.com/grafana/jsonnet-libs/actions/runs/19470908585/job/55717683328?pr=1522
Unfortunately it's considered a messy selector. For some reason the lint passes locally for me, but it fails in CI.
This fails because of the Pint linter rules, which take a bit more setup to install locally. I have no strong feelings about this either way: on one hand it's good to be specific about the three-digit status codes; on the other, it could be an issue if codes are mixed with text, e.g. a label value like 404NotFound.
I doubt we will run into the latter, so for now I think 4.. is perfectly fine.
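For context, Prometheus label regexes are fully anchored, so the two selectors only differ on values longer than three characters. A hypothetical illustration (metric and label names taken from the diff above):

```jsonnet
// Illustration only: Prometheus anchors label regexes as ^(...)$.
// `4..` matches exactly three characters ("404" yes, "404NotFound" no),
// while `4.*` matches anything starting with "4", including "404NotFound".
{
  strict: 'couchdb_httpd_status_codes{code=~"4.."}',
  loose: 'couchdb_httpd_status_codes{code=~"4.*"}',
}
```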
apache-couchdb-mixin/mixin.libsonnet (Outdated)
```jsonnet
cluster+: {
  allValue: '.*',
},
couchb_cluster+: {
```
couchb --> couchdb, missing the d character.
Ah yep, thank you for the catch!
```jsonnet
{
  alert: 'CouchDBUnhealthyCluster',
  expr: |||
    min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s
```
Should the alerts use the filteringSelector from the config? Just wondering.
```jsonnet
expr: |||
  min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
||| % this.config,
```
Went ahead and implemented these!
```jsonnet
alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
alertsWarning4xxResponseCodes5m: 5,
alertsCritical5xxResponseCodes5m: 0,
alertsWarningRequestLatency5m: 500,  // ms
```
This is in milliseconds, which means that the alert threshold is broken.
```jsonnet
{
  alert: 'CouchDBHighRequestLatency',
  expr: |||
    sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s
```
I know that you didn't change this, but I think there is a bug in the calculation: sum / count is in seconds, while the config declares the threshold in milliseconds (500), so we should multiply the value before the comparison operator by 1000. I think this is the second time I've caught this bug, which is not on you; it's there in the original implementation.
```jsonnet
expr: |||
  sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) * 1000 > %(alertsWarningRequestLatency5m)s
||| % this.config,
```
Alternatively, we can change the alert threshold in the config. I am fine with either solution.
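For illustration, the config-side alternative could look like this sketch (the 0.5 value is the assumed seconds equivalent of the current 500 ms threshold):

```jsonnet
// Hypothetical alternative: declare the threshold in seconds so the
// expression stays in base units and needs no * 1000 conversion.
alertsWarningRequestLatency5m: 0.5,  // seconds (was 500, in ms)
```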
Ah yep, I didn't really audit the content of the queries too closely. I'll go ahead and take a stab at a fix.
```jsonnet
unit: 'requests',
sources: {
  prometheus: {
    expr: 'sum by(' + groupLabelAggTerm + ') (couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"})',
```
Why are we not using increase(...) like the goodResponseStatuses signal above it? Also, there is neither an interval nor an offset here. At least when I read them, I get the feeling they are meant to be complementary, but their expressions are not really consistent.
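For reference, a consistent form might look something like this hedged sketch (the hard-coded [5m] window is an assumption; the signal grammar may expose an interval option instead):

```jsonnet
// Hypothetical: mirror the goodResponseStatuses shape with increase(...).
expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"}[5m]))',
```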
```jsonnet
+ g.query.prometheus.withInstant(true)
+ g.query.prometheus.withFormat('timeseries'),
])
+ g.panel.gauge.queryOptions.withDatasource('prometheus', '${' + this.grafana.variables.datasources.prometheus.name + '}')
```
Do we really want to specify the datasource here? I think it should be derived from the board config and selectors.
```jsonnet
expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
      },
    },
  },
```
I see that these signals don't specify a rate(...)[$__rate_interval] like their counterparts in overview.libsonnet; does the function also break the panels for you? At least for me, with the databricks data it can be a gauge type and it might break it. But here I see that these are counters 🤔
These are counters, so the rate gets auto-added in my experience!
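For context, a minimal sketch of how such a counter signal is declared, assuming the common-lib signal grammar in which type: 'counter' makes the library wrap the query in rate(...[$__rate_interval]) at render time:

```jsonnet
// Sketch under the assumption above; the expr is taken from this diff,
// the 'counter' type and automatic rate-wrapping are the assumption.
viewReads: {
  name: 'View reads',
  type: 'counter',
  unit: 'rps',
  sources: {
    prometheus: {
      expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
    },
  },
},
```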
Dasomeone left a comment:
I gave this another pass while checking with the sample app, and I realised the node dashboard is pretty cluttered and the panels are hard to read.
I've left a couple of comments, but I think it's a more overarching issue for the node views, so happy to set up a call to discuss between us :)
Dasomeone left a comment:
Couple more minor things, but it's almost there from my perspective!
…; fix log link in dashboards
…ouchdb-modernization
```diff
-     sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s,code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
-   ||| % this.config,
+     sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s}[5m])) > %(alertsWarning4xxResponseCodes5m)s
+   ||| % (this.config { filteringSelector: if this.config.filteringSelector != '' then this.config.filteringSelector + ',code=~"4.."' else 'code=~"4.."' }),
```
It's only here twice, so I don't feel strongly about it, but if we need to do something similar in other mixins then let's extract that to a variable.
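If it does come up again, a hypothetical extraction could look like this (the helper name is illustrative):

```jsonnet
// Hypothetical helper: append an extra matcher to the configured
// filteringSelector, handling the empty-selector case.
local withMatcher(config, matcher) = config {
  filteringSelector: if config.filteringSelector != ''
                     then config.filteringSelector + ',' + matcher
                     else matcher,
};

// The alert expression above could then close with:
//   ||| % withMatcher(this.config, 'code=~"4.."'),
```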
Dasomeone left a comment:
Looks good to me, thanks for all the iteration on this one, Keith!
aalhour left a comment:
LGTM, thanks a lot!
| "label": "Level", | ||
| "multi": true, | ||
| "name": "level", | ||
| "query": "label_values({,job=~\"$job\",couchdb_cluster=~\"$couchdb_cluster\",cluster=~\"$cluster\"}, level)", |
@Dasomeone @aalhour I'm noticing that with the empty filteringSelector we get this invalid label matcher (note the leading comma inside the braces). Is this something we're okay with, expecting that people will probably overwrite the filteringSelector? Otherwise, setting it to something like job=~".+" could also be appropriate, maybe?
This only affects the logs dashboards as far as I can tell.
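If we went with the non-empty default, a hypothetical sketch:

```jsonnet
// Hypothetical default: a match-all matcher instead of an empty string,
// so rendered queries never contain a dangling comma.
filteringSelector: 'job=~".+"',

// The variable query above would then render as:
// label_values({job=~".+",job=~"$job",couchdb_cluster=~"$couchdb_cluster",cluster=~"$cluster"}, level)
```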
This modernizes the CouchDB Mixin to use newer libraries.
Overview: (dashboard screenshots omitted)
Nodes: (dashboard screenshots omitted)
Logs: (dashboard screenshot omitted)