-
Notifications
You must be signed in to change notification settings - Fork 176
chore: Apache HBase modernization #1537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Apache HBase modernization #1537
Conversation
…pache-hbase-modernization
Dasomeone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One teensy comment from my end, otherwise looks fantastic!
aalhour
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few requests for changes, below is a summary:
- Do we want to add
{%(filteringSelector)s}to the alerts below? - There are a few raw signals in regionserver and cluster files that we might want to change to counter type with a
rangeFunction: 'rate'or convert the query torate(...), orincrease(...). - There is a counter metric (
totalRequestRate) that might want arangeFunction: 'rate'added to it - I noticed that the
masterQueueSizemetric has a weird unit ('decbytes') but it's a number (total items in a queue), shouldn't it beshort? - There are a few duplicate signals between regionserver and cluster files with the same key, metric but slightly different queries (outlined below)
| { | ||
| alert: 'HBaseDeadRegionServer', | ||
| expr: ||| | ||
| server_num_dead_region_servers > %(alertsDeadRegionServer)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use the filteringSelector?
apache-hbase-mixin/alerts.libsonnet
Outdated
| { | ||
| alert: 'HBaseOldRegionsInTransition', | ||
| expr: ||| | ||
| 100 * assignment_manager_rit_count_over_threshold / clamp_min(assignment_manager_rit_count, 1) > %(alertsOldRegionsInTransition)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use the filteringSelector?
apache-hbase-mixin/alerts.libsonnet
Outdated
| { | ||
| alert: 'HBaseHighMasterAuthFailRate', | ||
| expr: ||| | ||
| 100 * rate(master_authentication_failures[5m]) / (clamp_min(rate(master_authentication_successes[5m]), 1) + clamp_min(rate(master_authentication_failures[5m]), 1)) > %(alertsHighMasterAuthFailRate)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use the filteringSelector?
apache-hbase-mixin/alerts.libsonnet
Outdated
| { | ||
| alert: 'HBaseHighRSAuthFailRate', | ||
| expr: ||| | ||
| 100 * rate(region_server_authentication_failures[5m]) / (clamp_min(rate(region_server_authentication_successes[5m]), 1) + clamp_min(rate(region_server_authentication_failures[5m]), 1)) > %(alertsHighRSAuthFailRate)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use the filteringSelector?
| unit: 'reqps', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'sum by(' + groupAggList + ') (master_authentication_successes{%(queriesSelector)s})', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example: This signal reads as "Rate of ..." but the query doesn't use the rate(...) function, which most likely will lead them to be displayed as as raw counter values instead of rates.
| unit: 'reqps', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_successes{%(queriesSelector)s}[$__rate_interval]))', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a raw metric that uses a rate function correctly but it's also a duplicate of the same signal and same metric in cluster.libsonnet file but that one doesn't use the rate function as this one. Can we remove it from the clusters signals?
| unit: 'reqps', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'sum by(' + groupAggList + ') (region_server_authentication_successes{%(queriesSelector)s})', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate of the same signal with the same metric (but different query) in regionserver.libsonnet: https://github.com/grafana/jsonnet-libs/pull/1537/files#r2560283881
| unit: 'reqps', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'sum by(' + groupAggList + ') (region_server_authentication_failures{%(queriesSelector)s})', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another duplicate of the same signal in regionserver.libsonnet file.
| unit: 'reqps', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_failures{%(queriesSelector)s}[$__rate_interval]))', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a correct raw signal that uses a rate function but it's a duplicate of another one in cluster.libsonnet, can we remove the other one from the cluster signals file?
apache-hbase-mixin/panels.libsonnet
Outdated
| g.panel.table.standardOptions.withLinks([ | ||
| { | ||
| title: '', | ||
| url: '/d/apachehbase-regionserver-overview?from=${__from}&to=${__to}&var-instance=${__data.fields["RegionServer"]}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to use the links from links.libsonnet?
…pache-hbase-modernization
Dasomeone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last little thing in addition to @aalhour's review. Would also be good if we can get the existing comment threads resolved (even if they already are)
| name: 'Master queue size', | ||
| type: 'gauge', | ||
| description: 'The size of the queue of requests, operations, and tasks to be processed by the master.', | ||
| unit: 'decbytes', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it's number of items in queue, and if we want to display actual byte size we need to multiple by the block size.
I think @aalhour is correct that the original intent was to use just number of items in queue since blocksize is configurable based on file system. Let's switch this to short
…pache-hbase-modernization
apache-hbase-mixin/config.libsonnet
Outdated
| { | ||
| local this = self, | ||
| filteringSelector: 'job="integrations/apache-hbase"', | ||
| filteringSelector: 'job=~".+"', // set to apply static filters to all queries and alerts, i.e. job="bar" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pull latest logs-lib into the vendor, you should be able to get away with just an empty string here, but this is fine too :)
Dasomeone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Teensy tiny comment on the vendoring, otherwise good to go! Looks great, thanks!
…pache-hbase-modernization
This PR modernizes the Apache HBase mixin to use commonlib and g.libsonnet to generate its dashboards. This will make future stylizations easier to accomplish :)
Feel free to use this PR to generate a Sample environment: grafana/integration-sample-apps#102


Cluster Overview:
RegionServer Overview:


Logs:
