Skip to content

Conversation

@schmikei
Copy link
Contributor

@schmikei schmikei commented Nov 6, 2025

This PR modernizes the Apache HBase mixin to use commonlib and g.libsonnet to generate its dashboards. This will make future stylizations easier to accomplish :)

Feel free to use this PR to generate a Sample environment: grafana/integration-sample-apps#102
Cluster Overview:
image
image

RegionServer Overview:
image
image

Logs:
image

@schmikei schmikei marked this pull request as ready for review November 6, 2025 19:50
@schmikei schmikei requested a review from a team as a code owner November 6, 2025 19:50
Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One teensy comment from my end, otherwise looks fantastic!

Copy link
Contributor

@aalhour aalhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few requests for changes, below is a summary:

  • Do we want to add {%(filteringSelector)s} to the alerts below?
  • There are a few raw signals in regionserver and cluster files that we might want to change to counter type with a rangeFunction: 'rate' or convert the query to rate(...), or increase(...).
  • There is a counter metric (totalRequestRate) that might want a rangeFunction: 'rate' added to it
  • I noticed that the masterQueueSize metric has a weird unit ('decbytes') but it's a number (total items in a queue), shouldn't it be short?
  • There are a few duplicate signals between regionserver and cluster files with the same key, metric but slightly different queries (outlined below)

{
alert: 'HBaseDeadRegionServer',
expr: |||
server_num_dead_region_servers > %(alertsDeadRegionServer)s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseOldRegionsInTransition',
expr: |||
100 * assignment_manager_rit_count_over_threshold / clamp_min(assignment_manager_rit_count, 1) > %(alertsOldRegionsInTransition)s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseHighMasterAuthFailRate',
expr: |||
100 * rate(master_authentication_failures[5m]) / (clamp_min(rate(master_authentication_successes[5m]), 1) + clamp_min(rate(master_authentication_failures[5m]), 1)) > %(alertsHighMasterAuthFailRate)s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseHighRSAuthFailRate',
expr: |||
100 * rate(region_server_authentication_failures[5m]) / (clamp_min(rate(region_server_authentication_successes[5m]), 1) + clamp_min(rate(region_server_authentication_failures[5m]), 1)) > %(alertsHighRSAuthFailRate)s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (master_authentication_successes{%(queriesSelector)s})',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: This signal reads as "Rate of ..." but the query doesn't use the rate(...) function, which most likely will lead them to be displayed as as raw counter values instead of rates.

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_successes{%(queriesSelector)s}[$__rate_interval]))',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a raw metric that uses a rate function correctly but it's also a duplicate of the same signal and same metric in cluster.libsonnet file but that one doesn't use the rate function as this one. Can we remove it from the clusters signals?

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (region_server_authentication_successes{%(queriesSelector)s})',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of the same signal with the same metric (but different query) in regionserver.libsonnet: https://github.com/grafana/jsonnet-libs/pull/1537/files#r2560283881

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (region_server_authentication_failures{%(queriesSelector)s})',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another duplicate of the same signal in regionserver.libsonnet file.

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_failures{%(queriesSelector)s}[$__rate_interval]))',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correct raw signal that uses a rate function but it's a duplicate of another one in cluster.libsonnet, can we remove the other one from the cluster signals file?

g.panel.table.standardOptions.withLinks([
{
title: '',
url: '/d/apachehbase-regionserver-overview?from=${__from}&to=${__to}&var-instance=${__data.fields["RegionServer"]}',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to use the links from links.libsonnet?

@schmikei schmikei enabled auto-merge (squash) December 11, 2025 16:18
Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last little thing in addition to @aalhour's review. Would also be good if we can get the existing comment threads resolved (even if they already are)

name: 'Master queue size',
type: 'gauge',
description: 'The size of the queue of requests, operations, and tasks to be processed by the master.',
unit: 'decbytes',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it's number of items in queue, and if we want to display actual byte size we need to multiple by the block size.

I think @aalhour is correct that the original intent was to use just number of items in queue since blocksize is configurable based on file system. Let's switch this to short

@schmikei schmikei disabled auto-merge December 16, 2025 20:24
{
local this = self,
filteringSelector: 'job="integrations/apache-hbase"',
filteringSelector: 'job=~".+"', // set to apply static filters to all queries and alerts, i.e. job="bar"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you pull latest logs-lib into the vendor, you should be able to get away with just an empty string here, but this is fine too :)

Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Teensy tiny comment on the vendoring, otherwise good to go! Looks great, thanks!

@schmikei schmikei merged commit a4f2a73 into grafana:master Dec 17, 2025
11 checks passed
@schmikei schmikei deleted the chore/apache-hbase-modernization branch December 17, 2025 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants