@schmikei
Contributor

Updates the IBM MQ mixin to use more modern libraries

IBM MQ cluster overview

image

IBM MQ queue manager overview
image
image

IBM MQ queue overview
image
image

IBM MQ topic overview

I couldn't figure out how to quickly generate topic metrics, so I mostly validated the queries via git diff.

image image

IBM MQ logs
image

Metrics should be flowing to the shared Grafana instance on Nov 25 12pm-5pm EST :)

@schmikei schmikei force-pushed the ibm-mq-modernization branch from b01900a to 321be7d Compare November 25, 2025 21:29
@schmikei schmikei marked this pull request as ready for review November 25, 2025 21:31
@schmikei schmikei requested a review from a team as a code owner November 25, 2025 21:31
@Dasomeone Dasomeone self-assigned this Dec 4, 2025
Member

@Dasomeone Dasomeone left a comment

Couple minor comments here, overall looks great

*/

topicMessagesReceived:
commonlib.panels.generic.timeSeries.base.new('Topic messages received', targets=[
Member

Bit of a recurring thing for the topics, but I think having an override here for "no data" showing 0 instead might be valid. If you've concerns I'm happy to hear them too, not 100% set on it

Member

Wait I just scrolled a bit further in the screenshots, and I noticed that the subscription status table does in fact show two topics:

  • dev/telemetry/metrics
  • dev/telemetry/+

Should these not show up under the topic queries?

image

Contributor Author

I'm assuming the topic traffic I tried to simulate did not generate load for these metrics. As noted in the PR description, I mostly validated these panels via the git diff, since I couldn't quickly get metrics loaded. Let me know if you want me to look deeper into it!

As for the noValue, are you suggesting something like this even for a timeSeries:

image

I suppose I could also do an or vector(0) for those panels? I think it makes sense for stat panels maybe though?

Member

Ah no I specifically meant using the no value standard options override
image

We should try to avoid using or vector(0) outside of testing as it causes a lot of visualisation artefacts

Also, in terms of testing, that makes sense. I'd missed the git diff part for the topic panels. All good, I know that's a hard one to generate load on. If you're confident the timeseries panels match the previous version (per the git diff), then I'm happy with it.
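
For reference, a minimal sketch of the no-value override discussed above, written as a plain Jsonnet patch on the panel object. It assumes the standard Grafana panel JSON layout, where `fieldConfig.defaults.noValue` holds the text shown when a query returns nothing. The `panel` stub, its unit, and the `withNoValueZero` name are hypothetical; in the mixin the same field would more likely be set through the grafonnet standardOptions helpers.

```jsonnet
// Hypothetical stand-in for a panel built elsewhere (e.g. topicMessagesReceived).
local panel = {
  type: 'timeseries',
  title: 'Topic messages received',
  fieldConfig: { defaults: { unit: 'short' } },  // unit chosen only for illustration
};

// Patch that displays "0" when no series is returned.
// The query itself is left untouched, unlike `or vector(0)`.
local withNoValueZero = {
  fieldConfig+: { defaults+: { noValue: '0' } },
};

// The `+:` fields merge into the existing defaults instead of replacing them.
panel + withNoValueZero
```

Because this is a display-only override, it does not add a synthetic series to the query result, which is the concern raised about `or vector(0)`.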

Contributor

@aalhour aalhour left a comment

A few minor changes.

summary: 'There are expired messages, which imply that application resilience is failing.',
description:
(
'The number of expired messages in the {{$labels.qmgr}} is {{$labels.value}} which is above the threshold of %(alertsExpiredMessages)s.'
Contributor

Shouldn't the description reference the $value of the metric? It probably needs formatting as well: {{ printf "%.0f" $value }}, WDYT?
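
A sketch of how that could look in the alert's Jsonnet, assuming the description is still formatted with `%` against a config object as in the snippet above. Jsonnet's `%` operator treats `%` as a format character, so the one inside the printf verb has to be doubled. The `config` object and its threshold value are hypothetical.

```jsonnet
// Hypothetical config; the real mixin supplies its own threshold.
local config = { alertsExpiredMessages: 2 };

{
  annotations: {
    summary: 'There are expired messages, which imply that application resilience is failing.',
    // `%%.0f` is emitted as `%.0f`, so Prometheus receives
    // {{ printf "%.0f" $value }} and renders the sample value without decimals.
    description: (
      'The number of expired messages in the {{$labels.qmgr}} is {{ printf "%%.0f" $value }} which is above the threshold of %(alertsExpiredMessages)s.'
    ) % config,
  },
}
```

The same doubling already appears in the disk space and CPU descriptions, where `%%` produces a literal percent sign.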

summary: 'Stale messages have been detected.',
description:
(
'A stale message with an age of {{$labels.value}} has been sitting in the {{$labels.queue}} which is above the threshold of %(alertsStaleMessagesSeconds)ss.'
Contributor

Same comment about referencing {{ $value }} instead of {{ $labels.value }}.

summary: 'There is limited disk available for a queue manager.',
description:
(
'The amount of disk space available for {{$labels.qmgr}} is at {{$labels.value}}%% which is below the threshold of %(alertsLowDiskSpace)s%%.'
Contributor

Same comment about referencing {{ $value }} instead of {{ $labels.value }}.

summary: 'There is a high CPU usage estimate for a queue manager.',
description:
(
'The amount of CPU usage for the queue manager {{$labels.qmgr}} is at {{$labels.value}}%% which is above the threshold of %(alertsHighQueueManagerCpuUsage)s%%.'
Contributor

Same comment about referencing {{ $value }} instead of {{ $labels.value }}.

name: 'Time on queue',
type: 'gauge',
description: 'The average time messages spent on the queue.',
unit: 'µs',
Contributor

The metric name ends in _seconds but the unit is set to microseconds. Can you please change it to 's'?

Contributor Author

I do think this one needs updating though, thanks for the catch 👍

signals.queueManager.queueOperationsMqput.asTarget(),
])
+ g.panel.timeSeries.panelOptions.withDescription('The number of queue operations of the queue manager.')
+ g.panel.timeSeries.standardOptions.withUnit('operations')
Contributor

I don't think 'operations' is a unit that we support in jsonnet-libs/common-lib. Can we use 'short' instead?

signals.queue.mqputMqput1Count.asTarget(),
])
+ g.panel.timeSeries.panelOptions.withDescription('The number of queue operations of the queue manager.')
+ g.panel.timeSeries.standardOptions.withUnit('operations')
Contributor

I don't think 'operations' is a unit that we support in jsonnet-libs/common-lib. Can we use 'short' instead?

Contributor

Several signals have the unit 'operations' and I don't think it is a supported unit in jsonnet-libs/common-lib. Can we use 'short' instead?

Contributor

Similar to queue signals.

Several signals have the unit 'operations' and I don't think it is a supported unit in jsonnet-libs/common-lib. Can we use 'short' instead?

Contributor

Similar to the two other signals files.

Several signals have the unit 'operations' and I don't think it is a supported unit in jsonnet-libs/common-lib. Can we use 'short' instead?

Member

operations is preferable here actually. If Grafana doesn't recognise a unit, it's treated as a custom string unit, so it'll be just fine

@aalhour aalhour self-assigned this Dec 11, 2025
Member

@Dasomeone Dasomeone left a comment

Couple more comments, overall I'm happy with it!

{
alert: 'IBMMQLowDiskSpace',
expr: |||
sum without (description,hostname,instance,job,platform) (ibmmq_qmgr_queue_manager_file_system_free_space_percentage{%(filteringSelector)s}) <= %(alertsLowDiskSpace)s
Member

Suggested change
- sum without (description,hostname,instance,job,platform) (ibmmq_qmgr_queue_manager_file_system_free_space_percentage{%(filteringSelector)s}) <= %(alertsLowDiskSpace)s
+ sum without (description,hostname,instance,job,platform) (ibmmq_qmgr_queue_manager_file_system_free_space_percentage{%(filteringSelector)s}) < %(alertsLowDiskSpace)s

Comment on lines 27 to 28


Member

Suggested change

Comment on lines 32 to 33


Member

Suggested change
