Skip to content

Conversation

@schmikei
Copy link
Contributor

@schmikei schmikei commented Nov 11, 2025

Modernizes the Apache Airflow mixin to use g.libsonnet/signals/common-lib.

As of submission most validation was mostly via the git diff.

@schmikei
Copy link
Contributor Author

Working on getting some improved loadgen for this but think the Mixin is probably good to review for general guidance

@schmikei schmikei marked this pull request as ready for review November 25, 2025 21:34
@schmikei schmikei requested a review from a team as a code owner November 25, 2025 21:34
Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally layout looks fine, but hard to review given current sample-app and lack of screenshots. Are we missing any load-gen for the sample-app?

Happy to give this another look this week

dashboardTimezone: 'default',
dashboardRefresh: '1m',
local this = self,
filteringSelector: 'job="integrations/apache-airflow"',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the job label

alertsCriticalDAGScheduleDelayLevel: 60, //s
alertsCriticalFailedDAGs: 0,
// additional params
dashboardPeriod: 'now-6h',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 hour period may be a bit long

new(this): {
apacheAirflowOverview:
link.link.new(this.config.dashboardNamePrefix + ' overview', '/d/' + this.grafana.dashboards['apache-airflow-overview.json'].uid)
+ link.link.options.withKeepTime(true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
+ link.link.options.withKeepTime(true),
+ link.link.options.withKeepTime(true),
+ link.dashboards.options.withIncludeVars(true)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Struggled to review as the mixin didn't seem to use metrics provided by the sample app all that much

Image

Comment on lines +34 to +59
g.dashboard.variable.query.new('dag_id')
+ g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus)
+ g.dashboard.variable.query.queryTypes.withLabelValues('dag_id', 'airflow_dagrun_duration_success_sum{job=~"$job", instance=~"$instance"}')
+ g.dashboard.variable.query.generalOptions.withLabel('DAG ID')
+ g.dashboard.variable.query.selectionOptions.withMulti(true)
+ g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+')
+ g.dashboard.variable.query.refresh.onLoad()
+ g.dashboard.variable.query.refresh.onTime(),

g.dashboard.variable.query.new('task_id')
+ g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus)
+ g.dashboard.variable.query.queryTypes.withLabelValues('task_id', 'airflow_task_finish_total{job=~"$job", instance=~"$instance"}')
+ g.dashboard.variable.query.generalOptions.withLabel('Task')
+ g.dashboard.variable.query.selectionOptions.withMulti(true)
+ g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+')
+ g.dashboard.variable.query.refresh.onLoad()
+ g.dashboard.variable.query.refresh.onTime(),

g.dashboard.variable.query.new('state')
+ g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus)
+ g.dashboard.variable.query.queryTypes.withLabelValues('state', 'airflow_task_finish_total{job=~"$job", instance=~"$instance"}')
+ g.dashboard.variable.query.generalOptions.withLabel('State')
+ g.dashboard.variable.query.selectionOptions.withMulti(true)
+ g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+')
+ g.dashboard.variable.query.refresh.onLoad()
+ g.dashboard.variable.query.refresh.onTime(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three template variables aren't showing any values in the sample-app. Missing load-gen perhaps?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, the metrics don't seem to be present

dagFileParsingErrorsPanel:
commonlib.panels.generic.timeSeries.base.new('DAG file parsing errors', targets=[signals.overview.dagParsingErrors.asTarget()])
+ g.panel.timeSeries.panelOptions.withDescription('The number of errors from trying to parse DAG files in an Apache Airflow system.')
+ g.panel.timeSeries.standardOptions.withUnit('none'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of none, use short where possible

groupLabels: this.groupLabels,
instanceLabels: this.instanceLabels,
enableLokiLogs: this.enableLokiLogs,
legendCustomTemplate: legendCustmoTemplate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo on legendCustmoTemplate

@Dasomeone Dasomeone self-assigned this Dec 16, 2025
Copy link
Contributor

@aalhour aalhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few important notes, in addition to Emily's comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use 'short' instead of 'none' for count metrics? Some examples:

  • dagParsingErrors
  • dagSuccessDurationCount
  • dagFailedDurationCount
  • taskFailures
  • taskSlaMisses

},
},

dagSuccessDurationSum: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

},
},

dagSuccessDurationCount: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

},
},

dagFailedDurationSum: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

},
},

dagFailedDurationCount: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

expr: |||
increase(airflow_dagrun_schedule_delay_sum{%(filteringSelector)s}[5m]) / clamp_min(increase(airflow_dagrun_schedule_delay_count{%(filteringSelector)s}[5m]),1) > %(alertsCriticalDAGScheduleDelayLevel)s
||| % this.config,
'for': '1m',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: this could cause a flapping alert, can we make it 5m?

expr: |||
increase(airflow_dagrun_duration_failed_count{%(filteringSelector)s}[5m]) > %(alertsCriticalFailedDAGs)s
||| % this.config,
'for': '1m',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this: this could cause a flapping alert, can we make it 5m?


slaIssuesPanel:
commonlib.panels.generic.timeSeries.base.new('SLA misses', targets=[
signals.overview.taskSlaMisses.asTarget() { interval: '2m' },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we hardcoding the 2m interval here?


taskFailuresPanel:
commonlib.panels.generic.timeSeries.base.new('Task failures', targets=[
signals.overview.taskFailures.asTarget() { interval: '2m' },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here: why are we hardcoding the 2m interval?


dagSuccessDurationPanel:
commonlib.panels.generic.timeSeries.base.new('DAG success duration', targets=[
signals.overview.dagAvgSuccessDuration.asTarget() { interval: '2m' },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here: why are we hardcoding the 2m interval?

@aalhour aalhour self-assigned this Dec 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants