-
Notifications
You must be signed in to change notification settings - Fork 176
Modernize the Apache Airflow mixin #1540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Working on getting some improved loadgen for this but think the Mixin is probably good to review for general guidance |
Dasomeone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally layout looks fine, but hard to review given current sample-app and lack of screenshots. Are we missing any load-gen for the sample-app?
Happy to give this another look this week
| dashboardTimezone: 'default', | ||
| dashboardRefresh: '1m', | ||
| local this = self, | ||
| filteringSelector: 'job="integrations/apache-airflow"', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the job label
| alertsCriticalDAGScheduleDelayLevel: 60, //s | ||
| alertsCriticalFailedDAGs: 0, | ||
| // additional params | ||
| dashboardPeriod: 'now-6h', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6 hour period may be a bit long
| new(this): { | ||
| apacheAirflowOverview: | ||
| link.link.new(this.config.dashboardNamePrefix + ' overview', '/d/' + this.grafana.dashboards['apache-airflow-overview.json'].uid) | ||
| + link.link.options.withKeepTime(true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| + link.link.options.withKeepTime(true), | |
| + link.link.options.withKeepTime(true), | |
| + link.dashboards.options.withIncludeVars(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| g.dashboard.variable.query.new('dag_id') | ||
| + g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus) | ||
| + g.dashboard.variable.query.queryTypes.withLabelValues('dag_id', 'airflow_dagrun_duration_success_sum{job=~"$job", instance=~"$instance"}') | ||
| + g.dashboard.variable.query.generalOptions.withLabel('DAG ID') | ||
| + g.dashboard.variable.query.selectionOptions.withMulti(true) | ||
| + g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+') | ||
| + g.dashboard.variable.query.refresh.onLoad() | ||
| + g.dashboard.variable.query.refresh.onTime(), | ||
|
|
||
| g.dashboard.variable.query.new('task_id') | ||
| + g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus) | ||
| + g.dashboard.variable.query.queryTypes.withLabelValues('task_id', 'airflow_task_finish_total{job=~"$job", instance=~"$instance"}') | ||
| + g.dashboard.variable.query.generalOptions.withLabel('Task') | ||
| + g.dashboard.variable.query.selectionOptions.withMulti(true) | ||
| + g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+') | ||
| + g.dashboard.variable.query.refresh.onLoad() | ||
| + g.dashboard.variable.query.refresh.onTime(), | ||
|
|
||
| g.dashboard.variable.query.new('state') | ||
| + g.dashboard.variable.query.withDatasourceFromVariable(vars.datasources.prometheus) | ||
| + g.dashboard.variable.query.queryTypes.withLabelValues('state', 'airflow_task_finish_total{job=~"$job", instance=~"$instance"}') | ||
| + g.dashboard.variable.query.generalOptions.withLabel('State') | ||
| + g.dashboard.variable.query.selectionOptions.withMulti(true) | ||
| + g.dashboard.variable.query.selectionOptions.withIncludeAll(true, '.+') | ||
| + g.dashboard.variable.query.refresh.onLoad() | ||
| + g.dashboard.variable.query.refresh.onTime(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These three template variables aren't showing any values in the sample-app. Missing load-gen perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, the metrics don't seem to be present
| dagFileParsingErrorsPanel: | ||
| commonlib.panels.generic.timeSeries.base.new('DAG file parsing errors', targets=[signals.overview.dagParsingErrors.asTarget()]) | ||
| + g.panel.timeSeries.panelOptions.withDescription('The number of errors from trying to parse DAG files in an Apache Airflow system.') | ||
| + g.panel.timeSeries.standardOptions.withUnit('none'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of none, use short where possible
| groupLabels: this.groupLabels, | ||
| instanceLabels: this.instanceLabels, | ||
| enableLokiLogs: this.enableLokiLogs, | ||
| legendCustomTemplate: legendCustmoTemplate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo on legendCustmoTemplate
aalhour
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few important notes, in addition to Emily's comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use 'short' instead of 'none' for count metrics? Some examples:
dagParsingErrorsdagSuccessDurationCountdagFailedDurationCounttaskFailurestaskSlaMisses
| }, | ||
| }, | ||
|
|
||
| dagSuccessDurationSum: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
| }, | ||
| }, | ||
|
|
||
| dagSuccessDurationCount: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
| }, | ||
| }, | ||
|
|
||
| dagFailedDurationSum: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
| }, | ||
| }, | ||
|
|
||
| dagFailedDurationCount: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
| expr: ||| | ||
| increase(airflow_dagrun_schedule_delay_sum{%(filteringSelector)s}[5m]) / clamp_min(increase(airflow_dagrun_schedule_delay_count{%(filteringSelector)s}[5m]),1) > %(alertsCriticalDAGScheduleDelayLevel)s | ||
| ||| % this.config, | ||
| 'for': '1m', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: this could cause a flapping alert, can we make it 5m?
| expr: ||| | ||
| increase(airflow_dagrun_duration_failed_count{%(filteringSelector)s}[5m]) > %(alertsCriticalFailedDAGs)s | ||
| ||| % this.config, | ||
| 'for': '1m', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this: this could cause a flapping alert, can we make it 5m?
|
|
||
| slaIssuesPanel: | ||
| commonlib.panels.generic.timeSeries.base.new('SLA misses', targets=[ | ||
| signals.overview.taskSlaMisses.asTarget() { interval: '2m' }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we hardcoding the 2m interval here?
|
|
||
| taskFailuresPanel: | ||
| commonlib.panels.generic.timeSeries.base.new('Task failures', targets=[ | ||
| signals.overview.taskFailures.asTarget() { interval: '2m' }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here: why are we hardcoding the 2m interval?
|
|
||
| dagSuccessDurationPanel: | ||
| commonlib.panels.generic.timeSeries.base.new('DAG success duration', targets=[ | ||
| signals.overview.dagAvgSuccessDuration.asTarget() { interval: '2m' }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here: why are we hardcoding the 2m interval?

Modernizes the Apache Airflow mixin to use g.libsonnet/signals/common-lib.
As of submission most validation was mostly via the git diff.