Description
Lately, the failure rate of pipelines due to system failures has been increasing. To give a concrete example from today, here is a pipeline which should be ✔️ but failed because of system failures:
On the same day we ran two pipelines on develop:
- https://gitlab.spack.io/spack/spack/-/pipelines/912139
- https://gitlab.spack.io/spack/spack/-/pipelines/911863
They are both ❌ due to system failures.
Whenever something like that happens on a PR, users have two options:
- Close and reopen the PR, to trigger the creation of a new merge commit, and a new set of pipelines
- Comment @spackbot run pipeline, to re-run all the pipelines
If the failure rate is high enough, there is a fair chance the procedure needs to be repeated a few times to get a ✔️ CI mark. This multiplies the resources we need to run pipelines, in particular for "generate" jobs, which are always re-run.
I guess the best solution from the user's perspective would be having a low failure rate but, in the absence of that, I wonder if we could add a new command1:
@spackbot re-run failed pipelines
that re-runs only pipelines that failed due to system errors. This should:
- Reduce the chance of a possible new failure
- Reduce the resources we need to get a given CI run ✔️
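A minimal sketch of how the bot could decide which pipelines to retry (this is not spackbot's actual implementation). GitLab's job API reports a failure_reason per job (e.g. runner_system_failure vs. script_failure), so the bot could fetch a failed pipeline's jobs and retry only when every failure looks systemic. The should_retry helper and the exact set of reasons below are assumptions for illustration:

```python
# Sketch: decide whether a failed pipeline is worth retrying because its
# failures are system failures rather than genuine build errors.
# The failure_reason values below are illustrative; GitLab reports values
# such as "runner_system_failure" and "script_failure" per job.

SYSTEM_FAILURE_REASONS = {
    "runner_system_failure",
    "stuck_or_timeout_failure",
    "scheduler_failure",
    "api_failure",
}


def should_retry(failed_jobs):
    """Return True if every failed job looks like a system failure.

    `failed_jobs` is a list of job dicts, as returned by
    GET /projects/:id/pipelines/:pipeline_id/jobs?scope=failed,
    each carrying a `failure_reason` field.
    """
    return bool(failed_jobs) and all(
        job.get("failure_reason") in SYSTEM_FAILURE_REASONS
        for job in failed_jobs
    )
```

For each pipeline where should_retry(...) is true, the bot could then call GitLab's POST /projects/:id/pipelines/:pipeline_id/retry endpoint, which re-runs only the failed jobs, so successful jobs (including "generate" jobs that already passed) are not repeated.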
Footnotes
1. Naming is tentative, any better choice is welcome