[BUG] Success Rate Incorrect - says 0% #1163

@christensenjairus

Version of Eraser

v1.3.1

Expected Behavior

I have multiple clusters running Eraser v1.3.1. We lowered our success ratio from 95% to a fairly low 80% because we couldn't get Eraser to mark the ImageJob as successful. Looking at the logs, there seems to be a bug in the success rate math that makes Eraser think the job is 0% successful when one or two pods fail in a strange way.

 {"level":"info","ts":1755535148.5651102,"logger":"controller","msg":"Marking job as failed","process":"imagejob-controller","success ratio":0.8,"actual ratio":0}

In reality, the job had 272 successful nodes and one node that caused the pod to reach an outOfCpu state. The same thing happened on other clusters where nodes were under memory pressure instead of CPU pressure.

Expected behavior: the ImageJob is marked as successful (as it's currently >99% successful) and the pods are cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnSuccess set to 0s).
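
For reference, here is the arithmetic I'd expect for the numbers above, as a minimal Go sketch. The helpers `successRatio` and `meetsThreshold` are hypothetical names made up for this illustration and are not taken from Eraser's source; the only inputs are the 272 successful nodes, the one failed node, and the configured 0.80 ratio.

```go
package main

import "fmt"

// successRatio is a hypothetical helper (NOT Eraser's actual code).
// It returns the fraction of pods that succeeded, converting to
// float64 before dividing so the result is not truncated.
func successRatio(succeeded, failed int) float64 {
	total := succeeded + failed
	if total == 0 {
		return 0
	}
	return float64(succeeded) / float64(total)
}

// meetsThreshold is a hypothetical helper that reports whether the
// actual ratio reaches the configured successRatio.
func meetsThreshold(actual, required float64) bool {
	return actual >= required
}

func main() {
	actual := successRatio(272, 1) // 272 successful nodes, 1 failed node

	fmt.Printf("actual ratio: %.4f\n", actual)                // prints "actual ratio: 0.9963"
	fmt.Println("meets 0.80:", meetsThreshold(actual, 0.80)) // prints "meets 0.80: true"
	fmt.Println("meets 0.95:", meetsThreshold(actual, 0.95)) // prints "meets 0.95: true"
}
```

With these inputs the job would pass both the current 80% threshold and the original 95% one, so the "actual ratio":0 in the log does not match the pod counts.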

Actual Behavior

Actual behavior: the ImageJob fails with a 0% success rate and the pods aren't cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnFailure set to 5h).

Steps To Reproduce

K8s v1.32.6
Eraser helm chart v1.3.1

helm values:

runtimeConfig:
  manager:
    nodeFilter:
      type: exclude
      selectors:
        - eraser.sh/exclude-node # exclude nodes with this label
    scheduling:
      repeatInterval: "6h" # default is 24h
    imageJob:
      successRatio: 0.80 # 80% success ratio for image jobs to be considered 'successful'. Needs to be lower than 100% to account for cpu/memory pressure that causes the job to fail occasionally.
      cleanup:
        delayOnSuccess: "0s" # clean up pods immediately after success
        delayOnFailure: "5h" # keep the pods around for 5 hours after failure to allow for investigation

Then put a node under enough CPU/memory pressure that an ImageJob pod errors with outOfCpu or outOfMemory.

Are you willing to submit PRs to contribute to this bug fix?

  • Yes, I am willing to implement it.

Labels: bug (Something isn't working)
