[BUG] OOM error with long-running eager workflows #7239

@LanaMorkva

Flyte & Flytekit version

flyteidl 1.16.6
flytekit 1.16.14

Describe the bug

When an eager workflow runs for a long time, the flytekit Controller loops inside the _poll function. Every two seconds, it creates a new Deck object, which is appended to the decks list of the global Flyte context. This list grows without bound and eventually causes an OOM error if the workflow runs long enough.
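
To illustrate the growth pattern described above, here is a minimal sketch that does not use flytekit. StandInContext, StandInDeck, GLOBAL_CTX and poll are hypothetical names invented for this example; they only mimic the behavior reported in this issue (a deck registering itself in a shared list unless auto_add_to_deck=False) and are not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class StandInContext:
    """Hypothetical stand-in for the global Flyte context; not flytekit code."""
    decks: list = field(default_factory=list)


GLOBAL_CTX = StandInContext()


class StandInDeck:
    """Hypothetical stand-in that mimics the registration behavior described above."""

    def __init__(self, name, html, auto_add_to_deck=True):
        self.name = name
        self.html = html
        if auto_add_to_deck:
            # Every new deck is kept alive by the shared list.
            GLOBAL_CTX.decks.append(self)


def poll(iterations, auto_add_to_deck=True):
    """Simulate the polling loop: one deck per tick (the real loop waits two seconds)."""
    for i in range(iterations):
        StandInDeck("Eager Executions", f"<p>tick {i}</p>", auto_add_to_deck=auto_add_to_deck)
    return len(GLOBAL_CTX.decks)


leaked = poll(1_000)                          # list length equals the number of ticks
GLOBAL_CTX.decks.clear()
stable = poll(1_000, auto_add_to_deck=False)  # list stays empty
print(leaked, stable)
```

In this model the list length tracks the number of poll ticks exactly, which is why a larger pod only delays the failure.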

This is a critical bug because it cannot be resolved by simply increasing the RAM of the workflow pod: at a larger execution scale, the issue will eventually recur. This limitation significantly restricts the use of Flyte for long-running orchestration tasks.

Location of the bug:
flytekit/core/worker_queue.py, line 329

Proposed Solution:
The issue can be resolved by setting auto_add_to_deck=False during deck creation. However, I am uncertain about the potential side effects of this change:

if len(self.entries) > 0:
    with self.entries_lock:
        html = self.render_html()
        FlyteContextManager.push_context(self.remote._ctx)
        # Setting auto_add_to_deck=False prevents the object from being appended to the global context
        Deck("Eager Executions", html, auto_add_to_deck=False).publish()
        FlyteContextManager.pop_context()

Expected behavior

The RAM requirements of the pod running the Eager workflow orchestration task should remain stable. Memory usage should not scale with the number of tasks spawned in the workflow or the total execution time.
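
One way to check this expectation is to compare retained memory with and without the accumulation. The sketch below uses Python's standard tracemalloc module on a simplified model. The _registry list and run_ticks function are hypothetical stand-ins for the context's decks list and the poll loop, and the 10 KB payload size is an arbitrary assumption rather than a measured deck size.

```python
import tracemalloc

_registry = []  # hypothetical stand-in for the context's decks list


def run_ticks(ticks, retain):
    """Return the bytes still allocated after `ticks` simulated poll iterations."""
    _registry.clear()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for i in range(ticks):
        html = f"{i:08d}" + "x" * 10_000  # unique payload of roughly 10 KB per tick
        if retain:
            _registry.append(html)        # models auto_add_to_deck=True
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


retained_growth = run_ticks(2_000, retain=True)
released_growth = run_ticks(2_000, retain=False)
print(f"retained: {retained_growth} bytes, released: {released_growth} bytes")
```

In this model the retained case grows in proportion to the number of ticks, while the released case holds at most about one payload at a time. A similar measurement against the real Controller would be needed to confirm that the proposed change has the same effect there.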

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes


Labels: bug, untriaged
