Flyte & Flytekit version
flyteidl 1.16.6
flytekit 1.16.14
Describe the bug
In long-running eager workflows, the flytekit Controller loops inside the _poll function. Every two seconds, it creates a new Deck object and appends it to the decks list in the global Flyte context. This list grows without bound, so the pod eventually hits an OOM error if the workflow runs long enough.
This is a critical bug because it cannot be resolved by simply increasing the RAM of the workflow pod: at a larger execution scale, the issue will eventually recur. This limitation significantly restricts the use of Flyte for long-running orchestration tasks.
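To give a rough sense of scale, here is a back-of-the-envelope sketch of how much memory the retained decks could hold. The two-second interval comes from the description above; the 50 KB size of the rendered HTML is purely an assumed figure for illustration, and the function names are hypothetical.

```python
POLL_INTERVAL_S = 2  # poll interval taken from the bug description


def estimate_decks(runtime_s: int, poll_interval_s: int = POLL_INTERVAL_S) -> int:
    """Number of Deck objects retained if one is appended per poll."""
    return runtime_s // poll_interval_s


def estimate_retained_bytes(runtime_s: int, html_bytes: int) -> int:
    """Lower bound on retained memory: HTML payload only, ignoring object overhead."""
    return estimate_decks(runtime_s) * html_bytes


one_day_s = 24 * 60 * 60
# html_bytes=50_000 is an assumption, not a measured value.
print(estimate_decks(one_day_s))                              # 43200
print(estimate_retained_bytes(one_day_s, html_bytes=50_000))  # 2160000000
```

Under these assumptions a one-day run would retain about 2.16 GB of HTML alone. The real figure depends on the size of the rendered deck, which typically grows with the number of entries.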
Location of the bug:
flytekit/core/worker_queue.py, line 329
Proposed Solution:
The issue can be resolved by setting auto_add_to_deck=False during deck creation. However, I am uncertain about the potential side effects of this change:
if len(self.entries) > 0:
    with self.entries_lock:
        html = self.render_html()
    FlyteContextManager.push_context(self.remote._ctx)
    # Setting auto_add_to_deck=False prevents the object from being appended to the global context
    Deck("Eager Executions", html, auto_add_to_deck=False).publish()
    FlyteContextManager.pop_context()
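The effect of the flag can be illustrated with a small self-contained model. This is only a sketch: StandInContext and StandInDeck are hypothetical stand-ins that mimic the auto-add behavior described above, not flytekit's actual classes, and publish() is a no-op placeholder.

```python
class StandInContext:
    """Hypothetical stand-in for the global context; not flytekit's FlyteContext."""

    def __init__(self):
        self.decks = []


CTX = StandInContext()


class StandInDeck:
    """Hypothetical stand-in for Deck that models only the auto-add flag."""

    def __init__(self, name, html="", auto_add_to_deck=True):
        self.name = name
        self.html = html
        if auto_add_to_deck:
            # This reference is what keeps every deck alive for the whole run.
            CTX.decks.append(self)

    def publish(self):
        # Placeholder: the real publish step writes the deck out; here it does nothing.
        pass


def decks_after(polls, auto_add_to_deck):
    """Run `polls` simulated poll iterations and return the context list length."""
    CTX.decks.clear()
    for i in range(polls):
        StandInDeck("Eager Executions", f"<p>{i}</p>", auto_add_to_deck=auto_add_to_deck).publish()
    return len(CTX.decks)


print(decks_after(100, auto_add_to_deck=True))   # 100
print(decks_after(100, auto_add_to_deck=False))  # 0
```

In this model the deck is still published on every poll, but nothing holds on to it afterwards. The side effect to check in flytekit itself is whether any later code expects to find the "Eager Executions" deck in the context's decks list.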
Expected behavior
The RAM requirements of the pod running the Eager workflow orchestration task should remain stable. Memory usage should not scale with the number of tasks spawned in the workflow or the total execution time.
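One possible way to check this expectation is to compare traced allocations with and without retaining the per-poll HTML. The sketch below uses only the standard-library tracemalloc module and a plain list in place of the decks list; the 10 KB payload and the function name traced_growth are assumptions made for illustration.

```python
import tracemalloc


def traced_growth(polls: int, retain: bool) -> int:
    """Return the bytes still allocated after `polls` simulated poll iterations."""
    retained = []  # stands in for the global context's decks list
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for i in range(polls):
        # Assumed payload of roughly 10 KB per rendered deck.
        html = "<div>" + "x" * 10_000 + f"{i}</div>"
        if retain:
            retained.append(html)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


print(traced_growth(500, retain=True))   # grows roughly linearly with `polls`
print(traced_growth(500, retain=False))  # stays near the size of a single payload
```

A similar measurement wrapped around the real polling loop, before and after the change, would show whether memory stays flat as the execution time increases.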
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?