Concurrent runs overwrite intermediate asset data #31870

maurakeith · 2025-08-20T14:25:05Z

Hi there!

Summary

Concurrent runs of the same job are overwriting each other's intermediate asset data. When my sensor triggers multiple runs simultaneously, a downstream asset in one run loads the output of an upstream asset from another concurrent run, and the pipelines fail. I reproduced the issue in the simplified test pipeline below. Its logs confirm this: the second run's asset_b_test logs the value from the first run's asset_a_test instead of its own.

Example Code

Here is a simplified version of my code that reproduces the issue:

Sensor: `test_sensor.py`

Each time this sensor runs, it triggers 2 runs of my test_job. The run_key is a UUID, and the run_config passes another UUID as id_val, so both are guaranteed to be unique.

import uuid
from dagster import RunRequest, sensor
from app.job import test_job

@sensor(name="my_test_sensor",
        job=test_job,
        minimum_interval_seconds=600)
def test_sensor(context):
    for _ in range(2):
        run_config={'ops': {}}
        run_config["ops"]["asset_a_test"] = {
            "config":{"id_val": str(uuid.uuid4())}
        }
        yield RunRequest(run_key=str(uuid.uuid4()), run_config=run_config)

Assets: `test_assets.py`

I have 2 assets. asset_a_test reads id_val from the run config and returns it. asset_b_test receives the output of asset_a_test and returns f"Asset B received: {id_val}".

import logging
from dagster import asset, AssetIn

logger = logging.getLogger(__name__)

@asset(
    key="asset_a_test",
)
def asset_a(
    context
) -> str:
    val = context.op_config['id_val']
    logger.info(val)
    return val

@asset(
    key="asset_b_test",
    ins={"asset_a_test": AssetIn(key="asset_a_test")},
)
def asset_b(
    context,
    asset_a_test: str
) -> str:
    val = f"Asset B received: {asset_a_test}"
    logger.info(val)
    return val

Job: `job.py`

from dagster import define_asset_job, AssetSelection, load_assets_from_modules
from app import test_assets

all_test_assets = load_assets_from_modules([test_assets])
test_asset_selection = AssetSelection.assets(*all_test_assets)
test_job = define_asset_job(
    name="add_test_data",
    selection=test_asset_selection,
    description="Wrangle Test Data",
)

Definition: `definitions.py`

from dagster import Definitions
from app.job import all_test_assets, test_job
# The sensor function is test_sensor; "my_test_sensor" is only its registered name.
from app.test_sensor import test_sensor
defs = Definitions(
    assets=[*all_test_assets],
    jobs=[test_job],
    sensors=[test_sensor],
)

I see that the asset materializations from my runs are stored in a common location. That is, there isn't a run-specific location that asset materializations are stored in. asset_a_test and asset_b_test are simply overwritten with every run:

├── .dagster/
    └── storage/
        ├── asset_a_test
        └── asset_b_test
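To illustrate why a path keyed only by asset name is a problem under concurrency, here is a minimal stdlib-only sketch. It is not Dagster code: `shared_path`, `run_scoped_path`, and `simulate` are hypothetical helpers that mimic two runs writing asset_a_test before either reads it back.

```python
import tempfile
import uuid
from pathlib import Path


def shared_path(base: Path, run_id: str, asset_key: str) -> Path:
    # Path depends only on the asset key, like the storage layout above.
    return base / asset_key


def run_scoped_path(base: Path, run_id: str, asset_key: str) -> Path:
    # Hypothetical alternative: include the run ID in the path.
    return base / run_id / asset_key


def simulate(path_for) -> dict:
    """Interleave two runs: both write asset_a_test, then both read it back.

    Returns, per run, whether the value read back is the one that run wrote.
    """
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        runs = {f"run_{i}": str(uuid.uuid4()) for i in (1, 2)}
        # Step 1 of each run: write asset_a_test.
        for run_id, id_val in runs.items():
            target = path_for(base, run_id, "asset_a_test")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(id_val)
        # Step 2 of each run: asset_b_test loads asset_a_test.
        return {
            run_id: path_for(base, run_id, "asset_a_test").read_text() == id_val
            for run_id, id_val in runs.items()
        }


print(simulate(shared_path))      # {'run_1': False, 'run_2': True}
print(simulate(run_scoped_path))  # {'run_1': True, 'run_2': True}
```

With the shared path, the first run reads back the second run's value because the file was overwritten in between. With the run ID in the path, each run reads its own value.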

Troubleshooting Steps Taken

Used FilesystemIOManager to attempt to save asset materializations to a unique location
Defined a custom IO manager to attempt to save asset materializations to a unique location
Used Dagster's Output class with metadata to differentiate asset materializations
Reviewed definitions.py: I've confirmed that no custom IO manager is explicitly defined in my definitions.py file, so the pipelines are using Dagster's default. I've also verified that there is no difference in the IO manager configuration between my working and non-working pipelines.

It appears I need a way to make Dagster's IO manager store intermediate outputs in a run-specific path to avoid these conflicts. Any guidance on how to properly implement a custom IO manager or another solution would be greatly appreciated.

zyd14 · 2025-10-25T21:31:56Z

You could have a custom IO manager that just prefixes its output with the run ID taken from the OutputContext.

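import os
from typing import Any
from dagster import InputContext, IOManager, OutputContext

class MyIoManager(IOManager):
  def get_path(self, context: OutputContext) -> str:
    # One directory per run, so concurrent runs never share a file.
    return f"/some/path/{context.run_id}/{context.asset_key.to_user_string()}"
  def load_input(self, context: InputContext):
    # Resolve the path from the upstream output's context, which carries the run ID.
    input_path = self.get_path(context.upstream_output)
    with open(input_path, "r") as fin:
      ...
  def handle_output(self, context: OutputContext, obj: Any):
    output_path = self.get_path(context)
    # Create the run-specific directory before writing.
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "w") as fout:
      ...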
