Hello!
I am curious how exca might help with multi-stage pipelines in which several computationally intensive operations run sequentially on the same data, for example a pipeline that does data preprocessing, model training, and analysis of the results. Such a pipeline would benefit greatly from intermediate caching, so that multiple models don't each need to re-preprocess the data and multiple analyses don't need to retrain the model. My central question is: what is the best way to implement such a pipeline with exca so that it takes advantage of exca's caching functionality?
To lay out some possible implementations I am considering:
Idea 1:
If tasks can be allocated inside other tasks (an assumption I am not sure about: in my local testing I can allocate "slurm" tasks within a "local" task, but I am having issues nesting distributed slurm tasks), perhaps the whole pipeline could be nested. The last operation (in this example, the analysis step) would be the outer task, and the preprocessing and training tasks would execute within it and return their outputs into the execution stream of the current step.
The main problem I see with this approach is that the resources for the later steps must be allocated up front and then sit idle while the earlier pipeline steps compute, which is not desirable. The implementation would also be quite clunky, as you would essentially have to write the pipeline backwards.
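To make the nested structure of Idea 1 concrete, here is a minimal sketch in plain Python. It does not use exca's API: the `cached` helper, the `Preprocess`/`Train`/`Analysis` classes, and `CACHE_DIR` are all hypothetical names made up for illustration, and everything runs in one local process, so it only shows the call structure and the cache reuse, not the slurm allocation problem.

```python
import hashlib
import json
import pickle
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location
CALLS = []  # records which steps actually computed (illustration only)


def cached(step, compute):
    """Load the step's result from disk if present, else compute and store it."""
    # The key covers the step's own config AND its nested upstream configs.
    payload = json.dumps(
        {"step": type(step).__name__, "cfg": asdict(step)}, sort_keys=True
    )
    path = CACHE_DIR / (hashlib.sha256(payload.encode()).hexdigest()[:16] + ".pkl")
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = compute()
    CALLS.append(type(step).__name__)
    path.write_bytes(pickle.dumps(result))
    return result


@dataclass(frozen=True)
class Preprocess:
    scale: float = 2.0

    def run(self):
        return cached(self, lambda: [x * self.scale for x in range(5)])


@dataclass(frozen=True)
class Train:
    data: Preprocess = Preprocess()
    lr: float = 0.1

    def run(self):
        # The upstream step is pulled from inside this step ("backwards").
        return cached(self, lambda: sum(self.data.run()) * self.lr)


@dataclass(frozen=True)
class Analysis:
    model: Train = Train()

    def run(self):
        return cached(self, lambda: {"score": self.model.run()})
```

Calling `Analysis().run()` and then `Analysis(model=Train(lr=0.5)).run()` would recompute training and analysis for the second call but reuse the preprocessing result, since the nested `Preprocess` config is unchanged.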
Idea 2:
After a step in the pipeline finishes executing (and caching), it would return the path to its cached CacheDict file (I'm not quite sure how to access this from the TaskInfra object or from within the task, but it definitely looks possible). The outer code would then take that path and add it to the config of a later step, essentially telling it to "load this file". Later steps would manually initialize the CacheDict object to load the data from the previous step.
This approach has three problems: it relies on manually passing around paths to the cached data objects; modules get a somewhat fragile interface, since they require one of these deep cache directory paths in order to function properly; and we must still save to and read from disk between steps (though this may be unavoidable when moving between different slurm instances).
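Roughly, the path-passing flow of Idea 2 would look like the sketch below. It is plain Python with JSON files standing in for the cached objects; `PreprocessStep`, `TrainStep`, `run_pipeline`, and `WORK_DIR` are hypothetical names, and nothing here is exca's CacheDict or TaskInfra API.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path

WORK_DIR = Path(tempfile.mkdtemp())  # hypothetical stand-in for the cache folder


@dataclass
class PreprocessStep:
    scale: float = 2.0

    def run(self) -> Path:
        out = WORK_DIR / f"preprocess_scale{self.scale}.json"
        if not out.exists():  # skip the work when the output is already on disk
            out.write_text(json.dumps([x * self.scale for x in range(5)]))
        return out  # the step hands back a path, not the data itself


@dataclass
class TrainStep:
    input_path: str  # fragile: must point at a file written by PreprocessStep
    lr: float = 0.1

    def run(self) -> Path:
        data = json.loads(Path(self.input_path).read_text())
        out = WORK_DIR / f"train_lr{self.lr}.json"
        out.write_text(json.dumps({"weight": sum(data) * self.lr}))
        return out


def run_pipeline(scale: float, lr: float) -> dict:
    # The outer code threads each step's output path into the next config.
    data_path = PreprocessStep(scale=scale).run()
    model_path = TrainStep(input_path=str(data_path), lr=lr).run()
    return json.loads(model_path.read_text())
```

The `input_path` field is where the fragility shows up: `TrainStep` has no way to check that the file it is given really came from the matching preprocessing config.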
Idea 2 seems the most straightforward for what I am trying to do. I understand that the obvious problem with allowing command line arguments in this framework (to easily pass data around) is that they make the output much harder to cache: the output changes with the command line inputs, so any consistent caching system must also hash those arguments, which is non-trivial when they contain complex data structures.
Please let me know if this is something you have thought about or if you have any additional suggestions I might try!
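For reference, this is the kind of argument hashing I have in mind, as a minimal sketch that assumes the arguments are JSON-serializable. `config_uid` is a hypothetical helper, not an exca function, and I am not claiming this is how exca computes its identifiers.

```python
import hashlib
import json


def config_uid(config: dict) -> str:
    """Return a short, order-independent identifier for a nested config."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Two configs with the same content but different key order get the same identifier, while changing any nested value changes it. It breaks down for objects that JSON cannot represent (arrays, custom classes), which is the non-trivial part.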