When I try to run a training workload with `--open`, mlpstorage still says: "Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration."
I also get a crash later on:
dslik@frosta:~/scratch/mlp3/storage$ ./mlpstorage training run --open --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 128 --num-accelerators 2 --accelerator-type b200 --model retinanet --param dataset.num_files_train=425000 --file --data-dir /home/dslik/scratch --results-dir /home/dslik/scratch --allow-run-as-root
Setting attr from num_accelerators to 2
Hosts is: ['127.0.0.1']
Hosts is: ['127.0.0.1']
⠋ Validating environment... 0:00:00
2026-04-22 23:24:12|INFO: Environment validation passed
2026-04-22 23:24:12|STATUS: Benchmark results directory: /home/dslik/scratch/training/retinanet/run/20260422_232412
2026-04-22 23:24:12|INFO: Created benchmark run: training_run_retinanet_20260422_232412
2026-04-22 23:24:12|STATUS: Verifying benchmark run for training_run_retinanet_20260422_232412
2026-04-22 23:24:12|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-22 23:24:12|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 425000 (Parameter: Overrode Parameters)
2026-04-22 23:24:12|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 16550423, Actual: 425000)
2026-04-22 23:24:12|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='retinanet', run_datetime='20260422_232412')])
2026-04-22 23:24:12|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
snip
--------------------------------------------------------------------------
[OUTPUT] 2026-04-22T23:24:15.550253 Running DLIO [Training] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of
dataset to eliminate the caching effect!!!
Error executing job with overrides: ['workload=retinanet_b200', '++workload.dataset.num_files_train=425000',
'++workload.dataset.data_folder=/home/dslik/scratch/retinanet']
Error executing job with overrides: ['workload=retinanet_b200', '++workload.dataset.num_files_train=425000',
'++workload.dataset.data_folder=/home/dslik/scratch/retinanet']
Traceback (most recent call last):
File "/home/dslik/scratch/mlp3/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 482, in run_benchmark
benchmark.initialize()
File "/home/dslik/scratch/mlp3/storage/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
x = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/dslik/scratch/mlp3/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 206, in initialize
assert (num_subfolders == len(filenames))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError