Summary
Currently, the only way to persist a generated dataset is via the parquet batch files that the engine writes internally. Users who want a single consolidated file in a different format (JSONL, CSV, standard Parquet) have no first-class API for producing one.
Proposed solution
- Add DatasetCreationResults.export(path, format=), supporting the jsonl, csv, and parquet formats
- Add an --output-format / -f flag to the data-designer create CLI command; it writes dataset.<format> alongside the parquet batch files
- The default format is jsonl; the parameter is optional in both the Python API and the CLI
Usage
Python API:
results = data_designer.create(config, num_records=1000)
results.export("output.jsonl")  # default: jsonl
results.export("output.csv", format="csv")
results.export("output.parquet", format="parquet")
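For reviewers who want a feel for the format dispatch behind the export calls above, here is a minimal sketch. It is not the implementation in this PR: export_records is a hypothetical standalone helper, it takes a plain list of dicts rather than reading the engine's parquet batch files, and the parquet branch assumes pyarrow is installed.

```python
import csv
import json
from pathlib import Path

SUPPORTED_FORMATS = ("jsonl", "csv", "parquet")


def export_records(records, path, format="jsonl"):
    """Write a list of dict records to one consolidated file.

    Hypothetical helper for illustration only; the real
    DatasetCreationResults.export may be structured differently.
    """
    if format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {format!r}; expected one of {SUPPORTED_FORMATS}"
        )
    path = Path(path)
    if format == "jsonl":
        # One JSON object per line.
        with path.open("w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    elif format == "csv":
        # Column order: union of all keys, in first-seen order.
        fieldnames = list(dict.fromkeys(k for r in records for k in r))
        with path.open("w", encoding="utf-8", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
    else:
        # Assumption: pyarrow is available; imported lazily so the
        # jsonl and csv paths work without it.
        import pyarrow as pa
        import pyarrow.parquet as pq

        pq.write_table(pa.Table.from_pylist(records), str(path))
    return path
```

Validating the format up front means an unsupported value fails before any file is created, which is one reasonable behavior for the real method as well.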
CLI:
data-designer create config.yaml --output-format jsonl
data-designer create config.yaml -n 500 -f csv
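The flag handling in the CLI examples above could be sketched as follows. This uses argparse purely as a stand-in; the actual data-designer CLI may be built on a different framework. build_parser and output_filename are hypothetical names, and only the short -n option is declared because its long form is not shown here.

```python
import argparse

SUPPORTED_FORMATS = ("jsonl", "csv", "parquet")


def build_parser():
    """Build a stand-in parser for `data-designer create` (illustrative only)."""
    parser = argparse.ArgumentParser(prog="data-designer")
    subparsers = parser.add_subparsers(dest="command", required=True)

    create = subparsers.add_parser("create")
    create.add_argument("config")
    # Assumption: -n sets the number of records; no default is assumed here.
    create.add_argument("-n", dest="num_records", type=int, default=None)
    # Optional flag; falls back to jsonl when omitted.
    create.add_argument(
        "-f",
        "--output-format",
        choices=SUPPORTED_FORMATS,
        default="jsonl",
    )
    return parser


def output_filename(output_format):
    """Name of the consolidated file written next to the parquet batches."""
    return f"dataset.{output_format}"


if __name__ == "__main__":
    args = build_parser().parse_args(["create", "config.yaml", "-n", "500", "-f", "csv"])
    print(args.output_format, output_filename(args.output_format))
```

Restricting the flag with choices rejects an unsupported format at parse time, before any generation work starts.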