Add flatten_indices option to save_to_disk method #7862
Added flatten_indices parameter to control index flattening during dataset saving.
Solves #7861
This PR introduces a new optional argument, flatten_indices, to the save_to_disk methods in both Dataset and DatasetDict.
The change allows users to skip the expensive index-flattening step when saving datasets that already use index mappings (e.g., after filter() or shuffle()), resulting in significant speed improvements for large datasets while maintaining backward compatibility.
The absolute difference is modest at 100K rows, but the time saved grows with dataset size and becomes substantial at millions of rows.
This patch gives users control: they can disable flattening when they don't need it, avoiding unnecessary rewrites.
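To illustrate the trade-off, here is a minimal pure-Python sketch of a dataset that carries an index mapping. It is a toy model, not the library's implementation: `ToyDataset`, its `save` method, and the returned dictionary are hypothetical names made up for this example, and what the real `save_to_disk` writes when `flatten_indices=False` is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDataset:
    """Hypothetical stand-in: a row table plus an optional indices mapping."""
    rows: list
    indices: Optional[list] = None  # None means rows are read in storage order

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        return self.rows[i] if self.indices is None else self.rows[self.indices[i]]

    def filter(self, predicate):
        # Cheap: only records which positions survive; no rows are copied.
        current = range(len(self.rows)) if self.indices is None else self.indices
        return ToyDataset(self.rows, [j for j in current if predicate(self.rows[j])])

    def flatten_indices(self):
        # Costly on large data: every selected row is rewritten into a new table.
        return ToyDataset([self[i] for i in range(len(self))])

    def save(self, flatten_indices=True):
        # Returns what would be written instead of touching disk (assumption
        # made to keep the sketch self-contained).
        needs_flatten = flatten_indices and self.indices is not None
        ds = self.flatten_indices() if needs_flatten else self
        return {"rows": ds.rows, "indices": ds.indices}


ds = ToyDataset(list(range(10))).filter(lambda x: x % 2 == 0)
flat = ds.save(flatten_indices=True)   # rewrites the 5 selected rows
lazy = ds.save(flatten_indices=False)  # keeps all 10 rows plus the mapping
```

In this sketch the default stays `flatten_indices=True`, mirroring the backward-compatible behaviour the PR describes; skipping the rewrite trades a larger stored table for a faster save.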
@lhoestq WDYT?