
Conversation

@ArjunJagdale (Contributor) commented on Nov 12, 2025

Added flatten_indices parameter to control index flattening during dataset saving.
Solves #7861

This PR introduces a new optional argument, flatten_indices, to the save_to_disk methods in both Dataset and DatasetDict.

The change allows users to skip the expensive index-flattening step when saving datasets that already use index mappings (e.g., after filter() or shuffle()), resulting in significant speed improvements for large datasets while maintaining backward compatibility.

While not a huge absolute difference at 100K rows, the improvement scales significantly with larger datasets (millions of rows).

This patch gives users control — they can disable flattening when they don’t need it, avoiding unnecessary rewrites.
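For reference, here is a minimal sketch of how the new argument would be used (the flatten_indices keyword is the parameter proposed in this PR; the toy dataset and output paths are just illustrative):

from datasets import Dataset

# Any operation that builds an indices mapping, e.g. shuffle() or filter()
ds = Dataset.from_dict({'text': [f'sample {i}' for i in range(1000)]})
shuffled = ds.shuffle(seed=42)

# Default behavior (unchanged): indices are flattened before writing to disk
shuffled.save_to_disk('shuffled_flattened')

# New option: skip the flattening step and save the dataset as-is
shuffled.save_to_disk('shuffled_raw', flatten_indices=False)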

@lhoestq WDYT?

@ArjunJagdale (Contributor, Author) commented:

As mentioned by @KCKawalkar, I used the script below to test:

BEFORE PATCH -
TEST.PY:

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered')
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s")
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3030654.07 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 576296.61 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 310565.19 examples/s]
Baseline: 0.035s
Filtered: 0.323s
Slowdown: 813.4%

AFTER PATCH -
TEST.PY:

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered', flatten_indices=False)
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s") 
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3027482.12 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 468901.89 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 324036.36 examples/s]
Baseline: 0.036s
Filtered: 0.310s
Slowdown: 771.1%
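As a quick sanity check after the benchmark (a hedged sketch, assuming the patch keeps the output of save_to_disk loadable via load_from_disk when flatten_indices=False), one can reload the saved dataset and compare it to the in-memory one:

from datasets import load_from_disk

reloaded = load_from_disk('filtered')
assert len(reloaded) == len(filtered)
assert reloaded[0] == filtered[0]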

