
Conversation

@ArjunJagdale (Contributor) commented on Nov 12, 2025

Added flatten_indices parameter to control index flattening during dataset saving.
Solves #7861

This PR introduces a new optional argument, flatten_indices, to the save_to_disk methods in both Dataset and DatasetDict.

The change allows users to skip the expensive index-flattening step when saving datasets that already use index mappings (e.g., after filter() or shuffle()), resulting in significant speed improvements for large datasets while maintaining backward compatibility.

While not a huge absolute difference at 100K rows, the improvement scales significantly with larger datasets (millions of rows).

This patch gives users control — they can disable flattening when they don’t need it, avoiding unnecessary rewrites.
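For reference, here is a minimal sketch of how the new argument would be used (the flatten_indices keyword is the parameter proposed in this PR; the toy dataset and output paths are just illustrative):

from datasets import Dataset

# Any operation that builds an indices mapping, e.g. shuffle() or filter()
ds = Dataset.from_dict({'text': [f'sample {i}' for i in range(1000)]})
shuffled = ds.shuffle(seed=42)

# Default behavior (unchanged): indices are flattened before writing to disk
shuffled.save_to_disk('shuffled_flattened')

# New option: skip the flattening step and save the dataset as-is
shuffled.save_to_disk('shuffled_raw', flatten_indices=False)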

@lhoestq WDYT?

@ArjunJagdale (Contributor, Author) commented:

As mentioned by @KCKawalkar, I used the script below to test:

BEFORE PATCH -
TEST.PY:

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered')
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s")
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3030654.07 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 576296.61 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 310565.19 examples/s]
Baseline: 0.035s
Filtered: 0.323s
Slowdown: 813.4%

AFTER PATCH -
TEST.PY:

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered', flatten_indices=False)
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s") 
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3027482.12 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 468901.89 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 324036.36 examples/s]
Baseline: 0.036s
Filtered: 0.310s
Slowdown: 771.1%
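As a quick sanity check after the benchmark (a hedged sketch, assuming the patch keeps the output of save_to_disk loadable via load_from_disk when flatten_indices=False), one can reload the saved dataset and compare it to the in-memory one:

from datasets import load_from_disk

reloaded = load_from_disk('filtered')
assert len(reloaded) == len(filtered)
assert reloaded[0] == filtered[0]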

