[WIP] Async checkpointing #3701

S1ro1 · 2025-08-01T01:30:18Z

Very much WIP. This overrides a bunch of stuff that I'm not sure is stable to override.
TODO: discuss whether we want a slightly different (and more easily maintainable) approach.

S1ro1 · 2025-08-01T01:32:01Z

src/accelerate/dist_checkpointing.py

+    from accelerate import Accelerator
+
+
+class AccelerateStorageWriter(FileSystemWriter):


This class is the issue: I'm overriding some quite interesting stuff from PyTorch, and I don't know whether I should (I asked on their Slack if it's safe). Without this class, we can't save the optimizer into one directory and the model into another, which is what we currently do.

S1ro1 · 2025-08-01T01:34:50Z

src/accelerate/dist_checkpointing.py

+        model_storage_md, optim_storage_md = {}, {}
+        for wr_list in results:
+            for wr in wr_list:
+                new_index = dataclasses.asdict(wr.index)


The WriteResult dataclass is frozen (which says a lot about the kind of war crimes I'm committing here), so we have to use some fancy Python tricks, such as `dataclasses.asdict`, to work around that.
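For readers unfamiliar with the frozen-dataclass workaround mentioned above, here is a minimal sketch of the general pattern. `FakeIndex`, `FakeWriteResult`, and `strip_prefix` are hypothetical stand-ins invented for illustration; they are not the real `torch.distributed.checkpoint` classes, and this is not necessarily how the PR does it. The idea is to copy the fields out, edit the copy, and build a new instance instead of mutating the frozen one.

```python
import dataclasses
from typing import Optional, Tuple


# Hypothetical stand-ins; NOT the real torch.distributed.checkpoint classes.
@dataclasses.dataclass(frozen=True)
class FakeIndex:
    fqn: str
    offset: Optional[Tuple[int, ...]] = None


@dataclasses.dataclass(frozen=True)
class FakeWriteResult:
    index: FakeIndex
    size_in_bytes: int


def strip_prefix(wr: FakeWriteResult, prefix: str) -> FakeWriteResult:
    """Return a copy of `wr` whose index.fqn no longer starts with `prefix`."""
    # asdict() gives a plain, mutable dict of the frozen index's fields.
    fields = dataclasses.asdict(wr.index)
    if fields["fqn"].startswith(prefix):
        fields["fqn"] = fields["fqn"][len(prefix):]
    # Build a new frozen index, then a new result that points at it.
    new_index = FakeIndex(**fields)
    return dataclasses.replace(wr, index=new_index)
```

The original object is left untouched, so any other holder of the old `FakeWriteResult` still sees the unmodified name.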

S1ro1 · 2025-08-01T01:36:25Z

src/accelerate/dist_checkpointing.py

+            result = []
+            for to_get in ["model", "optim"]:
+                result.append(
+                    Metadata(


By default, DCP assumes we're saving a single object called "state" into one directory, which we're not: we save "optimizer" into one subdirectory and "model" into another. That's why we have to update the metadata (strip the "state" prefix and split it in two).
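The prefix stripping and splitting described above can be sketched with plain dictionaries. This is only an illustration: the key layout (`state.model.…`, `state.optim.…`) and the helper name `split_state_metadata` are assumptions, and the real code operates on DCP `Metadata` objects rather than dicts.

```python
from typing import Any, Dict, Iterable


def split_state_metadata(
    flat: Dict[str, Any],
    groups: Iterable[str] = ("model", "optim"),
    root: str = "state",
) -> Dict[str, Dict[str, Any]]:
    """Split {"state.<group>.<name>": value} into one dict per group.

    Hypothetical helper: the "state.<group>." key layout is an assumption.
    """
    groups = tuple(groups)
    out: Dict[str, Dict[str, Any]] = {g: {} for g in groups}
    for key, value in flat.items():
        for g in groups:
            prefix = f"{root}.{g}."
            if key.startswith(prefix):
                # Drop "state.<group>." so each group looks like a standalone save.
                out[g][key[len(prefix):]] = value
                break
        else:
            # Fail loudly rather than silently dropping an entry.
            raise KeyError(f"{key!r} does not belong to any of {groups}")
    return out
```

Each resulting dictionary could then be written next to the files of its own subdirectory, so that the model and optimizer checkpoints can each be loaded independently.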

S1ro1 · 2025-08-01T01:37:35Z

src/accelerate/dist_checkpointing.py

+            self.fs.rename(tmp_path, metadata_path)
+
+
+def save_model_and_optimizer(


This is the only "public" API we expose, and not even really public: we only use it internally in accelerator.save_state.
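The `self.fs.rename(tmp_path, metadata_path)` line in the hunk above follows a common write-then-rename pattern: the metadata file is written under a temporary name and only moved to its final name once it is complete, so a reader never sees a half-written file. A minimal standard-library sketch of that pattern follows; `write_atomically` is a hypothetical helper, and it uses the local filesystem and `pickle` rather than the `fs` abstraction and serialization the PR actually uses.

```python
import os
import pickle
import tempfile
from typing import Any


def write_atomically(obj: Any, final_path: str) -> None:
    """Pickle `obj` to a temp file, then rename it onto `final_path`."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the target directory so the rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites an existing destination in one step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```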

HuggingFaceDocBuilderDev · 2025-08-01T01:38:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2025-08-31T15:06:46Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

S1ro1 · 2025-09-08T15:07:41Z

bump

github-actions · 2025-10-10T15:07:38Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

romitjain · 2025-10-13T12:23:56Z

Hi @S1ro1, I am eagerly waiting for this to be merged. Any idea how long it might take?
Is there any help I can provide (tests/dev)?

Also, a side question: looking at the PR, it doesn't seem to support safetensors serialization. Are there plans to support it, or are you open to contributions for that?

Thanks

github-actions · 2025-11-06T15:08:36Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

romitjain · 2025-11-12T07:55:15Z

@S1ro1 Just a gentle reminder. Let me know if you'd like me to take a stab at any pending changes so that you can review.
Thanks

romitjain · 2025-11-13T05:58:19Z

cc: @SunMarc

SunMarc · 2025-11-13T13:22:37Z

@S1ro1 is not working at HF anymore, so feel free to take over his PR if you want, @romitjain!

romitjain · 2025-11-13T16:43:42Z

Sure @SunMarc.
Let me get back to this next week.

WIP: very much wip but works (probably)

354b0b5

S1ro1 changed the title ~~WIP: [Async checkpointing]~~ [WIP] Async checkpointing Aug 1, 2025

S1ro1 mentioned this pull request Aug 25, 2025

Does FSDP with Accelerate support async checkpoint saving (e.g. via torch.distributed.checkpoint)? #3746

Closed

github-actions bot closed this Sep 8, 2025

S1ro1 reopened this Sep 8, 2025

S1ro1 added 2 commits September 13, 2025 14:16

Merge branch 'main' into feat/async-checkpointing

571ca02

Some stuff

5004251

SunMarc added the contributions-welcome label Nov 13, 2025
