Skip to content

Conversation

@S1ro1
Copy link
Contributor

@S1ro1 S1ro1 commented Aug 1, 2025

Very much WIP, overrides bunch of stuff I'm not sure that is stable to do.
TODO: discuss if we want to do a bit different approach (and more easily maintainable)

from accelerate import Accelerator


class AccelerateStorageWriter(FileSystemWriter):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is the issue: I'm overriding quite interesting stuff from Pytorch that idk if I should (asked on their slack if it's safe). If we don't have this, we can't save optimizer into 1 directory and model into another, which we currently do

model_storage_md, optim_storage_md = {}, {}
for wr_list in results:
for wr in wr_list:
new_index = dataclasses.asdict(wr.index)
Copy link
Contributor Author

@S1ro1 S1ro1 Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteResult dataclass is frozen (which tells a lot about what kind of war crimes I do here), so we have to use some fancy python things to avoid that.

result = []
for to_get in ["model", "optim"]:
result.append(
Metadata(
Copy link
Contributor Author

@S1ro1 S1ro1 Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default DCP thinks we're saving an object called "state" into 1 directory, which we're not. We're saving "optimizer" into 1 subdirectory and "model" into another. That's why we have to update the metadata (remove the "state" prefix and split it into 2)

self.fs.rename(tmp_path, metadata_path)


def save_model_and_optimizer(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only "public" api that we expose, not even. We only use this internally in accelerator.save_state.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@S1ro1 S1ro1 changed the title WIP: [Async checkpointing] [WIP] Async checkpointing Aug 1, 2025
@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Sep 8, 2025
@S1ro1
Copy link
Contributor Author

S1ro1 commented Sep 8, 2025

bump

@S1ro1 S1ro1 reopened this Sep 8, 2025
@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@romitjain
Copy link

Hi @S1ro1, I am eagerly waiting for this to be merged. Any idea of how much time it might take?
Is there any help that I can provide (tests/dev)?

Also, a side question, looking at the PR, it seems like it doesn't support safetensor serialization. Are there plans to support it, or are you open to contributions for that?

Thanks

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@romitjain
Copy link

@S1ro1 Just a gentle reminder. Let me know if you'd like me to take a stab at any pending changes, and you can review?
Thanks

@romitjain
Copy link

cc: @SunMarc

@SunMarc
Copy link
Member

SunMarc commented Nov 13, 2025

@S1ro1 is not working at HF anymore, so feel free to take over his PR if you want @romitjain !

@romitjain
Copy link

Sure @SunMarc.
Let me get back on this next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants