Description
Describe the bug
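The CheckpointEngine.commit(info: CheckpointCommitInfo) interface does not match the way DeepSpeedEngine calls it. Line 3527 in runtime/engine.py passes the tag string, self.checkpoint_engine.commit(tag), but commit expects a CheckpointCommitInfo object and the Nebula checkpoint engine fails when it reads info.tag. The call should be
self.checkpoint_engine.commit(commit_info)
Traceback: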
[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1381, in save_checkpoint
[rank0]: self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[rank0]: File "/scratch/azureml/cr/j/85a4996e3fa242ed9f68c4faddc40a52/exe/wd/src/smile_lightning/core/strategies/deepspeed.py", line 103, in save_checkpoint
[rank0]: self.deepspeed_engine.save_checkpoint(
[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3527, in save_checkpoint
[rank0]: self.checkpoint_engine.commit(tag)
[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/nebula_checkpoint_engine.py", line 101, in commit
[rank0]: tag = info.tag
[rank0]: AttributeError: 'str' object has no attribute 'tag'
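The mismatch can be reproduced in isolation with a minimal sketch. The classes below are hypothetical stand-ins, not DeepSpeed's actual implementations: the only details taken from the traceback are that commit reads info.tag and that the caller passes a plain string. The real CheckpointCommitInfo may carry more fields than tag, and the tag value used here is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckpointCommitInfo:
    """Hypothetical stand-in; the real DeepSpeed class may carry more fields."""
    tag: str


class SketchCheckpointEngine:
    """Hypothetical stand-in for an engine whose commit() expects an info object."""

    def commit(self, info: CheckpointCommitInfo) -> str:
        # Mirrors the failing line in the traceback: read the tag off the info object.
        return info.tag


engine = SketchCheckpointEngine()

# Passing a bare string, as the reported call does, fails on attribute access.
try:
    engine.commit("example_tag")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Wrapping the tag in the info object satisfies the interface.
committed_tag = engine.commit(CheckpointCommitInfo(tag="example_tag"))

print(error_message)
print(committed_tag)
```

The first call fails the same way as the traceback above, while the second succeeds. This suggests the fix belongs at the call site in save_checkpoint rather than in the engine.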