[BUG] nebula checkpoint engine AttributeError: 'str' object has no attribute 'tag' #7678

@unavailableun

Describe the bug
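The CheckpointEngine.commit(info: CheckpointCommitInfo) interface does not match how DeepSpeedEngine calls it. DeepSpeedEngine.save_checkpoint passes the tag string to commit(), but the Nebula checkpoint engine expects a CheckpointCommitInfo object and reads info.tag from it.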

Line 3527 in runtime/engine.py currently calls self.checkpoint_engine.commit(tag). It should be:
self.checkpoint_engine.commit(commit_info)

[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1381, in save_checkpoint
[rank0]: self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[rank0]: File "/scratch/azureml/cr/j/85a4996e3fa242ed9f68c4faddc40a52/exe/wd/src/smile_lightning/core/strategies/deepspeed.py", line 103, in save_checkpoint
[rank0]: self.deepspeed_engine.save_checkpoint(
[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3527, in save_checkpoint
[rank0]: self.checkpoint_engine.commit(tag)
[rank0]: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/nebula_checkpoint_engine.py", line 101, in commit
[rank0]: tag = info.tag
[rank0]: AttributeError: 'str' object has no attribute 'tag'
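
The mismatch in the traceback above can be reproduced in isolation. The following is a minimal sketch, not DeepSpeed's actual code: `NebulaLikeCheckpointEngine` is a hypothetical stand-in for the Nebula engine, and the `CheckpointCommitInfo` defined here is a simplified placeholder that models only the `tag` attribute the traceback confirms. The tag value is also made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckpointCommitInfo:
    # Simplified placeholder; only the `tag` attribute is confirmed by the traceback.
    tag: str


class NebulaLikeCheckpointEngine:
    # Hypothetical stand-in that mirrors the failing line in nebula_checkpoint_engine.py.
    def commit(self, info: CheckpointCommitInfo) -> bool:
        tag = info.tag  # raises AttributeError when `info` is a plain str
        print(f"committed checkpoint {tag}")
        return True


engine = NebulaLikeCheckpointEngine()
tag = "global_step100"  # illustrative tag value

# Call pattern seen in the traceback: the tag string is passed directly.
error_message = None
try:
    engine.commit(tag)
except AttributeError as exc:
    error_message = str(exc)
    print(f"AttributeError: {error_message}")

# Proposed call pattern: wrap the tag in a commit-info object before committing.
commit_info = CheckpointCommitInfo(tag=tag)
committed = engine.commit(commit_info)
```

The first call fails for the same reason as the reported error, and the second succeeds because the engine receives an object that carries a `tag` attribute. How `commit_info` is constructed inside `save_checkpoint` is left open here, since the issue does not show it.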


Labels: bug, training
