OOM when using use_usp=True for accelerated inference #1164

@INV-WZQ

Description

Problem

I am running the Wan2.2-I2V-A14B model on an 8xH100 server with use_usp=True.

Despite each H100 having 80 GB of VRAM, I hit an OutOfMemoryError almost immediately during model loading. Notably, the tracebacks from both rank 3 and rank 4 report GPU 0, and GPU 0 is shared by several processes.

Error

....
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/DiffSynth-Studio/./examples/wanvideo/model_training/validate_lora/Wan2.2-I2V-A14B.py", line 12, in <module>
[rank4]:     pipe = WanVideoPipeline.from_pretrained(
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/pipelines/wan_video.py", line 130, in from_pretrained
[rank4]:     model_pool = pipe.download_and_load_models(model_configs, vram_limit)
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/diffusion/base_pipeline.py", line 289, in download_and_load_models
[rank4]:     model_pool.auto_load_model(
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/models/model_loader.py", line 70, in auto_load_model
[rank4]:     model = self.load_model_file(config, path, vram_config, vram_limit=vram_limit)
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/models/model_loader.py", line 40, in load_model_file
[rank4]:     model = load_model(
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/core/loader/model.py", line 48, in load_model
[rank4]:     state_dict = {i: state_dict[i] for i in state_dict}
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/core/loader/model.py", line 48, in <dictcomp>
[rank4]:     state_dict = {i: state_dict[i] for i in state_dict}
[rank4]:   File "/home/DiffSynth-Studio/diffsynth/core/vram/disk_map.py", line 62, in __getitem__
[rank4]:     param = self.files[file_id].get_tensor(name)
[rank4]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 220.75 MiB is free. Process 589416 has 4.82 GiB memory in use. Process 589420 has 28.22 GiB memory in use. Including non-PyTorch memory, this process has 17.16 GiB memory in use. Process 589413 has 520.00 MiB memory in use. Process 589419 has 28.22 GiB memory in use. Of the allocated memory 16.00 GiB is allocated by PyTorch, and 580.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
....
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 84.75 MiB is free. Including non-PyTorch memory, this process has 4.95 GiB memory in use. Process 589420 has 28.22 GiB memory in use. Process 589417 has 17.16 GiB memory in use. Process 589413 has 520.00 MiB memory in use. Process 589419 has 28.22 GiB memory in use. Of the allocated memory 4.14 GiB is allocated by PyTorch, and 220.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
....

Code

import os
import torch
from PIL import Image
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig
from modelscope import dataset_snapshot_download


pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.2-I2V-A14B", origin_file_pattern="high_noise_model/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Wan-AI/Wan2.2-I2V-A14B", origin_file_pattern="low_noise_model/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Wan-AI/Wan2.2-I2V-A14B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth"),
        ModelConfig(model_id="Wan-AI/Wan2.2-I2V-A14B", origin_file_pattern="Wan2.1_VAE.pth"),
    ],
    use_usp=True,
)
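
The tracebacks above suggest every rank is allocating on GPU 0, which is consistent with device="cuda" resolving to cuda:0 in all eight processes. Below is a minimal sketch of a possible workaround (a hypothetical helper, not verified against DiffSynth-Studio internals): derive the device from the LOCAL_RANK environment variable that torchrun sets, so each process binds to its own GPU.

```python
import os

def per_rank_device(env=os.environ):
    """Map this process to its own GPU via torchrun's LOCAL_RANK.

    If every process passes device="cuda", it resolves to cuda:0, so all
    eight ranks load their weights onto the same 80 GB card.
    """
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# Hypothetical usage with the pipeline call from this report:
# pipe = WanVideoPipeline.from_pretrained(
#     torch_dtype=torch.bfloat16,
#     device=per_rank_device(),
#     model_configs=[...],
#     use_usp=True,
# )
```

Calling torch.cuda.set_device(per_rank_device()) before loading may also help pin default allocations to the right card, though whether the pipeline respects that is an assumption on my part.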

Setting

Device: 8x H100
CUDA: 12.8

torch==2.9.1
diffsynth==2.0.0
