Z1/2 should flatten tensors on gpu #7677

@stas00

When loading huge models like Qwen3-30B on multiple gpus, ZeRO stages 1 and 2 (z1/2) perform tensor flattening on cpu, with potentially 8 ranks (one per gpu) competing with each other for the same cpu. This makes deepspeed.initialize excruciatingly slow while putting a huge load on the cpu, making the whole node very sluggish.

Repro (change train_batch_size to the world size, i.e. 1 if using 1 gpu):

# cat t.py
import deepspeed, torch
from transformers import AutoModel, AutoConfig
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
c = AutoConfig.from_pretrained(model)
c.num_hidden_layers=16 # cut the number of layers for faster loading
m = AutoModel.from_pretrained(model, config=c)
dsc = dict(
    train_micro_batch_size_per_gpu=1,
    train_batch_size=8, # must match the world size here (1 for a single gpu)
    zero_optimization=dict(stage=2),
    bf16=dict(enabled=True),
    optimizer=dict(type="AdamW"),
    gradient_accumulation_steps=1,
)
# the slow part: z1/2 flatten the tensors on cpu inside this call
deepspeed.initialize(model=m, config=dsc)
torch.distributed.destroy_process_group()

Note that the repro script shortens the model from the full 48 layers to 16 to have mercy on the person debugging it (you can lower it further as well).
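To avoid editing train_batch_size by hand for each run, the config could be derived from the world size. This is only a sketch: make_ds_config is a hypothetical helper that is not part of DeepSpeed, and it assumes the launcher exports WORLD_SIZE to each rank. It relies on the DeepSpeed rule that train_batch_size equals micro batch size x gradient accumulation steps x number of gpus.

```python
import os

def make_ds_config(world_size, micro_batch=1, grad_accum=1):
    """Hypothetical helper: build the repro config for a given world size."""
    return dict(
        train_micro_batch_size_per_gpu=micro_batch,
        gradient_accumulation_steps=grad_accum,
        # DeepSpeed requires: train_batch_size = micro_batch * grad_accum * world_size
        train_batch_size=micro_batch * grad_accum * world_size,
        zero_optimization=dict(stage=2),
        bf16=dict(enabled=True),
        optimizer=dict(type="AdamW"),
    )

# Assumption: the launcher sets WORLD_SIZE; fall back to 1 if it is missing.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
dsc = make_ds_config(world_size)
```

With this, the same t.py could be launched with --num_gpus 8 or --num_gpus 1 without touching the config.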

Then run:

$ time deepspeed --num_gpus 8 t.py
real    20m35.321s
user    2244m0.271s
sys     88m31.586s

Now exactly the same but with 1 gpu (change to train_batch_size=1 above):

$ time deepspeed --num_gpus 1 t.py
real    3m11.001s
user    49m56.073s
sys     9m16.009s

About 2min of each run is spent loading the model weights, so we are looking at roughly 18m35s vs 1m11s of deepspeed.initialize time, which is a huge difference.

The full model has 48 layers, so initializing it will need more than 1h, even though loading the full model takes only ~5min.
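The arithmetic behind these numbers can be sketched as follows. The wall-clock times and the ~2min load time come from the runs above. The assumptions are that the load time is the same for both runs and that initialization time scales linearly with the number of layers, neither of which was measured here.

```python
# Wall-clock ("real") times reported by `time`, in seconds.
real_8gpu = 20 * 60 + 35.321
real_1gpu = 3 * 60 + 11.001

# Assumption: ~2 min of each run is model weight loading.
load_time = 2 * 60

# Time attributed to deepspeed.initialize in each run.
init_8gpu = real_8gpu - load_time
init_1gpu = real_1gpu - load_time
slowdown = init_8gpu / init_1gpu

# Assumption: init time scales linearly from 16 layers to the full 48.
full_init_8gpu_min = init_8gpu * (48 / 16) / 60

print(f"init on 8 gpus: {init_8gpu / 60:.1f} min")
print(f"init on 1 gpu:  {init_1gpu / 60:.1f} min")
print(f"slowdown:       {slowdown:.1f}x")
print(f"48-layer init estimate on 8 gpus: {full_init_8gpu_min:.0f} min")
```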

Currently the flattening of tensors happens on cpu, a legacy of the time when gpu memory was very small. There is absolutely no reason not to run this on gpu, which would probably be at least 10x faster.
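As an illustration of what flattening on the gpu could look like, here is a minimal sketch in plain PyTorch. It is not DeepSpeed's implementation: flatten_on_device and unflatten_views are hypothetical names, and the sketch assumes all tensors share one dtype. It preallocates the flat buffer on the target device and copies tensors in one at a time, so the extra device memory is the flat buffer itself rather than a second full copy of every tensor.

```python
import torch

def flatten_on_device(tensors, device):
    """Hypothetical helper: pack tensors into one 1-D buffer on `device`."""
    total = sum(t.numel() for t in tensors)
    # Assumption: all tensors share the dtype of the first one.
    flat = torch.empty(total, dtype=tensors[0].dtype, device=device)
    offset = 0
    for t in tensors:
        n = t.numel()
        # copy_ handles the host-to-device transfer for each tensor
        flat[offset:offset + n].copy_(t.detach().reshape(-1))
        offset += n
    return flat

def unflatten_views(flat, tensors):
    """Return views into `flat` with the shapes of the original tensors."""
    views, offset = [], 0
    for t in tensors:
        n = t.numel()
        views.append(flat[offset:offset + n].view(t.shape))
        offset += n
    return views

# Falls back to cpu so the sketch also runs on a machine without a gpu.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
params = [torch.randn(4, 3), torch.randn(5), torch.randn(2, 2, 2)]
flat = flatten_on_device(params, device)
views = unflatten_views(flat, params)
```

The views share storage with the flat buffer, so an in-place update to flat is visible through every view.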

@tjruwase

Labels: enhancement
