Description
When loading huge models like Qwen3-30B on multiple GPUs, ZeRO stage 1/2 performs tensor flattening on the CPU, with potentially 8 GPUs competing with each other. This makes deepspeed.initialize excruciatingly slow and puts a huge load on the CPU, making the whole node very sluggish.
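For context, the expensive step is roughly the following: at init time ZeRO stage 1/2 concatenates every parameter of a group into one contiguous flat buffer and points the parameters back into it. A simplified sketch of that step (illustrative only, not DeepSpeed's actual implementation; the helper name flatten_param_group is made up):
# simplified sketch of the ZeRO 1/2 flattening step (illustrative only)
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def flatten_param_group(params):
    # one big concatenation into a contiguous buffer; for a 30B-parameter
    # model this is gigabytes of memcpy per rank when done on the CPU
    flat = _flatten_dense_tensors([p.data for p in params])
    # re-point each parameter at its view inside the flat buffer
    for p, view in zip(params, _unflatten_dense_tensors(flat, [p.data for p in params])):
        p.data = view
    return flat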
Repro (change train_batch_size to match the world size, i.e. 1 if using 1 GPU):
# cat t.py
import sys, deepspeed, torch
from transformers import AutoModel, AutoConfig

model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
c = AutoConfig.from_pretrained(model)
c.num_hidden_layers = 16  # cut the number of layers for faster loading
m = AutoModel.from_pretrained(model, config=c)
dsc = dict(
    train_micro_batch_size_per_gpu=1,
    train_batch_size=8,
    zero_optimization=dict(stage=2),
    bf16=dict(enabled=True),
    optimizer=dict(type="AdamW"),
    gradient_accumulation_steps=1,
)
deepspeed.initialize(model=m, config=dsc)
torch.distributed.destroy_process_group()
Note that the repro script shortens the model from the full 48 layers to 16 to have mercy on the person debugging it. (You can lower it further as well.)
Then run:
$ time deepspeed --num_gpus 8 t.py
real 20m35.321s
user 2244m0.271s
sys 88m31.586s
Now run exactly the same thing but with 1 GPU (change to train_batch_size=1 above):
$ time deepspeed --num_gpus 1 t.py
real 3m11.001s
user 49m56.073s
sys 9m16.009s
About 2 min of that is spent loading the model weights, so we are looking at 19m35s vs 1m11s of deepspeed.initialize time - a huge difference.
This model actually has 48 layers, so initializing the full model would take more than 1 hour, even though loading the full model only takes ~5 min.
Currently the flattening of tensors happens on the CPU - a legacy implementation from when GPU memory was very small. There is absolutely no reason not to run this on the GPU, which would probably be at least 10x faster.
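As a rough illustration of the potential win, here is a hedged micro-benchmark sketch (sizes are made up and this is not DeepSpeed code) that flattens the same list of bf16 tensors once on the CPU and once on the GPU:
# micro-benchmark sketch: flatten the same tensors on CPU vs GPU
# sizes are illustrative; requires a CUDA GPU with a few GB free
import time
import torch
from torch._utils import _flatten_dense_tensors

chunks = [torch.empty(64_000_000, dtype=torch.bfloat16) for _ in range(16)]  # ~2 GB total

t0 = time.perf_counter()
flat_cpu = _flatten_dense_tensors(chunks)       # CPU-bound concatenation
t1 = time.perf_counter()

chunks_gpu = [t.cuda() for t in chunks]
torch.cuda.synchronize()
t2 = time.perf_counter()
flat_gpu = _flatten_dense_tensors(chunks_gpu)   # same op, on the GPU
torch.cuda.synchronize()
t3 = time.perf_counter()

print(f"cpu: {t1 - t0:.2f}s  gpu: {t3 - t2:.2f}s")
Even this toy comparison ignores the extra contention from 8 ranks flattening on the same CPU at once, which is where much of the slowdown above comes from.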