Description
When loading huge models like Qwen3-30B on multiple GPUs, ZeRO stage 1/2 performs tensor flattening on the CPU, with potentially 8 GPUs competing with each other. This makes deepspeed.initialize excruciatingly slow and puts a huge load on the CPU, making the whole node very sluggish.
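For context, the expensive step is roughly the following: at init time ZeRO stage 1/2 concatenates every parameter of a group into one contiguous flat buffer and points the parameters back into it. A simplified sketch of that step (illustrative only, not DeepSpeed's actual implementation; the helper name flatten_param_group is made up):
# simplified sketch of the ZeRO 1/2 flattening step (illustrative only)
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def flatten_param_group(params):
    # one big concatenation into a contiguous buffer; for a 30B-parameter
    # model this is gigabytes of memcpy per rank when done on the CPU
    flat = _flatten_dense_tensors([p.data for p in params])
    # re-point each parameter at its view inside the flat buffer
    for p, view in zip(params, _unflatten_dense_tensors(flat, [p.data for p in params])):
        p.data = view
    return flat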
Repro (change train_batch_size to match the world size, i.e. 1 if using 1 GPU):
# cat t.py
import sys, deepspeed, torch
from transformers import AutoModel, AutoConfig

model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
c = AutoConfig.from_pretrained(model)
c.num_hidden_layers = 16  # cut the number of layers for faster loading
m = AutoModel.from_pretrained(model, config=c)
dsc = dict(
    train_micro_batch_size_per_gpu=1,
    train_batch_size=8,
    zero_optimization=dict(stage=2),
    bf16=dict(enabled=True),
    optimizer=dict(type="AdamW"),
    gradient_accumulation_steps=1,
)
deepspeed.initialize(model=m, config=dsc)
torch.distributed.destroy_process_group()
Note that the repro script shortens the model from the full 48 layers to 16 to have mercy on the person debugging it. (You can lower it further as well.)
Then run:
$ time deepspeed --num_gpus 8 t.py
real 20m35.321s
user 2244m0.271s
sys 88m31.586s
Now run exactly the same thing but with 1 GPU (change to train_batch_size=1 above):
$ time deepspeed --num_gpus 1 t.py
real 3m11.001s
user 49m56.073s
sys 9m16.009s
About 2 min of that is spent loading the model weights, so we are looking at 19m35s vs 1m11s of deepspeed.initialize time - a huge difference.
This model actually has 48 layers, so initializing the full model would take more than 1 hour, even though loading the full model only takes ~5 min.
Currently the flattening of tensors happens on the CPU - a legacy implementation from when GPU memory was very small. There is absolutely no reason not to run this on the GPU, which would probably be at least 10x faster.
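As a rough illustration of the potential win, here is a hedged micro-benchmark sketch (sizes are made up and this is not DeepSpeed code) that flattens the same list of bf16 tensors once on the CPU and once on the GPU:
# micro-benchmark sketch: flatten the same tensors on CPU vs GPU
# sizes are illustrative; requires a CUDA GPU with a few GB free
import time
import torch
from torch._utils import _flatten_dense_tensors

chunks = [torch.empty(64_000_000, dtype=torch.bfloat16) for _ in range(16)]  # ~2 GB total

t0 = time.perf_counter()
flat_cpu = _flatten_dense_tensors(chunks)       # CPU-bound concatenation
t1 = time.perf_counter()

chunks_gpu = [t.cuda() for t in chunks]
torch.cuda.synchronize()
t2 = time.perf_counter()
flat_gpu = _flatten_dense_tensors(chunks_gpu)   # same op, on the GPU
torch.cuda.synchronize()
t3 = time.perf_counter()

print(f"cpu: {t1 - t0:.2f}s  gpu: {t3 - t2:.2f}s")
Even this toy comparison ignores the extra contention from 8 ranks flattening on the same CPU at once, which is where much of the slowdown above comes from.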