
Commit cc3bb0b

created readme
Signed-off-by: Onkar Chougule <[email protected]>
1 parent be5ef75 commit cc3bb0b

File tree

4 files changed: +206 -0 lines changed


examples/disagg_serving/README.md

Lines changed: 31 additions & 0 deletions
# We should be using disaggregated serving for the GPT-OSS model for best performance

- The GPT-OSS model has a total_experts/experts_per_tok ratio of 128/4 for the 120B variant and 32/4 for the 20B variant
- In the prefill-only model we always use the read-all-experts-exactly-once strategy
- In the decode-only model we treat the expert weights as activations, meaning only the chosen experts are read

# Prefill-only model

## Blocking: default behaviour when `prefill_only=True` in the compile API

- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in the FFN
- `ENABLE_OPT_SWA="0"` or `"1"` disables/enables optimized SWA; when enabled, attention reads only the KVs valid for the given block, reducing MACs
- prefix_caching is not supported in this mode
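
A minimal sketch of the blocking prefill-only setup, assuming the QEfficient `QEFFAutoModelForCausalLM` interface; the model card, sequence lengths, and core/device counts are illustrative assumptions, only the environment variables and the `prefill_only` flag come from this README:

```python
import os

# Block counts and the optimized-SWA toggle are read from environment variables
# (names from this README; the values here are just examples).
os.environ["NUM_Q_BLOCKS"] = "4"    # number of Q blocks in attention
os.environ["NUM_FFN_BLOCKS"] = "4"  # number of blocks in the FFN
os.environ["ENABLE_OPT_SWA"] = "1"  # read only the valid KVs per block in attention

from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")  # illustrative model card

# Compile the prefill-only model; all numeric values are illustrative.
model.compile(
    prefill_only=True,
    prefill_seq_len=4096,
    ctx_len=8192,
    num_devices=4,
    num_cores=16,
)
```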

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the compile API

- Optimized SWA, i.e. reading only the KVs valid under the diagonal attention mask, is enabled by default for this version
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in the compile API
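
A sketch of the chunked prefill-only variant with prefix caching, under the same assumed API; only `enable_chunking` and `kv_cache_batch_size` come from this section, the remaining values are illustrative:

```python
from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")  # illustrative model card

# Chunked prefill-only compile; optimized SWA is on by default in this mode.
model.compile(
    prefill_only=True,
    enable_chunking=True,
    kv_cache_batch_size=4,  # enables prefix caching; value is illustrative
    prefill_seq_len=4096,   # illustrative
    ctx_len=8192,           # illustrative
    num_devices=4,          # illustrative
    num_cores=16,           # illustrative
)
```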

# Decode-only model

## Retain sliding-window-length KV for sliding-window layers: default behaviour when `prefill_seq_len=1` in the compile API

- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed
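
A sketch of the decode-only compile with continuous batching, under the same assumed API; the `full_batch_size` and `kv_cache_batch_size` values are illustrative:

```python
from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

# Continuous batching must be requested at load time.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",    # illustrative model card
    continuous_batching=True,
)

# prefill_seq_len=1 selects the decode-only model; sliding-window layers keep
# only sliding-window-length KV, which reduces DDR usage.
model.compile(
    prefill_seq_len=1,
    ctx_len=8192,            # illustrative
    full_batch_size=8,       # required with continuous batching; illustrative
    kv_cache_batch_size=16,  # optional; illustrative
    num_devices=4,           # illustrative
    num_cores=16,            # illustrative
)
```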

## Full KV for sliding-window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API

- This uses more DDR, since we retain ctx_len KV even for sliding-window layers, but attention still reads only sliding-window-length KV
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed
- This is intended for the multi-turn chat use case, where we run prefill -> decode and then use the combined prefill and decode cache to run prefill again, so we want to retain the full KV for sliding-window layers
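
The same decode-only sketch adapted for the multi-turn chat case; only `retain_full_kv` is new, all other values remain illustrative assumptions:

```python
from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",    # illustrative model card
    continuous_batching=True,
)

# retain_full_kv keeps ctx_len KV even for sliding-window layers so a later
# prefill can reuse the combined prefill+decode cache; attention still reads
# only the sliding-window-length KV.
model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,
    ctx_len=8192,       # illustrative
    full_batch_size=8,  # required with continuous batching; illustrative
    num_devices=4,      # illustrative
    num_cores=16,       # illustrative
)
```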

NOTE:

* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it there
* The 120B model needs NPI; two versions of the NPI file, one with and one without subfunctions, are uploaded here. Pass it as `node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise compilation times are too high. With this option the model is expected to export and then fail during compile because it needs the assert SDK, so the user is expected to run the compilation manually by pasting the command printed in the error
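
How the NOTE items combine for the 120B prefill-only model, a sketch under the same assumptions; the NPI path is a placeholder for one of the files in this commit:

```python
from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")  # illustrative model card

# Subfunctions keep prefill-only compile times manageable; node_precision_info
# points at the NPI YAML (the with-subfunctions variant in this case). Export
# is expected to succeed and compilation to stop with a printed command that
# must then be re-run manually against the assert SDK.
model.compile(
    prefill_only=True,
    use_onnx_subfunctions=True,
    node_precision_info="<path to NPI yaml>",  # placeholder path
    prefill_seq_len=4096,  # illustrative
    ctx_len=8192,          # illustrative
    num_devices=4,         # illustrative
    num_cores=16,          # illustrative
)
```
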
File renamed without changes.
Lines changed: 27 additions & 0 deletions
FP32NodeInstanceNames:
- CustomRMSNorm_58
- onnx::Shape_1033777
- CustomRMSNorm_349
- hidden.127
- CustomRMSNorm_27448
- onnx::Shape_1066066
- CustomRMSNorm_27709
- hidden.131
- CustomRMSNorm_54808
- onnx::Shape_878
- CustomRMSNorm_55105
- hidden
- hidden_states.259
- Add_348
- Add_347
- onnx::Add_1034099
- hidden_states.267
- Add_27708
- onnx::Add_1066358
- Add_27707
- hidden_states.3
- Add_55104
- onnx::Add_1209
- Add_55103
- /model/norm/CustomRMSNorm
- /model/norm/CustomRMSNorm_output_0
Lines changed: 148 additions & 0 deletions
FP32NodeInstanceNames:
- /model/layers.0/Add_1_output_0
- /model/layers.0/Add_output_0
- /model/layers.0/input_layernorm/CustomRMSNorm_output_0
- /model/layers.0/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.1/Add_1_output_0
- /model/layers.1/Add_output_0
- /model/layers.1/input_layernorm/CustomRMSNorm_output_0
- /model/layers.1/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.10/Add_1_output_0
- /model/layers.10/Add_output_0
- /model/layers.10/input_layernorm/CustomRMSNorm_output_0
- /model/layers.10/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.11/Add_1_output_0
- /model/layers.11/Add_output_0
- /model/layers.11/input_layernorm/CustomRMSNorm_output_0
- /model/layers.11/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.12/Add_1_output_0
- /model/layers.12/Add_output_0
- /model/layers.12/input_layernorm/CustomRMSNorm_output_0
- /model/layers.12/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.13/Add_1_output_0
- /model/layers.13/Add_output_0
- /model/layers.13/input_layernorm/CustomRMSNorm_output_0
- /model/layers.13/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.14/Add_1_output_0
- /model/layers.14/Add_output_0
- /model/layers.14/input_layernorm/CustomRMSNorm_output_0
- /model/layers.14/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.15/Add_1_output_0
- /model/layers.15/Add_output_0
- /model/layers.15/input_layernorm/CustomRMSNorm_output_0
- /model/layers.15/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.16/Add_1_output_0
- /model/layers.16/Add_output_0
- /model/layers.16/input_layernorm/CustomRMSNorm_output_0
- /model/layers.16/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.17/Add_1_output_0
- /model/layers.17/Add_output_0
- /model/layers.17/input_layernorm/CustomRMSNorm_output_0
- /model/layers.17/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.18/Add_1_output_0
- /model/layers.18/Add_output_0
- /model/layers.18/input_layernorm/CustomRMSNorm_output_0
- /model/layers.18/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.19/Add_1_output_0
- /model/layers.19/Add_output_0
- /model/layers.19/input_layernorm/CustomRMSNorm_output_0
- /model/layers.19/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.2/Add_1_output_0
- /model/layers.2/Add_output_0
- /model/layers.2/input_layernorm/CustomRMSNorm_output_0
- /model/layers.2/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.20/Add_1_output_0
- /model/layers.20/Add_output_0
- /model/layers.20/input_layernorm/CustomRMSNorm_output_0
- /model/layers.20/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.21/Add_1_output_0
- /model/layers.21/Add_output_0
- /model/layers.21/input_layernorm/CustomRMSNorm_output_0
- /model/layers.21/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.22/Add_1_output_0
- /model/layers.22/Add_output_0
- /model/layers.22/input_layernorm/CustomRMSNorm_output_0
- /model/layers.22/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.23/Add_1_output_0
- /model/layers.23/Add_output_0
- /model/layers.23/input_layernorm/CustomRMSNorm_output_0
- /model/layers.23/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.24/Add_1_output_0
- /model/layers.24/Add_output_0
- /model/layers.24/input_layernorm/CustomRMSNorm_output_0
- /model/layers.24/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.25/Add_1_output_0
- /model/layers.25/Add_output_0
- /model/layers.25/input_layernorm/CustomRMSNorm_output_0
- /model/layers.25/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.26/Add_1_output_0
- /model/layers.26/Add_output_0
- /model/layers.26/input_layernorm/CustomRMSNorm_output_0
- /model/layers.26/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.27/Add_1_output_0
- /model/layers.27/Add_output_0
- /model/layers.27/input_layernorm/CustomRMSNorm_output_0
- /model/layers.27/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.28/Add_1_output_0
- /model/layers.28/Add_output_0
- /model/layers.28/input_layernorm/CustomRMSNorm_output_0
- /model/layers.28/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.29/Add_1_output_0
- /model/layers.29/Add_output_0
- /model/layers.29/input_layernorm/CustomRMSNorm_output_0
- /model/layers.29/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.3/Add_1_output_0
- /model/layers.3/Add_output_0
- /model/layers.3/input_layernorm/CustomRMSNorm_output_0
- /model/layers.3/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.30/Add_1_output_0
- /model/layers.30/Add_output_0
- /model/layers.30/input_layernorm/CustomRMSNorm_output_0
- /model/layers.30/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.31/Add_1_output_0
- /model/layers.31/Add_output_0
- /model/layers.31/input_layernorm/CustomRMSNorm_output_0
- /model/layers.31/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.32/Add_1_output_0
- /model/layers.32/Add_output_0
- /model/layers.32/input_layernorm/CustomRMSNorm_output_0
- /model/layers.32/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.33/Add_1_output_0
- /model/layers.33/Add_output_0
- /model/layers.33/input_layernorm/CustomRMSNorm_output_0
- /model/layers.33/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.34/Add_1_output_0
- /model/layers.34/Add_output_0
- /model/layers.34/input_layernorm/CustomRMSNorm_output_0
- /model/layers.34/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.35/Add_1_output_0
- /model/layers.35/Add_output_0
- /model/norm/Add_output_0
- /model/layers.35/input_layernorm/CustomRMSNorm_output_0
- /model/layers.35/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.4/Add_1_output_0
- /model/layers.4/Add_output_0
- /model/layers.4/input_layernorm/CustomRMSNorm_output_0
- /model/layers.4/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.5/Add_1_output_0
- /model/layers.5/Add_output_0
- /model/layers.5/input_layernorm/CustomRMSNorm_output_0
- /model/layers.5/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.6/Add_1_output_0
- /model/layers.6/Add_output_0
- /model/layers.6/input_layernorm/CustomRMSNorm_output_0
- /model/layers.6/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.7/Add_1_output_0
- /model/layers.7/Add_output_0
- /model/layers.7/input_layernorm/CustomRMSNorm_output_0
- /model/layers.7/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.8/Add_1_output_0
- /model/layers.8/Add_output_0
- /model/layers.8/input_layernorm/CustomRMSNorm_output_0
- /model/layers.8/post_attention_layernorm/CustomRMSNorm_output_0
- /model/layers.9/Add_1_output_0
- /model/layers.9/Add_output_0
- /model/layers.9/input_layernorm/CustomRMSNorm_output_0
- /model/layers.9/post_attention_layernorm/CustomRMSNorm_output_0
- /model/norm/CustomRMSNorm_output_0
