Prefill+decode gpt oss #608

ochougul · 2025-11-05T06:22:53Z

We should use disaggregated serving for the GPT-OSS model for best performance

GPT-OSS has a total_experts/experts_per_tok ratio of 128/4 for the 120b model and 32/4 for the 20b model
The prefill-only model always reads all experts exactly once
The decode-only model treats the expert weights as activations, meaning it reads only the chosen experts
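To make the trade-off between the two strategies concrete, here is a rough back-of-the-envelope sketch. It assumes uniform, independent routing, which is a simplification for illustration and not something this PR measures; the helper names are hypothetical.

```python
def expert_read_fraction(total_experts: int, experts_per_tok: int) -> float:
    """Fraction of the expert weights that a single token needs."""
    return experts_per_tok / total_experts


def expected_experts_touched(total_experts: int, experts_per_tok: int, num_tokens: int) -> float:
    """Expected number of distinct experts hit by num_tokens tokens.

    Assumption: every token picks experts_per_tok distinct experts uniformly
    at random, independently of the other tokens.
    """
    miss = 1.0 - experts_per_tok / total_experts  # P(one token skips a given expert)
    return total_experts * (1.0 - miss ** num_tokens)


for total, per_tok in [(128, 4), (32, 4)]:
    frac = expert_read_fraction(total, per_tok)
    one = expected_experts_touched(total, per_tok, 1)
    many = expected_experts_touched(total, per_tok, 128)
    print(f"{total}/{per_tok}: per-token fraction={frac:.5f}, "
          f"experts hit by 1 token={one:.1f}, by 128 tokens={many:.1f}")
```

Under this assumption a single decode token touches only a small fraction of the experts, while a prefill block of many tokens is expected to touch almost all of them, which is consistent with reading every expert once in prefill and only the chosen experts in decode.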

Prefill-only model

Blocking: default behaviour when `prefill_only=True` is passed in the compile API

NUM_Q_BLOCKS=<int> sets the number of Q blocks in attention
NUM_FFN_BLOCKS=<int> sets the number of blocks in the FFN
ENABLE_OPT_SWA=1 enables optimized SWA and 0 disables it. When enabled, attention uses only the valid KVs for a given block, which reduces MACs
prefix_caching is not supported in this mode
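The idea of "only valid KVs for a given block" can be sketched as follows. This is an illustration, not the implementation in this PR: the function names are hypothetical, and the mask convention (query `i` attends keys `j` with `i - sliding_window < j <= i`) is an assumption.

```python
from typing import List, Tuple


def valid_kv_ranges(seq_len: int, num_q_blocks: int, sliding_window: int) -> List[Tuple[int, int, int, int]]:
    """Return (q_start, q_end, kv_start, kv_end) per Q block, ends exclusive.

    Assumes a causal sliding-window mask: query i sees keys j with
    i - sliding_window < j <= i.
    """
    if seq_len % num_q_blocks != 0:
        raise ValueError("this sketch needs seq_len divisible by num_q_blocks")
    block = seq_len // num_q_blocks
    ranges = []
    for b in range(num_q_blocks):
        q_start, q_end = b * block, (b + 1) * block
        kv_start = max(0, q_start - sliding_window + 1)  # oldest key the first query sees
        kv_end = q_end                                   # the last query sees up to itself
        ranges.append((q_start, q_end, kv_start, kv_end))
    return ranges


def qk_pairs(ranges: List[Tuple[int, int, int, int]], seq_len: int) -> Tuple[int, int]:
    """Query-key pairs computed with the valid range only vs. with all KV."""
    optimized = sum((qe - qs) * (ke - ks) for qs, qe, ks, ke in ranges)
    full = sum((qe - qs) * seq_len for qs, qe, _, _ in ranges)
    return optimized, full


ranges = valid_kv_ranges(seq_len=8, num_q_blocks=4, sliding_window=2)
print(ranges)
print(qk_pairs(ranges, seq_len=8))
```

Each Q block only needs the keys between the oldest position its first query can see and its own last position, so the number of query-key products shrinks as the sequence grows relative to the window.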

Chunking: pass `enable_chunking=True` and `prefill_only=True` in the compile API

Optimized SWA, i.e. reading only the valid KV as per the diagonal attention mask, is enabled by default for this version
This model can be used for prefix caching by passing `kv_cache_batch_size=<int>` in the compile API
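The two prefill-only modes differ only in the keyword arguments given to compile. A minimal sketch of how those arguments could be assembled is shown below; `prefill_compile_kwargs` is a hypothetical helper, not part of the library, and passing `prefill_seq_len` and `ctx_len` alongside the flags is an assumption.

```python
from typing import Any, Dict, Optional


def prefill_compile_kwargs(
    prefill_seq_len: int,
    ctx_len: int,
    enable_chunking: bool = False,
    kv_cache_batch_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Build compile kwargs for a prefill-only model (hypothetical helper)."""
    # Prefix caching is described only for the chunking mode.
    if kv_cache_batch_size is not None and not enable_chunking:
        raise ValueError("prefix caching requires enable_chunking=True; blocking mode does not support it")
    kwargs: Dict[str, Any] = {
        "prefill_only": True,
        "prefill_seq_len": prefill_seq_len,
        "ctx_len": ctx_len,
    }
    if enable_chunking:
        kwargs["enable_chunking"] = True
    if kv_cache_batch_size is not None:
        kwargs["kv_cache_batch_size"] = kv_cache_batch_size
    return kwargs


blocking = prefill_compile_kwargs(prefill_seq_len=128, ctx_len=4096)
chunking = prefill_compile_kwargs(128, 4096, enable_chunking=True, kv_cache_batch_size=4)
print(blocking)
print(chunking)
# The result would then be unpacked into the compile call on a model
# object loaded elsewhere, e.g. model.compile(**chunking).
```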

Decode-only model

Retain only the sliding window length of KV for sliding window layers: default behaviour when `prefill_seq_len=1` is passed in the compile API

This reduces the amount of DDR used by the model
CB is enabled for this version: pass `continuous_batching=True` in the from_pretrained call, always pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed
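The decode-only arguments can be collected in the same spirit. The sketch below uses a hypothetical helper; the keyword spelling `continuous_batching` and the placement of `full_batch_size` in the compile call (rather than in from_pretrained) are assumptions, and `retain_full_kv` refers to the full-KV variant described in this PR.

```python
from typing import Any, Dict, Optional, Tuple


def decode_kwargs(
    ctx_len: int,
    full_batch_size: Optional[int] = None,
    kv_cache_batch_size: Optional[int] = None,
    continuous_batching: bool = True,
    retain_full_kv: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (from_pretrained kwargs, compile kwargs) for a decode-only model."""
    if continuous_batching and full_batch_size is None:
        raise ValueError("full_batch_size is mandatory when continuous batching is on")
    load_kwargs: Dict[str, Any] = {"continuous_batching": continuous_batching}
    compile_kwargs: Dict[str, Any] = {"prefill_seq_len": 1, "ctx_len": ctx_len}  # seq_len 1 selects decode-only
    if full_batch_size is not None:
        compile_kwargs["full_batch_size"] = full_batch_size
    if kv_cache_batch_size is not None:
        compile_kwargs["kv_cache_batch_size"] = kv_cache_batch_size
    if retain_full_kv:
        compile_kwargs["retain_full_kv"] = True
    return load_kwargs, compile_kwargs


load_kw, compile_kw = decode_kwargs(ctx_len=4096, full_batch_size=4)
print(load_kw)
print(compile_kw)
```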

Full KV for sliding window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API

This uses more DDR because ctx_len KV is retained even for sliding window layers, but attention still reads only sliding-window-length KV
CB is enabled for this version: pass `continuous_batching=True` in the from_pretrained call, always pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed
This is enabled for the multi-turn chat use case, where we run prefill -> decode and then reuse the combined prefill and decode cache to run prefill again, so the full KV must be retained for sliding window layers
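The DDR difference between the two decode-only variants can be estimated by counting retained KV positions per layer. The sketch below is illustrative only: the layer pattern, `ctx_len` and window size are made-up example values, and the helper names are hypothetical.

```python
from typing import Sequence


def kv_positions_retained(ctx_len: int, sliding_window: int,
                          is_sliding_layer: bool, retain_full_kv: bool = False) -> int:
    """KV positions kept in the cache for one layer."""
    if is_sliding_layer and not retain_full_kv:
        return min(sliding_window, ctx_len)  # default: keep only the window
    return ctx_len                           # full-attention layer, or retain_full_kv=True


def total_kv_positions(layer_is_sliding: Sequence[bool], ctx_len: int,
                       sliding_window: int, retain_full_kv: bool = False) -> int:
    """Sum of retained KV positions over all layers."""
    return sum(
        kv_positions_retained(ctx_len, sliding_window, s, retain_full_kv)
        for s in layer_is_sliding
    )


# Example values only: four layers alternating sliding / full attention.
layers = [True, False, True, False]
default_total = total_kv_positions(layers, ctx_len=4096, sliding_window=128)
full_total = total_kv_positions(layers, ctx_len=4096, sliding_window=128, retain_full_kv=True)
print(default_total, full_total)
```

The actual byte count additionally scales with the number of KV heads, the head dimension, the batch size and the cache data type, which are the same for both variants.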

NOTE:

The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it
The 120B model needs NPI. There are two versions of the NPI file, one with and one without subfunctions; both are uploaded here. Pass the file as `node_precision_info=<path to file>`
Using `use_onnx_subfunctions=True` with the prefill-only model is advised, because compilation times are otherwise too high. With this flag the model is expected to export and then fail during compile because it needs the assert SDK, so the user has to run the compilation manually by pasting the command printed in the error

quic-hemagnih · 2025-11-05T15:47:18Z

CI is failing for this PR, https://qraniumci.qualcomm.com/blue/organizations/jenkins/quic_efficient-transformer_public/detail/PR-608/1/pipeline/

ochougul requested review from quic-amitraj, quic-hemagnih and quic-rishinr as code owners November 5, 2025 06:22

ochougul self-assigned this Nov 6, 2025

ochougul added the enhancement (New feature or request) and 1.21.0 labels Nov 6, 2025

ochougul mentioned this pull request Nov 18, 2025

Add ONNX Sub Functions Export Feature for AutoModelForCausalLM #621

Merged

ochougul force-pushed the prefill+decode_gpt_oss branch from 5338048 to a8ebc0f Compare November 24, 2025 21:10

ochougul force-pushed the prefill+decode_gpt_oss branch from d856cd9 to e8d1128 Compare December 9, 2025 12:55

quic-mamta force-pushed the prefill+decode_gpt_oss branch from 626dbda to e8d1128 Compare December 10, 2025 08:26

vbaddi and others added 19 commits December 10, 2025 14:41

[QEff]: Add gpt_oss

af0e6a7

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

nit: update modeling and make transform uniform

2d442eb

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

apirunner change

ab8cc9c

Signed-off-by: Onkar Chougule <[email protected]>

added test along with simplified Hybridcache

e7ecc19

Signed-off-by: Onkar Chougule <[email protected]>

added test assert

a583265

Signed-off-by: Onkar Chougule <[email protected]>

nit: update test gpt file

dc2cc2a

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

nit: update modeling with new decode moe forward

f8dac17

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

nit: seperate gate, up projections for MoE

99815cf

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

nit: remove test file and add sample test in config

4948397

Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

Enable CB for GptOssModel

bde09c7

Signed-off-by: Mamta Singh <[email protected]> Signed-off-by: Onkar Chougule <[email protected]>

Fix tests

3fe07a8

Signed-off-by: Mamta Singh <[email protected]>

Address review comments

3fa01df

Signed-off-by: Mamta Singh <[email protected]>

prefill only changes for gpt-oss

4f910e0

Signed-off-by: Onkar Chougule <[email protected]>

fixed mapping

88f9f75

Signed-off-by: Onkar Chougule <[email protected]>

added test

aac4be0

Signed-off-by: Onkar Chougule <[email protected]>

added test

1d7220a

Signed-off-by: Onkar Chougule <[email protected]>

made example not ugly

51316d5

Signed-off-by: Onkar Chougule <[email protected]>

fixed tests

e6e2969

Signed-off-by: Onkar Chougule <[email protected]>

fixed tests

2334056

Signed-off-by: Onkar Chougule <[email protected]>

ochougul added 17 commits December 10, 2025 14:46

include num_ffn_blocks in hash

a829a05

Signed-off-by: Onkar Chougule <[email protected]>

fixed dynamic range in case of subfunc issue and nonmatching ctx, prefill seq_len for prefill_only gpt_oss model

eb8c7c3

Signed-off-by: Onkar Chougule <[email protected]>

added swa optimization for reducing MACCs using less KV

f6c320e

Signed-off-by: Onkar Chougule <[email protected]>

added opt swa to hash

69a696d

Signed-off-by: Onkar Chougule <[email protected]>

lint and format

50c9b7f

Signed-off-by: Onkar Chougule <[email protected]>

enabled chunking

a53f7bb

Signed-off-by: Onkar Chougule <[email protected]>

added ChunkedPrefillMLP block; fixed passing prefill_only flag and enable_chunking flag to get_specialization for gpt-oss

ff1d05b

Signed-off-by: Onkar Chougule <[email protected]>

added disagg mode example for chunking mode

80571aa

Signed-off-by: Onkar Chougule <[email protected]>

fixed the kwargs passing to build_decode_specialization

c403ba7

Signed-off-by: Onkar Chougule <[email protected]>

pushed latest changes with chunking enabled for prefill along with retaining full KV for decode-only model

3defe4c

Signed-off-by: Onkar Chougule <[email protected]>

added support for prefix caching for gpt-oss

dc546ae

Signed-off-by: Onkar Chougule <[email protected]>

removed error

3b777e8

Signed-off-by: Onkar Chougule <[email protected]>

added errors for prefill-only mode

ba77602

Signed-off-by: Onkar Chougule <[email protected]>

fix decode-only model

0680508

Signed-off-by: Onkar Chougule <[email protected]>

fixed CB for decode-only model

be5ef75

Signed-off-by: Onkar Chougule <[email protected]>

created readme

cc3bb0b

Signed-off-by: Onkar Chougule <[email protected]>

rebased and made setup_onnx_sub explicit

efd671a

Signed-off-by: Onkar Chougule <[email protected]>

ochougul force-pushed the prefill+decode_gpt_oss branch from aabd446 to efd671a Compare December 10, 2025 14:51

ochougul and others added 9 commits December 10, 2025 14:58

linting error

86733cc

Signed-off-by: Onkar Chougule <[email protected]>

fixed use_onnx_subfunc

d46c9d0

Signed-off-by: Onkar Chougule <[email protected]>

fixed tests

82caac6

Signed-off-by: Onkar Chougule <[email protected]>

linter

65f93b1

Signed-off-by: Onkar Chougule <[email protected]>

added missing marker

edbc7e8

Signed-off-by: Onkar Chougule <[email protected]>

pushed tests fix

4270d2c

Signed-off-by: Onkar Chougule <[email protected]>

fixed flux pipeline

85b23cd

Signed-off-by: Onkar Chougule <[email protected]>

tests fixed

c78ec66

Signed-off-by: Onkar Chougule <[email protected]>

Fix CI error for PL=1

502d289

Signed-off-by: Mamta Singh <[email protected]>

quic-mamta force-pushed the prefill+decode_gpt_oss branch from cc5183f to 502d289 Compare December 14, 2025 08:25

Merge branch 'main' into prefill+decode_gpt_oss

49bb40b

quic-mamta merged commit a036e97 into main Dec 14, 2025
4 of 5 checks passed

6 participants
