Prefill+decode gpt oss #608
Merged
quic-mamta merged 51 commits into main on Dec 14, 2025
Conversation
quic-akuruvil pushed a commit that referenced this pull request on Dec 15, 2025
# We should be using disaggregated serving for the GPT-OSS model for best performance

- GPT-OSS has a total_experts/experts_per_tok ratio of 128/4 for the 120B variant and 32/4 for the 20B variant
- We use a read-all-experts-exactly-once strategy in the prefill-only model
- We treat expert weights as activations for the decode-only model, meaning only the chosen experts are read

# Prefill-only model

## Blocking: default behaviour when `prefill_only=True` in the compile API

- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in the FFN
- `ENABLE_OPT_SWA=0 or 1` enables/disables optimized SWA; when enabled, attention reads only the valid KVs for a given block, reducing MACs
- prefix_caching is not supported in this mode

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the compile API

- Optimized SWA, i.e. reading only the valid KV according to the diagonal attention mask, is enabled by default in this version
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in the compile API

# Decode-only model

## Retain sliding-window length of KV for sliding-window layers: default behaviour when `prefill_seq_len=1` in the compile API

- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed

## Full KV for sliding-window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API

- This uses more DDR, since we retain ctx_len KV even for sliding-window layers, but attention still reads only sliding-window-length KV
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed
- This is enabled for the multi-turn chat use case, where we run prefill -> decode and then use the combined prefill and decode cache to run prefill again, so we want to retain full KV for sliding-window layers

NOTE:
* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it there
* The 120B model needs an NPI; two versions, one with and one without subfunctions, are uploaded here; pass one as `node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise compilation times are too high. With this flag the model is expected to export and then fail during compile, since it needs the assert SDK; the user should then run the compilation manually by pasting the command printed in the error

Hedged usage sketches for each of these modes follow below.

---------

Signed-off-by: vbaddi <quic_vbaddi@quicinc.com>
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Signed-off-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Co-authored-by: Vinayak Baddi <quic_vbaddi@quicinc.com>
Co-authored-by: Vinayak Baddi <vbaddi@qti.qualcomm.com>
Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Co-authored-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
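A quick back-of-envelope check of what those expert ratios mean for decode, where experts are treated as activations; the 128/4 and 32/4 ratios come from the description above, and the loop is just illustration:

```python
# Fraction of expert weights actually read per token when only the chosen
# experts are fetched (decode-only model).
for name, total_experts, experts_per_tok in [("gpt-oss-120b", 128, 4), ("gpt-oss-20b", 32, 4)]:
    frac = experts_per_tok / total_experts
    print(f"{name}: {experts_per_tok}/{total_experts} = {frac:.1%} of experts per token")
```

So decode touches roughly 3.1% of expert weights per token for 120B and 12.5% for 20B, which is why reading only chosen experts pays off there while prefill prefers reading all experts once.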
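A minimal sketch of the prefill-only blocking path, assuming the QEfficient `QEFFAutoModelForCausalLM` API. `prefill_only`, `use_onnx_subfunctions`, and the three environment knobs come from this PR; the model card, block counts, and the remaining compile arguments are illustrative placeholders:

```python
import os

from QEfficient import QEFFAutoModelForCausalLM

# Blocking knobs from this PR; values here are illustrative.
os.environ["NUM_Q_BLOCKS"] = "4"    # Q blocks in attention
os.environ["NUM_FFN_BLOCKS"] = "4"  # blocks in the FFN
os.environ["ENABLE_OPT_SWA"] = "1"  # read only valid KVs per block in attention

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")

# prefill_only=True selects the prefill-only model; prefix caching is not
# supported in this blocking mode. use_onnx_subfunctions=True is advised here
# to keep compile times manageable; per the NOTE above, expect export to
# succeed and compile to stop with a command to rerun manually.
model.compile(
    prefill_only=True,
    use_onnx_subfunctions=True,
    prefill_seq_len=128,
    ctx_len=4096,
    num_cores=16,
    num_devices=4,
)
```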
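The chunked prefill-only variant with prefix caching, under the same assumptions; `enable_chunking` is from this PR and all argument values are illustrative:

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")

# Chunked prefill: optimized SWA (diagonal-mask KV reads) is on by default.
# kv_cache_batch_size enables prefix caching for this variant.
model.compile(
    prefill_only=True,
    enable_chunking=True,
    kv_cache_batch_size=4,
    prefill_seq_len=128,
    ctx_len=4096,
    num_cores=16,
    num_devices=4,
)
```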
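The decode-only model with continuous batching, same assumptions as above; per the description, `continuous_batching=True` goes in `from_pretrained`, `full_batch_size` is mandatory, and `kv_cache_batch_size` is optional:

```python
from QEfficient import QEFFAutoModelForCausalLM

# CB must be requested at load time for the decode-only model.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    continuous_batching=True,
)

# prefill_seq_len=1 selects the decode-only model; by default only
# sliding-window-length KV is retained for sliding-window layers (less DDR).
# Avoid use_onnx_subfunctions=True here: it currently fails compilation.
model.compile(
    prefill_seq_len=1,
    ctx_len=4096,
    full_batch_size=8,
    kv_cache_batch_size=8,
    num_cores=16,
    num_devices=4,
)
```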
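And the full-KV variant for multi-turn chat, reusing the `model` loaded in the previous sketch; `retain_full_kv` is from this PR, everything else is an illustrative placeholder:

```python
# Multi-turn variant: retain ctx_len KV even for sliding-window layers so a
# later prefill can reuse the combined prefill+decode cache. Attention still
# reads only sliding-window-length KV, at the cost of extra DDR.
model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,
    ctx_len=4096,
    full_batch_size=8,
    num_cores=16,
    num_devices=4,
)
```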
quic-amitraj pushed a commit that referenced this pull request on Dec 22, 2025
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request on Dec 23, 2025
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request on Jan 2, 2026
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request on Jan 2, 2026
quic-dhirajku pushed a commit to quic-dhirajku/efficient-transformers that referenced this pull request on Jan 2, 2026