
Prefill+decode gpt oss #608

Merged
quic-mamta merged 51 commits into main from prefill+decode_gpt_oss on Dec 14, 2025
Conversation

Contributor

@ochougul ochougul commented Nov 5, 2025

We should be using disaggregated serving for the GPT-OSS model for best performance:

  • The GPT-OSS model has a total_experts/experts_per_tok ratio of 128/4 for the 120B variant and 32/4 for the 20B variant
  • The prefill-only model always uses a read-all-experts-only-once strategy
  • The decode-only model treats expert weights as activations, i.e. it reads only the chosen experts

Prefill-only model

Blocking: default behaviour when `prefill_only=True` is passed to the compile API (see the sketch below)

  • NUM_Q_BLOCKS=<int> sets the number of Q blocks in attention
  • NUM_FFN_BLOCKS=<int> sets the number of blocks in the FFN
  • ENABLE_OPT_SWA=0 or 1 disables or enables optimized SWA; when enabled, only the valid KVs for the given block are read in attention, reducing MACs
  • prefix_caching is not supported with this mode
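
A minimal sketch of how the blocking prefill-only compile might be driven, assuming the library's standard `QEFFAutoModelForCausalLM` entry point; the checkpoint name, block counts, and compile sizes are illustrative placeholders, and only the flag and environment-variable names come from this PR.

```python
import os

from QEfficient import QEFFAutoModelForCausalLM

# Blocking knobs are read from the environment (names taken from this PR description).
os.environ["NUM_Q_BLOCKS"] = "4"    # number of Q blocks in attention (example value)
os.environ["NUM_FFN_BLOCKS"] = "4"  # number of FFN blocks (example value)
os.environ["ENABLE_OPT_SWA"] = "1"  # 1 = read only the valid KVs for each block, reducing MACs

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # placeholder checkpoint
model.compile(
    prefill_only=True,     # prefill-only specialization; blocking is the default behaviour
    prefill_seq_len=2048,  # example value
    ctx_len=4096,          # example value
    num_cores=16,          # example value
    num_devices=1,         # example value
)
```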

Chunking: pass `enable_chunking=True` and `prefill_only=True` to the compile API (see the sketch below)

  • Optimized SWA, i.e. reading only the valid KV as per the diagonal attention mask, is enabled by default for this version
  • This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` to the compile API
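
A similar hedged sketch for the chunked prefill-only variant with prefix caching; only the flag names come from this PR, the sizes are placeholders.

```python
model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # placeholder checkpoint
model.compile(
    prefill_only=True,
    enable_chunking=True,   # chunked prefill; optimized SWA is on by default in this mode
    kv_cache_batch_size=4,  # enables prefix caching (example value)
    prefill_seq_len=2048,   # example value
    ctx_len=4096,           # example value
    num_cores=16,           # example value
)
```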

Decode-only model

Retain sliding-window length of KV for sliding-window layers: default behaviour when `prefill_seq_len=1` is passed to the compile API (see the sketch below)

  • This reduces the amount of DDR used by the model
  • Continuous batching (CB) is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed
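
A hedged sketch of the decode-only compile with the default sliding-window-length KV; the checkpoint and sizes are placeholders, the flags are the ones named above.

```python
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",      # placeholder checkpoint
    continuous_batching=True,  # CB must be requested in the from_pretrained call
)
model.compile(
    prefill_seq_len=1,       # prefill_seq_len=1 selects the decode-only model
    ctx_len=4096,            # example value
    full_batch_size=8,       # required when continuous batching is enabled
    kv_cache_batch_size=16,  # optional, only if needed
    num_cores=16,            # example value
)
```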

Full KV for sliding-window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` to the compile API (see the sketch below)

  • This uses more DDR, since we retain ctx_len KV even for sliding-window layers, but only the sliding-window length of KV is read in attention
  • Continuous batching (CB) is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed
  • This is intended for the multi-turn chat use case, where we run prefill -> decode and then run prefill again over the combined cache of the previous prefill and decode, so we want to retain the full KV for sliding-window layers
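
The corresponding sketch for the full-KV variant used in multi-turn chat; as before, only `retain_full_kv` and the other flag names are from this PR, the values are placeholders.

```python
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",      # placeholder checkpoint
    continuous_batching=True,
)
model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,     # keep ctx_len KV even for sliding-window layers (uses more DDR)
    ctx_len=4096,            # example value
    full_batch_size=8,       # required with continuous batching
    kv_cache_batch_size=16,  # optional
    num_cores=16,            # example value
)
```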

NOTE:

  • The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it there
  • The 120B model needs an NPI (node precision info) file; two versions, with and without subfunctions, are uploaded here, and the file is passed as `node_precision_info=<path to file>`
  • It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise compilation times are too high. With this flag the model is expected to export and then fail during compile because it needs the assert SDK, so the user should run the compilation manually by pasting the command printed in the error (see the sketch below)
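
For the 120B prefill-only case, a hedged sketch of how the NPI file and subfunctions might be passed; the checkpoint name and NPI path are hypothetical placeholders for the file uploaded with this PR.

```python
model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")  # placeholder checkpoint
model.compile(
    prefill_only=True,
    use_onnx_subfunctions=True,                      # advised for prefill-only to keep compile time reasonable
    node_precision_info="path/to/gpt_oss_npi_file",  # hypothetical path; use the NPI file uploaded with this PR
    prefill_seq_len=2048,                            # example value
    ctx_len=4096,                                    # example value
    num_cores=16,                                    # example value
)
# With use_onnx_subfunctions=True the export is expected to succeed but the compile
# step to fail (it needs the assert SDK); rerun the compile command printed in the
# error manually to finish compilation.
```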

vbaddi and others added 19 commits December 10, 2025 14:41
@ochougul ochougul force-pushed the prefill+decode_gpt_oss branch from aabd446 to efd671a Compare December 10, 2025 14:51
ochougul and others added 9 commits December 10, 2025 14:58
@quic-mamta quic-mamta force-pushed the prefill+decode_gpt_oss branch from cc5183f to 502d289 Compare December 14, 2025 08:25
@quic-mamta quic-mamta merged commit a036e97 into main Dec 14, 2025
4 of 5 checks passed
quic-akuruvil pushed a commit that referenced this pull request Dec 15, 2025
quic-amitraj pushed a commit that referenced this pull request Dec 22, 2025
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Dec 23, 2025
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request Jan 2, 2026
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request Jan 2, 2026
quic-dhirajku pushed a commit to quic-dhirajku/efficient-transformers that referenced this pull request Jan 2, 2026
@quic-rishinr quic-rishinr deleted the prefill+decode_gpt_oss branch March 30, 2026 04:49

Labels

1.21.0, enhancement (New feature or request)
