Prefill+decode gpt oss #608
Merged
quic-mamta merged 51 commits into main on Dec 14, 2025
Conversation
quic-akuruvil pushed a commit that referenced this pull request on Dec 15, 2025
# We should be using disaggregated serving for the GPT-OSS model for best performance

- GPT-OSS has a total_experts/experts_per_tok ratio of 128/4 for the 120B variant and 32/4 for the 20B variant
- We use a read-all-experts-exactly-once strategy in the prefill-only model
- We treat expert weights as activations for the decode-only model, meaning only the chosen experts are read

# Prefill-only model

## Blocking: default behaviour when `prefill_only=True` in the compile API

- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in the FFN
- `ENABLE_OPT_SWA=0 or 1` enables/disables optimized SWA; when enabled, attention reads only the valid KVs for a given block, reducing MACs
- prefix_caching is not supported in this mode

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the compile API

- Optimized SWA, i.e. reading only the valid KV according to the diagonal attention mask, is enabled by default in this version
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in the compile API

# Decode-only model

## Retain sliding-window length of KV for sliding-window layers: default behaviour when `prefill_seq_len=1` in the compile API

- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed

## Full KV for sliding-window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API

- This uses more DDR, since we retain ctx_len KV even for sliding-window layers, but attention still reads only sliding-window-length KV
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed
- This is enabled for the multi-turn chat use case, where we run prefill -> decode and then use the combined prefill and decode cache to run prefill again, so we want to retain full KV for sliding-window layers

NOTE:
* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it there
* The 120B model needs an NPI; two versions, one with and one without subfunctions, are uploaded here; pass one as `node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise compilation times are too high. With this flag the model is expected to export and then fail during compile, since it needs the assert SDK; the user should then run the compilation manually by pasting the command printed in the error

Hedged usage sketches for each of these modes follow below.

---------

Signed-off-by: vbaddi <quic_vbaddi@quicinc.com>
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Signed-off-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Co-authored-by: Vinayak Baddi <quic_vbaddi@quicinc.com>
Co-authored-by: Vinayak Baddi <vbaddi@qti.qualcomm.com>
Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Co-authored-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
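A quick back-of-envelope check of what those expert ratios mean for decode, where experts are treated as activations; the 128/4 and 32/4 ratios come from the description above, and the loop is just illustration:

```python
# Fraction of expert weights actually read per token when only the chosen
# experts are fetched (decode-only model).
for name, total_experts, experts_per_tok in [("gpt-oss-120b", 128, 4), ("gpt-oss-20b", 32, 4)]:
    frac = experts_per_tok / total_experts
    print(f"{name}: {experts_per_tok}/{total_experts} = {frac:.1%} of experts per token")
```

So decode touches roughly 3.1% of expert weights per token for 120B and 12.5% for 20B, which is why reading only chosen experts pays off there while prefill prefers reading all experts once.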
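A minimal sketch of the prefill-only blocking path, assuming the QEfficient `QEFFAutoModelForCausalLM` API. `prefill_only`, `use_onnx_subfunctions`, and the three environment knobs come from this PR; the model card, block counts, and the remaining compile arguments are illustrative placeholders:

```python
import os

from QEfficient import QEFFAutoModelForCausalLM

# Blocking knobs from this PR; values here are illustrative.
os.environ["NUM_Q_BLOCKS"] = "4"    # Q blocks in attention
os.environ["NUM_FFN_BLOCKS"] = "4"  # blocks in the FFN
os.environ["ENABLE_OPT_SWA"] = "1"  # read only valid KVs per block in attention

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")

# prefill_only=True selects the prefill-only model; prefix caching is not
# supported in this blocking mode. use_onnx_subfunctions=True is advised here
# to keep compile times manageable; per the NOTE above, expect export to
# succeed and compile to stop with a command to rerun manually.
model.compile(
    prefill_only=True,
    use_onnx_subfunctions=True,
    prefill_seq_len=128,
    ctx_len=4096,
    num_cores=16,
    num_devices=4,
)
```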
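The chunked prefill-only variant with prefix caching, under the same assumptions; `enable_chunking` is from this PR and all argument values are illustrative:

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")

# Chunked prefill: optimized SWA (diagonal-mask KV reads) is on by default.
# kv_cache_batch_size enables prefix caching for this variant.
model.compile(
    prefill_only=True,
    enable_chunking=True,
    kv_cache_batch_size=4,
    prefill_seq_len=128,
    ctx_len=4096,
    num_cores=16,
    num_devices=4,
)
```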
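The decode-only model with continuous batching, same assumptions as above; per the description, `continuous_batching=True` goes in `from_pretrained`, `full_batch_size` is mandatory, and `kv_cache_batch_size` is optional:

```python
from QEfficient import QEFFAutoModelForCausalLM

# CB must be requested at load time for the decode-only model.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    continuous_batching=True,
)

# prefill_seq_len=1 selects the decode-only model; by default only
# sliding-window-length KV is retained for sliding-window layers (less DDR).
# Avoid use_onnx_subfunctions=True here: it currently fails compilation.
model.compile(
    prefill_seq_len=1,
    ctx_len=4096,
    full_batch_size=8,
    kv_cache_batch_size=8,
    num_cores=16,
    num_devices=4,
)
```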
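And the full-KV variant for multi-turn chat, reusing the `model` loaded in the previous sketch; `retain_full_kv` is from this PR, everything else is an illustrative placeholder:

```python
# Multi-turn variant: retain ctx_len KV even for sliding-window layers so a
# later prefill can reuse the combined prefill+decode cache. Attention still
# reads only sliding-window-length KV, at the cost of extra DDR.
model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,
    ctx_len=4096,
    full_batch_size=8,
    num_cores=16,
    num_devices=4,
)
```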
quic-amitraj pushed a commit that referenced this pull request on Dec 22, 2025
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request on Dec 23, 2025
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request on Jan 2, 2026
abhishek-singh591 pushed a commit to abhishek-singh591/quic_abhishek that referenced this pull request on Jan 2, 2026
quic-dhirajku pushed a commit to quic-dhirajku/efficient-transformers that referenced this pull request on Jan 2, 2026