Initialize Qwen3.5 mutable buffers during export #17801
```diff
@@ -137,6 +137,29 @@
 }


+def _get_additional_export_passes(
+    model_class: str,
+) -> List[InitializedMutableBufferPass]:
+    patterns = []
+
+    if model_class in TORCHTUNE_DEFINED_MODELS:
+        patterns.append("kv_cache_pos")
+
+    # Qwen3.5 uses internal mutable buffers for both the hybrid KV path and
+    # DeltaNet recurrent/conv states.
+    if model_class.startswith("qwen3_5"):
+        patterns.extend(
+            [
+                "k_cache",
+                "v_cache",
```
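The tail of the hunk is truncated above. As a rough sketch of how the collected patterns plausibly feed the pass named in the signature - the import path matches ExecuTorch's existing InitializedMutableBufferPass, but this exact wiring is an assumption, not the PR's code:

```python
from typing import List

from executorch.exir.passes.init_mutable_pass import InitializedMutableBufferPass


def build_init_buffer_passes(
    patterns: List[str],
) -> List[InitializedMutableBufferPass]:
    # The pass matches mutable buffers whose fully qualified names contain
    # one of these substrings and serializes their initial contents, so the
    # runtime starts from a defined state instead of uninitialized memory.
    return [InitializedMutableBufferPass(patterns)] if patterns else []
```

For torchtune models this covers only the small kv_cache_pos buffer; the Qwen3.5 branch widens the match set considerably, which is what the review below pushes back on.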
Comment on lines +153 to +154 - suggested change:

```diff
-                "k_cache",
-                "v_cache",
+                ".k_cache",
+                ".v_cache",
```
Initializing KV cache buffers via InitializedMutableBufferPass will cause their full tensor contents to be serialized into the .pte (the emitter treats et_init_buffer+mutable_buffer as const). For k_cache/v_cache this can be extremely large (per-layer [B, H, S, D]) and may blow up export size and load time. Consider avoiding initializing the full KV caches at export (e.g., only init the small state buffers like conv_state/recurrent_state, or add a runtime/cache-reset path that deterministically zeros these buffers without serializing them).
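To put the size concern in numbers, a back-of-envelope sketch; the shapes below are illustrative assumptions, not Qwen3.5's actual config:

```python
def kv_cache_export_bytes(
    layers: int,
    batch: int,
    kv_heads: int,
    max_seq_len: int,
    head_dim: int,
    dtype_bytes: int = 4,  # fp32
) -> int:
    # One [B, H, S, D] tensor each for k_cache and v_cache per layer, all of
    # which would be serialized verbatim into the .pte once initialized.
    per_cache = batch * kv_heads * max_seq_len * head_dim * dtype_bytes
    return layers * 2 * per_cache


# e.g. 24 layers, batch 1, 8 KV heads, max seq len 2048, head dim 64:
print(kv_cache_export_bytes(24, 1, 8, 2048, 64) / 1e6)  # ~201 MB, all zeros
```

The conv_state/recurrent_state buffers, by contrast, are small fixed-size per-layer states, so initializing only those keeps the .pte compact.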
This is a good point. @Phineas1500, does Qwen3.5 require initial state for the kv-cache, conv_state, and recurrent_state? The InitializedMutableBufferPass is only required for mutable buffers with initial state.
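To illustrate the distinction being asked about, a minimal eager-mode sketch (the module and shapes are hypothetical): a mutable buffer needs export-time initialization only if its contents are read before they are first written.

```python
import torch
from torch import nn


class TinyState(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Read-modify-write across decode steps: the very first read sees the
        # buffer's contents, so its zeros are genuine initial state.
        self.register_buffer("recurrent_state", torch.zeros(1, 16, 16))
        # Fully overwritten before any read: no initial state required.
        self.register_buffer("scratch", torch.empty(1, 16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.recurrent_state.add_(x)    # depends on prior contents
        self.scratch.copy_(x.sum(-1))   # clobbered on every call
        return self.recurrent_state.sum() + self.scratch.sum()
```

Whether k_cache/v_cache fall in the first or second category is exactly the open question here.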
Seems like a ~5 MB size increase from including initial state. Not too sure why - I was expecting a bit more.

Output is the same with temp=0.

Seems like the state is already zeroed here:
https://github.com/pytorch/executorch/blob/main/examples/models/llama/attention.py#L720