Is your feature request related to a problem? Please describe.
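For MoE models such as Qwen 30B or DeepSeek, non-colocated refit needs coalesced (bucketized) tensor broadcast to raise the effective broadcast bandwidth. These models carry many small expert weight tensors, so broadcasting them one at a time leaves the transfer dominated by per-call overhead rather than by link bandwidth.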
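To make the request concrete, here is a minimal sketch of the bucketing step only, assuming tensors are described by a name and a byte size. The names `plan_buckets`, `TensorMeta`, and `bucket_bytes` are hypothetical and are not taken from the existing codebase. In a real refit path, each bucket would presumably be flattened into one contiguous buffer, sent with a single collective call (for example `torch.distributed.broadcast`), and split back into individual tensors on the receiving side.

```python
from typing import Iterable, Iterator, List, Tuple

# (parameter name, size in bytes); hypothetical metadata, not a real API
TensorMeta = Tuple[str, int]


def plan_buckets(
    tensors: Iterable[TensorMeta], bucket_bytes: int
) -> Iterator[List[TensorMeta]]:
    """Greedily pack tensors, in order, into buckets of at most bucket_bytes.

    A tensor larger than bucket_bytes is placed in a bucket of its own.
    """
    if bucket_bytes <= 0:
        raise ValueError("bucket_bytes must be positive")
    bucket: List[TensorMeta] = []
    used = 0
    for name, nbytes in tensors:
        # Close the current bucket if adding this tensor would exceed the cap.
        if bucket and used + nbytes > bucket_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, nbytes))
        used += nbytes
    if bucket:
        yield bucket


# Illustrative sizes only: eight 4 MiB expert weights and one 64 MiB embedding.
MiB = 1 << 20
weights = [(f"experts.{i}.w1", 4 * MiB) for i in range(8)] + [("embed", 64 * MiB)]
buckets = list(plan_buckets(weights, bucket_bytes=16 * MiB))
print([len(b) for b in buckets])  # [4, 4, 1]
```

With a 16 MiB cap, the nine tensors above are sent in three broadcasts instead of nine. The greedy in-order packing keeps the parameter order identical on sender and receiver, so both sides can derive the same bucket layout from the same metadata without exchanging it.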