Bucket/coalesced tensor broadcast in non-colocated refit

**Is your feature request related to a problem? Please describe.**
For MoE models such as Qwen 30B or DeepSeek, non-colocated refit needs coalesced (bucketized) tensor broadcast to improve the effective broadcast bandwidth.
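To make the request concrete, here is a minimal sketch of the bucketing step, assuming the goal is to group many small tensors (for example, per-expert weights) into size-capped buckets so that each bucket can be sent with one collective call instead of one call per tensor. The names `plan_buckets`, `TensorMeta`, and `bucket_bytes` are hypothetical and are not taken from the existing refit code.

```python
from typing import Iterable, Iterator, List, Tuple

# (parameter name, size in bytes); a hypothetical stand-in for real tensor metadata
TensorMeta = Tuple[str, int]


def plan_buckets(
    tensors: Iterable[TensorMeta], bucket_bytes: int
) -> Iterator[List[TensorMeta]]:
    """Greedily group consecutive tensors so each bucket stays within bucket_bytes.

    A single tensor larger than the cap is placed in a bucket of its own.
    """
    bucket: List[TensorMeta] = []
    used = 0
    for name, nbytes in tensors:
        # Close the current bucket if adding this tensor would exceed the cap.
        if bucket and used + nbytes > bucket_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, nbytes))
        used += nbytes
    if bucket:
        yield bucket


# Hypothetical MoE layer: eight expert weights of 4 KiB each, 16 KiB bucket cap.
experts = [(f"experts.{i}.w", 4 * 1024) for i in range(8)]
buckets = list(plan_buckets(experts, bucket_bytes=16 * 1024))
for i, b in enumerate(buckets):
    print(i, [name for name, _ in b], sum(n for _, n in b))
```

In an actual refit path, the sender would presumably pack each bucket into one contiguous buffer and issue a single broadcast for it (for instance via `torch.distributed.broadcast`), and the receiver would split the buffer back into tensors using the same plan. The bucket size is a tuning knob: larger buckets amortize per-call overhead but need more staging memory.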



Bucket/coalesced tensor broadcast in non-colocated refit #1286
