[Unity][Dlight] Handle Epilogue Broadcasting #15252
Merged
This PR improves the Decode-GEMV scheduling by further analyzing its epilogue pattern.
The existing behavior assumes that the outcome of cross-thread reduction stays in register files local to each thread, which is further used to calculate the epilogue in the same thread.
This strategy means the cross-thread reduction outcome is stored only on thread 0, while the other threads cannot participate in subsequent computation (i.e. epilogue). Related: #15192.
When the epilogue is relatively lightweight, e.g. an elementwise add or a cast on scalars, this strategy is optimal. However, once the outcome needs to be broadcast to compute over a non-trivial region, for example to act as the normalizer of np.mean, it becomes much slower because only one thread in a thread block is effectively used. In this case, we need to broadcast the cross-thread reduction outcome through shared memory, making it visible to the other threads, and then bind the compute region to all threads in the thread block.
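To make the difference concrete, here is a toy pure-Python model of a single thread block. It is only a sketch and not the dlight implementation: `NUM_THREADS`, `cross_thread_sum`, `epilogue_on_thread0`, and `epilogue_broadcast` are hypothetical names, and the "shared memory" slot is just a local variable. It counts how many epilogue elements each simulated thread computes under the two strategies.

```python
# Toy model of one thread block; illustrative only, not TVM/dlight code.
NUM_THREADS = 8  # hypothetical block size (power of two for the tree reduction)


def cross_thread_sum(partials):
    """Tree-reduce per-thread partial sums; the result ends up on thread 0."""
    vals = list(partials)
    stride = len(vals) // 2
    while stride > 0:
        for tid in range(stride):
            vals[tid] += vals[tid + stride]
        stride //= 2
    return vals[0]


def epilogue_on_thread0(total, row):
    """Old strategy: only thread 0 holds `total`, so it computes every element."""
    work = [0] * NUM_THREADS
    out = []
    for x in row:
        out.append(x / total)
        work[0] += 1
    return out, work


def epilogue_broadcast(total, row):
    """New strategy: publish `total` once, then spread the epilogue over all threads."""
    shared = total  # stand-in for a shared-memory slot written by thread 0
    work = [0] * NUM_THREADS
    out = [0.0] * len(row)
    for tid in range(NUM_THREADS):
        # Each simulated thread handles a strided slice of the compute region.
        for i in range(tid, len(row), NUM_THREADS):
            out[i] = row[i] / shared
            work[tid] += 1
    return out, work


row = [float(v) for v in range(1, 33)]
# Each thread first accumulates its own strided partial sum.
partials = [sum(row[tid::NUM_THREADS]) for tid in range(NUM_THREADS)]
total = cross_thread_sum(partials)

out_a, work_a = epilogue_on_thread0(total, row)
out_b, work_b = epilogue_broadcast(total, row)
print(work_a)  # per-thread element counts, old strategy
print(work_b)  # per-thread element counts, broadcast strategy
```

In this model both strategies produce the same normalized row, but the first leaves every thread except thread 0 idle during the epilogue, while the second divides the region evenly across the block. On real hardware the broadcast also costs a shared-memory write and a synchronization, which is why the thread-local strategy remains preferable for scalar epilogues.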