issue with the implementation of column_sum_reduce #804

RezaYazdaniAminabadi merged 2 commits into deepspeedai:master from zmxdream:zmx-patch-1
Conversation
Hi, I took a look at the code of `column_sum_reduce` and I have two questions:

1. The goal of `column_sum_reduce` is to compute the column sum of the `inp` matrix with shape `[rows, width]`, so the result shape should be `[width]`, right? It seems the bounds check on `pos` is not correct.
2. The CUDA kernel is based on the assumption that threads with the same `threadIdx.y` are grouped into one `thread_block_tile`, with a blockDim of (32, 32). I read the NVIDIA slides https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf: a thread block tile is a subset of the threads of a thread block, divided into tiles in row-major order. Doesn't that mean threads with the same `threadIdx.x` are grouped into a `thread_block_tile`?

Thanks!
```diff
 if (threadIdx.x == 0) {
     int pos = blockIdx.x * TILE_DIM + threadIdx.y;
-    if (pos < (rows * width)) out[pos] = sum;
+    if (pos < width) out[pos] = sum;
```
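To see why the old guard is wrong, here is a small Python model of the kernel's output indexing (an illustration, not the CUDA code itself; it only assumes what the diff shows: `TILE_DIM = 32`, one block per 32 output columns, and thread `(x=0, y)` writing `pos = blockIdx.x * TILE_DIM + threadIdx.y`):

```python
TILE_DIM = 32

def write_positions(width):
    """All `pos` values that reach the `out[pos] = sum` line."""
    num_blocks = (width + TILE_DIM - 1) // TILE_DIM  # grid rounded up
    return [bx * TILE_DIM + ty
            for bx in range(num_blocks)
            for ty in range(TILE_DIM)]

rows, width = 4, 100  # width deliberately NOT divisible by 32

pos_values = write_positions(width)  # 4 blocks * 32 lanes = 128 positions

# Old guard: pos < rows * width. `out` has only `width` elements, so any
# pos in [width, rows * width) slips through and writes out of bounds.
old_oob = [p for p in pos_values if width <= p < rows * width]

# Fixed guard: pos < width. Nothing escapes the valid range.
new_oob = [p for p in pos_values if p >= width and p < width]

print(len(old_oob))  # 28 stray writes (positions 100..127)
print(len(new_oob))  # 0
```

When `width` is a multiple of 32 the two guards happen to accept the same positions, which is why the bug was invisible for typical hidden sizes.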
Thanks for fixing this! I would say it still worked when the hidden dimension was divisible by 32; however, it would have caused an out-of-bounds write when the hidden dimension is not divisible by 32!
Yes! Thanks for your approval!
Hi @zmx19951103, thanks for fixing this bug. Regarding your second question: both the x and y dimensions are assigned to different thread_block tiles. However, since this is a 2-dimensional tile, we just use `threadIdx.y` for saving the output after everything has been reduced across each tile, i.e. across the threads that share the same y index while the x index runs from 0 to 31. So what you are saying is also true, and this is also our assumption when reducing the elements in a row!
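The reply can be checked with a short Python model of CUDA's row-major thread ranking (an assumption stated here explicitly: `rank = threadIdx.y * blockDim.x + threadIdx.x`, and `tiled_partition<32>` cuts a block into groups of 32 consecutive ranks):

```python
BLOCK_X = BLOCK_Y = 32
TILE = 32

# Enumerate (threadIdx.x, threadIdx.y) in linear-rank order, then cut the
# block into tiles of 32 consecutive ranks, as tiled_partition<32> would.
threads = [(x, y) for y in range(BLOCK_Y) for x in range(BLOCK_X)]
tiles = [threads[i:i + TILE] for i in range(0, len(threads), TILE)]

# Every tile holds exactly one "row" of the block: a single threadIdx.y
# value, with threadIdx.x running 0..31 -- matching the reply above.
for tile in tiles:
    assert len({y for _, y in tile}) == 1
    assert [x for x, _ in tile] == list(range(TILE))

print(len(tiles))  # 32 tiles, one per threadIdx.y value
```

So within one tile the y index is fixed and x varies, which is exactly why the kernel can reduce across `threadIdx.x` inside a tile and then index the output with `threadIdx.y`.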