Issue with the implementation of column_sum_reduce #805

Hi, I took a look at the code of column_sum_reduce and have two questions:

1. The goal of column_sum_reduce is to compute the column sums of the inp matrix with shape [rows, width], so the result should have shape [width], right? If so, the bounds check on pos does not look right (see https://github.com/microsoft/DeepSpeed/pull/804#issue-581505804).
2. The CUDA kernel is implemented on the assumption that threads with the same threadIdx.y are grouped into one thread_block_tile, with blockDim = (32, 32). According to the NVIDIA presentation at https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf, a thread block tile is a subset of the threads of a thread block, divided into tiles in row-major order. Doesn't that mean threads with the same threadIdx.x are grouped into one thread_block_tile?
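
To make both questions concrete, here is a minimal host-side C++ sketch. It is not the DeepSpeed kernel. It only models the expected result and the thread-to-tile mapping. All function names (`column_sum_reference`, `linear_thread_rank`, `tile_index`) are hypothetical and introduced for illustration. The sketch assumes a row-major input and relies on CUDA's linear thread ordering within a block, in which threadIdx.x varies fastest and tiles are cut from consecutive linear ranks.

```cpp
#include <vector>

// Hypothetical CPU reference for the intended result: sums each column of a
// row-major matrix of shape [rows, width] and returns a vector of length width.
std::vector<float> column_sum_reference(const std::vector<float>& inp, int rows, int width) {
    std::vector<float> out(width, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < width; ++c) {
            out[c] += inp[r * width + c];
        }
    }
    return out;
}

// Linear rank of a thread inside a 2D block. CUDA orders threads with
// threadIdx.x varying fastest: rank = threadIdx.y * blockDim.x + threadIdx.x.
int linear_thread_rank(int thread_x, int thread_y, int block_dim_x) {
    return thread_y * block_dim_x + thread_x;
}

// Index of the tile a thread falls into when the block is split into
// consecutive groups of tile_size threads by linear rank (an assumption that
// mirrors how a block is partitioned into fixed-size tiles).
int tile_index(int thread_x, int thread_y, int block_dim_x, int tile_size) {
    return linear_thread_rank(thread_x, thread_y, block_dim_x) / tile_size;
}
```

With blockDim = (32, 32) and a tile size of 32, `tile_index(0, 3, 32, 32)` and `tile_index(31, 3, 32, 32)` both give 3, while `tile_index(5, 0, 32, 32)` and `tile_index(5, 1, 32, 32)` give 0 and 1. Under this model, threads that share threadIdx.y land in the same tile, and threads that share threadIdx.x are spread across different tiles. For a 2 x 3 input, `column_sum_reference` returns 3 values, one per column, so any valid output position must be smaller than width.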
Thanks!