Support parallel split K mode for profiling #277
hwu36 merged 5 commits into NVIDIA:master from Peter9606:parallel_profiling_support
Conversation
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd and @kerrmudgeon, would you please help @Peter9606 here? Maybe point him to the conv parallel split-K code?
Hi @Peter9606, I have reviewed your code. It looks like you have followed the changes in convolutions to support parallel reductions. A few points to consider:
Thanks!
I finally get it, thank you very much!
1. find gemm kernel by preference key
2. switch m n for reduction kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd Now parallel split-K reduction profiling works for SIMT fp32, but only through the newly added reduction kernel with smaller alignment. It is still not clear why the larger-alignment reduction kernel cannot be selected.
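One plausible explanation (an editorial assumption, not confirmed anywhere in this thread): a vectorized reduction kernel can only be selected when the contiguous extent of the partial-sums workspace is divisible by its vector width, so for fp32 the 128-bit (4-element) kernel would be rejected for an extent of 242 and only a narrower kernel would qualify. A minimal sketch of such a selection rule:

```cpp
// Sketch only, not CUTLASS's actual dispatch code: return the widest
// power-of-two alignment (in elements) that divides the contiguous
// extent of the reduction workspace.
int widest_feasible_alignment(int extent_elements, int max_alignment) {
  int align = max_alignment;
  while (align > 1 && extent_elements % align != 0) {
    align /= 2;
  }
  return align;
}

// For fp32 with a 128-bit (4-element) maximum vector and an extent of 242:
// widest_feasible_alignment(242, 4) == 2, so the 4-wide kernel is skipped.
```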
This version seems to run successfully for fp16 with alignment 1/2/4.
Now it can support both fp32 and fp16. It still requires 128-bit alignment in the reduction. In this PR, I removed the small-alignment reduction code, which requires some extra logic to find the correct reduction configuration. To do this, we need to put the alignment into the FunctionalKey or PreferenceKey of the reduction operation and use problem_size.m to decide the correct reduction kernel to use. We welcome the community to extend this PR to support small-alignment reduction.
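A minimal sketch of the extension described above, under stated assumptions: ReductionKey and choose_reduction below are hypothetical illustrations, not existing types in cutlass::library. The idea is to carry the alignment in the reduction operation's lookup key and use problem_size.m to pick the widest kernel the problem admits.

```cpp
#include <map>
#include <tuple>

// Hypothetical key: shaped like the library's FunctionalKey/PreferenceKey,
// but with an explicit alignment field, as suggested in the comment above.
struct ReductionKey {
  int element_workspace;  // numeric ids standing in for the library's enums
  int element_output;
  int alignment;          // vector width of the reduction kernel, in elements

  bool operator<(ReductionKey const &rhs) const {
    return std::tie(element_workspace, element_output, alignment) <
           std::tie(rhs.element_workspace, rhs.element_output, rhs.alignment);
  }
};

struct Operation;  // stand-in for cutlass::library::Operation

// Walk down from the widest alignment to the first kernel that is both
// registered in the table and divides problem_size.m.
Operation const *choose_reduction(
    std::map<ReductionKey, Operation const *> const &table,
    ReductionKey key, int problem_m, int max_alignment) {
  for (int align = max_alignment; align >= 1; align /= 2) {
    if (problem_m % align != 0) {
      continue;
    }
    key.alignment = align;
    auto it = table.find(key);
    if (it != table.end()) {
      return it->second;
    }
  }
  return nullptr;
}
```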
* Support parallel split K mode for profiling
  Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* Parallel Split K support
  1. find gemm kernel by preference key
  2. switch m n for reduction kernel
  Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* parallel splitk for fp16 gemm
* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
I'm trying to add support for parallel profiling, and this patch is what I modified. Unfortunately, it only works for the very small portion of problem sizes whose m equals n. Also, to make it work, I had to hard-code the number of elements computed per operation during the epilogue to 1, which is obviously not correct (see the sketch after the command line below). I hope someone can correct it. A command-line sample if anyone wants to try it:
./cutlass_profiler --split_k_slices=2 --m=242 --n=242 --k=300 --split_k_mode=parallel
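For reference: --split_k_slices=2 partitions the K dimension into two partial products, and --split_k_mode=parallel makes the profiler launch the separate reduction kernel that accumulates them into the final output.

As for the hard-coded epilogue mentioned above, a sketch of what it might look like (assuming the default linear-combination epilogue; the surrounding kernel configuration is omitted): the second template argument of cutlass::epilogue::thread::LinearCombination is the number of elements computed per operation, and forcing it to 1 scalarizes the epilogue, which is why it runs but is flagged as incorrect.

```cpp
#include "cutlass/epilogue/thread/linear_combination.h"

// Sketch of the workaround described in the PR text: Count (the second
// template argument) is the number of elements computed per operation,
// hard-coded here to 1. The proper value should match the output tile's
// vector width, e.g. 128 / cutlass::sizeof_bits<ElementOutput>::value.
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    float,  // ElementOutput
    1,      // Count: elements computed per operation, forced to 1
    float,  // ElementAccumulator
    float   // ElementCompute
>;
```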