Support parallel split K mode for profiling #277
hwu36 merged 5 commits into NVIDIA:master from Peter9606:parallel_profiling_support
Conversation
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd and @kerrmudgeon, would you please help @Peter9606 here? Maybe point him to the conv parallel split-K code?
Hi @Peter9606, I have reviewed your code. It looks like you have followed the changes in convolutions to support parallel reductions. A few points to consider:
Thanks!
I finally get it, thank you very much!
1. find gemm kernel by preference key
2. switch m n for reduction kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd Now parallel split-K reduction profiling works for SIMT fp32, but only through the newly added reduction kernel with smaller alignment. It is still not clear why the larger-alignment reduction kernel cannot be selected.
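One plausible explanation (an editorial assumption, not confirmed anywhere in this thread): a vectorized reduction kernel can only be selected when the contiguous extent of the partial-sums workspace is divisible by its vector width, so for fp32 the 128-bit (4-element) kernel would be rejected for an extent of 242 and only a narrower kernel would qualify. A minimal sketch of such a selection rule:

```cpp
// Sketch only, not CUTLASS's actual dispatch code: return the widest
// power-of-two alignment (in elements) that divides the contiguous
// extent of the reduction workspace.
int widest_feasible_alignment(int extent_elements, int max_alignment) {
  int align = max_alignment;
  while (align > 1 && extent_elements % align != 0) {
    align /= 2;
  }
  return align;
}

// For fp32 with a 128-bit (4-element) maximum vector and an extent of 242:
// widest_feasible_alignment(242, 4) == 2, so the 4-wide kernel is skipped.
```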
This version seems to run successfully for fp16 with alignment 1/2/4.
Now it can support both fp32 and fp16. It still requires 128-bit alignment in the reduction. In this PR, I removed the small-alignment reduction code, which requires some extra logic to find the correct reduction configuration. To do this, we need to put the alignment into the FunctionalKey or PreferenceKey of the reduction operation and use problem_size.m to decide the correct reduction kernel to use. We welcome the community to extend this PR to support small-alignment reduction.
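A minimal sketch of the extension described above, under stated assumptions: ReductionKey and choose_reduction below are hypothetical illustrations, not existing types in cutlass::library. The idea is to carry the alignment in the reduction operation's lookup key and use problem_size.m to pick the widest kernel the problem admits.

```cpp
#include <map>
#include <tuple>

// Hypothetical key: shaped like the library's FunctionalKey/PreferenceKey,
// but with an explicit alignment field, as suggested in the comment above.
struct ReductionKey {
  int element_workspace;  // numeric ids standing in for the library's enums
  int element_output;
  int alignment;          // vector width of the reduction kernel, in elements

  bool operator<(ReductionKey const &rhs) const {
    return std::tie(element_workspace, element_output, alignment) <
           std::tie(rhs.element_workspace, rhs.element_output, rhs.alignment);
  }
};

struct Operation;  // stand-in for cutlass::library::Operation

// Walk down from the widest alignment to the first kernel that is both
// registered in the table and divides problem_size.m.
Operation const *choose_reduction(
    std::map<ReductionKey, Operation const *> const &table,
    ReductionKey key, int problem_m, int max_alignment) {
  for (int align = max_alignment; align >= 1; align /= 2) {
    if (problem_m % align != 0) {
      continue;
    }
    key.alignment = align;
    auto it = table.find(key);
    if (it != table.end()) {
      return it->second;
    }
  }
  return nullptr;
}
```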
* Support parallel split K mode for profiling
  Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* Parallel Split K support
  1. find gemm kernel by preference key
  2. switch m n for reduction kernel
  Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* parallel splitk for fp16 gemm
* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
I'm trying to add support for parallel profiling, and this patch is what I modified. Unfortunately, it only works for the very small portion of problem sizes whose m equals n. Also, to make it work, I had to hard-code the number of elements computed per operation during the epilogue to 1, which is obviously not correct (see the sketch after the command line below). I hope someone can correct it. A command-line sample if anyone wants to try it:
./cutlass_profiler --split_k_slices=2 --m=242 --n=242 --k=300 --split_k_mode=parallel
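For reference: --split_k_slices=2 partitions the K dimension into two partial products, and --split_k_mode=parallel makes the profiler launch the separate reduction kernel that accumulates them into the final output.

As for the hard-coded epilogue mentioned above, a sketch of what it might look like (assuming the default linear-combination epilogue; the surrounding kernel configuration is omitted): the second template argument of cutlass::epilogue::thread::LinearCombination is the number of elements computed per operation, and forcing it to 1 scalarizes the epilogue, which is why it runs but is flagged as incorrect.

```cpp
#include "cutlass/epilogue/thread/linear_combination.h"

// Sketch of the workaround described in the PR text: Count (the second
// template argument) is the number of elements computed per operation,
// hard-coded here to 1. The proper value should match the output tile's
// vector width, e.g. 128 / cutlass::sizeof_bits<ElementOutput>::value.
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    float,  // ElementOutput
    1,      // Count: elements computed per operation, forced to 1
    float,  // ElementAccumulator
    float   // ElementCompute
>;
```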