
Conversation

@psrivas2
Contributor

@psrivas2 psrivas2 commented Apr 3, 2023

This PR improves CUTLASS compilation time by compiling a single CSourceModule instead of creating and compiling one for each kernel.

Creating and compiling a new CSourceModule for every function is quite slow, and it significantly slows down models with many functions offloaded to CUTLASS. Instead, we can generate a single CSourceModule and compile it once to produce a single runtime::Module.
This brings down the CUTLASS compilation time of large models such as SD UNet significantly (~30 min to ~4 min), with similar results on other large models.
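
The idea can be sketched as follows (a minimal illustration only: `gen_kernel_source` is a hypothetical stand-in for the actual CUTLASS codegen, which produces code and required headers per offloaded function):

```python
def gen_kernel_source(name):
    # Hypothetical stand-in for per-function codegen: returns the
    # generated code and the headers it needs.
    code = "void %s() { /* generated CUTLASS kernel call */ }\n" % name
    headers = ["cutlass/cutlass.h"]
    return code, headers

def build_single_source(kernel_names):
    # Concatenate all generated kernels (and their headers) into one
    # source string, to be compiled once into a single runtime module,
    # rather than one CSourceModule per kernel.
    headers, code = [], ""
    for name in kernel_names:
        f_code, op_headers = gen_kernel_source(name)
        code += "\n" + f_code
        headers.extend(op_headers)
    header_lines = "".join("#include <%s>\n" % h for h in headers)
    return header_lines + code

source = build_single_source(["matmul_kernel", "conv2d_kernel"])
```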

Testing

tests/python/relax/test_codegen_cutlass.py::test_matmul_offload is broken at HEAD. With this PR, all other tests pass when run locally.

cc @masahi @vinx13

Improve cutlass compilation time by compiling a single CSourceModule
instead of one for each kernel.
@tvm-bot
Collaborator

tvm-bot commented Apr 3, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@github-actions github-actions bot requested review from masahi and vinx13 April 3, 2023 14:11
@vinx13 vinx13 merged commit 97ab25c into apache:unity Apr 3, 2023
@masahi
Member

masahi commented Apr 3, 2023

The original intention was to compile all generated files in parallel (via NVCC -t flag), but I forgot to actually do it. Have you tested that? I expect that would be faster than this solution.

@psrivas2
Contributor Author

psrivas2 commented Apr 3, 2023

> The original intention was to compile all generated files in parallel (via NVCC -t flag), but I forgot to actually do it. Have you tested that? I expect that would be faster than this solution.

Could you elaborate on what the -t flag does and how we would use it? The loop here processes annotated functions sequentially, so I think we would still have to parallelize that.

I did parallelize this loop to compile the generated C source modules in parallel, but that wasn't faster than compiling a single file. The difference between the two was not huge, but compiling a single source module was a bit faster (~50 seconds for the single source module vs ~70 seconds for multiple C source modules compiled in parallel).
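
For reference, the per-module experiment can be sketched like this (`compile_one` is a hypothetical stand-in for invoking NVCC on one generated module; this is not the code used for the measurement):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_one(source_name):
    # Hypothetical stand-in for compiling one generated C source module
    # with NVCC and returning the resulting runtime module.
    return "compiled(%s)" % source_name

sources = ["mod_a.cu", "mod_b.cu", "mod_c.cu"]
with ThreadPoolExecutor() as pool:
    # map() preserves input order, so results line up with sources.
    objects = list(pool.map(compile_one, sources))
```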

@masahi
Member

masahi commented Apr 3, 2023

The -t flag sets the number of threads NVCC uses, see https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#threads-number-t. This is used by the Relay BYOC to compile all files in parallel:

kwargs["options"].append("-t %d" % ncpu)

I wouldn't expect NVCC to use multiple threads to compile a huge single source, but the numbers you describe do sound good.
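
For completeness, passing the flag looks like this (a sketch mirroring the Relay BYOC line quoted above; the kwargs layout is an assumption used for illustration, not TVM's exact structure):

```python
import multiprocessing

# Sketch of how the -t flag gets passed to NVCC. The "options" list is
# a hypothetical stand-in for the compiler-option kwargs TVM forwards
# to NVCC.
ncpu = multiprocessing.cpu_count()
kwargs = {"options": ["-O3", "-arch=sm_80"]}
kwargs["options"].append("-t %d" % ncpu)
# NVCC can then use up to ncpu threads, e.g. when compiling multiple
# device translation units or gencode targets in one invocation.
```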

@masahi
Member

masahi commented Apr 3, 2023

Actually, since compile_cutlass_module is also used by the Relax BYOC, I think we are already making use of the -t flag. And putting all sources into a single source module is the right solution to really benefit from multi-threaded compilation.

auto [f_code, op_headers] = GenCutlassFunc(f, options);
code += "\n" + f_code;
for (const auto& header : op_headers) {
  headers += "#include <" + header + ">\n";
}
Member

Here we might be adding duplicate headers. It probably won't matter for compilation speed, but the generated file might get ugly.

Contributor Author

Yes. However, since this is a generated file, I felt it is okay to have duplicate header entries. We can improve on it in follow-up PRs, though.
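
A possible follow-up dedup, preserving first-seen order (a sketch, not code from this PR):

```python
def dedup_headers(headers):
    # Drop duplicate header entries while preserving first-seen order,
    # so the generated file includes each header exactly once.
    seen = set()
    out = []
    for h in headers:
        if h not in seen:
            seen.add(h)
            out.append(h)
    return out

print(dedup_headers(["cutlass/cutlass.h", "cutlass/gemm.h", "cutlass/cutlass.h"]))
# → ['cutlass/cutlass.h', 'cutlass/gemm.h']
```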
