[Unity][BYOC] Faster cutlass codegen #14465
Improve cutlass compilation time by compiling a single CSourceModule instead of one for each kernel.
The original intention was to compile all generated files in parallel (via NVCC |
Could you elaborate what I did parallelize this loop to compile the generated C source modules in parallel, but that wasn't faster than compiling a single file. The difference between the two was not huge, but compiling a single source module was a bit faster (~50 seconds for a single source module vs. ~70 seconds for multiple C source modules in parallel).
tvm/python/tvm/contrib/cutlass/build.py Line 75 in 5562d90
I don't expect NVCC to use multiple threads to compile a single huge source file, but the numbers you described do sound good.
Actually, since |
auto [f_code, op_headers] = GenCutlassFunc(f, options);
code += "\n" + f_code;
for (const auto& header : op_headers) {
  headers += "#include <" + header + ">\n";
Here we might be adding duplicated headers. It probably won't matter for compilation speed but the generated file might get ugly.
Yes; however, since this is a generated file, I felt it was OK to have duplicate header entries. We can improve on this in follow-up PRs, though.
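For reference, one way the duplicate `#include` lines discussed above could be avoided is to track which headers have already been emitted. The following is only a minimal sketch: `EmitUniqueIncludes` is a hypothetical helper, not a function in TVM, and it assumes each offloaded function reports its headers as a list of strings.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of TVM): emits one #include line per
// distinct header, preserving the order in which headers are first seen.
std::string EmitUniqueIncludes(
    const std::vector<std::vector<std::string>>& headers_per_func) {
  std::unordered_set<std::string> seen;
  std::string out;
  for (const auto& op_headers : headers_per_func) {
    for (const auto& header : op_headers) {
      // insert().second is true only the first time a header is inserted.
      if (seen.insert(header).second) {
        out += "#include <" + header + ">\n";
      }
    }
  }
  return out;
}
```

Since duplicate includes are harmless for headers with include guards, this would only tidy the generated file, as noted in the review.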
This PR improves cutlass compilation time by compiling a single CSourceModule instead of creating and compiling one for each kernel.
Creating and compiling a new CSourceModule for every function is slow, and it significantly slows down the build of models with multiple functions offloaded to cutlass. Instead, we can generate a single CSourceModule and compile it once to produce a single runtime::Module. This brings down the cutlass compilation time of large models like SD Unet significantly (from ~30 min to ~4 min). Other large models show similar results.
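The idea can be illustrated with a small standalone sketch. This is not the PR's actual code: `GeneratedKernel` and `BuildSingleSource` are hypothetical names standing in for the per-function codegen result and the accumulation loop, and the sketch only shows how sources are concatenated into one translation unit.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the result of per-function code generation
// (generated source text plus the headers it needs).
struct GeneratedKernel {
  std::string code;
  std::vector<std::string> headers;
};

// Concatenates all kernels into one source string, so the compiler is
// invoked once on a single file instead of once per kernel.
std::string BuildSingleSource(const std::vector<GeneratedKernel>& kernels) {
  std::string headers;
  std::string code;
  for (const auto& kernel : kernels) {
    code += "\n" + kernel.code;
    for (const auto& header : kernel.headers) {
      // Headers are appended as-is; duplicates are not filtered here.
      headers += "#include <" + header + ">\n";
    }
  }
  return headers + code;
}
```

The resulting string would then be handed to a single source module and compiled once.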
Testing
tests/python/relax/test_codegen_cutlass.py::test_matmul_offload is broken at HEAD. This PR passes all other tests when tested locally.
cc @masahi @vinx13