Refactoring Fusion Executor, pulling out compiled kernel #3468
…d in executor::compile.
…need to run compileRTC and runRTC calls in tests with CompiledKernel instances directly. Also fix profiling calls in KernelExecutor.
…owering but is now checked after lowering.
|
!test |
csarofeen
left a comment
Quite a few TODOs in this PR, I might not take on all of them in this PR.
buffer << cuda_src.rdbuf();
return buffer.str();
}
} // namespace
There was a problem hiding this comment.
Everything above this is only code motion.
int register_spills = -1;
};

// Returns executable function and the ptxas log from compilation
There was a problem hiding this comment.
Moved to compiled_kernel.cpp
kir::Kernel* kernel,
ExpressionEvaluator& expr_eval);

//! Query the target GPU version number NVRTC compiles CUDA kernels for
There was a problem hiding this comment.
Moved to compiled_kernel.cpp
There was a problem hiding this comment.
Actually these are likely just removed as they should now be contained in runtime/compiled_kernel.cpp
#include <cstdlib>

namespace nvfuser {
There was a problem hiding this comment.
Moved to compiled_kernel.cpp
namespace nvfuser {

namespace executor_utils {
There was a problem hiding this comment.
Moved to compiled_kernel.cpp
|
TODO list:
I'm leaving these to consider in the future.
|
|
!test |
|
47 successful checks |
…untime_id, group_id, and device constant in CompiledKernel.
|
!test |
|
!test |
|
Getting a bunch of Thunder failures: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/133353224 I was able to reproduce one of them on main, so I'm uncertain what's going on. The Clang build was the only other test to fail. Follow-up: Thunder failures reproduced on main at this point: #3698
naoyam
left a comment
LGTM. The Thunder failures are unlikely to have anything to do with this PR.
|
!test |
|
!test |
PR Reviewer Guide 🔍(Review updated until commit 107123a)Here are some key observations to aid the review process:
|
csrc/runtime/compiled_kernel.h
return lowered_->kernel();
}

Fusion* fusion() const {
There was a problem hiding this comment.
I believe this isn't any more useful than kernel() so I removed this in #3725.
csrc/runtime/compiled_kernel.h
// function to query whether a `CompiledKernel` has a compiled kernel to
// execute
bool hasCompiledKernel() const {
There was a problem hiding this comment.
https://github.com/NVIDIA/Fuser/pull/3725/files#diff-31a5ef26405804394f573b42c15512e1fb87f930fe7e5bfd95ad4034d867c30fR147 has merged isCompiled and hasCompiledKernel. Having both used to be important for a FusionExecutor which may or may not be a kernel executor -- no longer after your ExecutorAbstract work.
There was a problem hiding this comment.
Changed to reflect these changes. (Will push in a minute)
}
// Lowered is needed to compute launch parameters as it uses the CA map. We
// could modify that, but simply generating that part first.
compiled_kernel_ = std::make_unique<CompiledKernel>(
There was a problem hiding this comment.
I'm trying to understand the supposed responsibility between KernelExecutor::compile and CompiledKernel::compile. With this PR, KernelExecutor::compile appears to become a thin wrapper of CompiledKernel::compile that merely does profiling and overrides some compilation/launch parameters. Do you plan to get rid of KernelExecutor::compile so FusionKernelRuntime doesn't need to create a KernelExecutor to compile?
There was a problem hiding this comment.
I was wondering the same thing, and I think ideally yes. If I recall correctly the challenge comes down to requiring runtime information to compile. It makes it a bit awkward in my opinion as it'd be nice to have a clearer boundary between compilation and runtime, but we need runtime information for compilation. What do you think?
There was a problem hiding this comment.
Having a clear boundary between compilation and execution is useful for host IR. Currently, we have the LaunchKernel host IR instruction that runs a compiled KernelExecutor. It should really be CompiledKernel instead because KernelExecutor comes with lots of code and states for compilation that are useless for execution.
Ideally, I'd like the following:
- a `KernelCompiler` that takes a fusion segment and compilation parameters (which can be derived from runtime information), schedules the segment, device-lowers it, and generates a `CompiledKernel`. In this PR, the fact that `CompiledKernel` also "compiles" is kind of weird, but I'm sure that's due to some practical challenges as you just said.
- a `LaunchKernel` host IR instruction that takes a `CompiledKernel` and execution parameters and runs it.

`KernelExecutor` can be the non-host-IR orchestrator that drives the process from `KernelCompiler` to `CompiledKernel` to run that kernel. After host IR becomes the only path, we can and probably should get rid of `KernelExecutor`.
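The split proposed above could look roughly like the following. This is only a sketch under loud assumptions: `KernelCompilerSketch`, `CompiledKernelSketch`, `launchKernelSketch`, and the two parameter structs are hypothetical stand-ins and not nvFuser's actual classes, and "compilation" is reduced to producing a placeholder string.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// All names here are hypothetical stand-ins, not nvFuser's real API.

// Parameters fixed at compile time (possibly derived from runtime info).
struct CompileParams {
  int max_threads_per_block = 128;
};

// Parameters that are only needed when launching.
struct LaunchParams {
  int threads_per_block = 128;
};

// Execution artifact: carries only what a launch needs, no compiler state.
class CompiledKernelSketch {
 public:
  CompiledKernelSketch(std::string code, CompileParams params)
      : code_(std::move(code)), params_(params) {}

  const std::string& code() const { return code_; }
  int maxThreadsPerBlock() const { return params_.max_threads_per_block; }

 private:
  std::string code_;
  CompileParams params_;
};

// Compilation side: scheduling, lowering and codegen would live here.
class KernelCompilerSketch {
 public:
  std::unique_ptr<CompiledKernelSketch> compile(
      const std::string& segment_name,
      const CompileParams& params) const {
    // Placeholder for schedule + device lowering + code generation.
    std::string code = "// kernel for " + segment_name;
    return std::make_unique<CompiledKernelSketch>(std::move(code), params);
  }
};

// Execution side: consumes a compiled kernel plus launch parameters.
// Returns false when the launch exceeds what the kernel was compiled for.
inline bool launchKernelSketch(
    const CompiledKernelSketch& kernel,
    const LaunchParams& launch) {
  return launch.threads_per_block <= kernel.maxThreadsPerBlock();
}
```

The point of the sketch is that the launch function only sees the compiled artifact, so none of the compiler's state has to travel with it.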
There was a problem hiding this comment.
Sounds reasonable, how should we handle needed recompilation under dynamic workloads when there's a heuristic match?
Should this also be represented in HostIR? Should we have a "maybe recompile" node?
There was a problem hiding this comment.
TIL: I didn't know about KernelExecutor::recompileKernel. AFAICT, it recompiles CUDA with a new block size and/or a new max number of registers, but doesn't re-schedule, re-lower or re-codegen.
I'd keep this "mini-caching" in CompiledKernel. We could represent this recompilation in host IR, but I don't see a practical benefit -- having a "maybe recompile" node doesn't enable more host IR optimization or make the resulting host program faster. What do you think?
There was a problem hiding this comment.
I think your plan is good. I think this PR would help as it's a step towards isolating compilation from execution. I think 1 is doable with your plan. I think you will still need a maybe recompile in the execution somewhere if launch parameters change. Right now it's being called in FusionExecutor::runFusion: https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L1269
The reason we do this is the same reason why someone might use `__launch_bounds__` on a CUDA kernel. The first time we get a problem, we might compile with one register limit in mind, based on the number of threads we first compute are needed. If that number goes up because it's input dependent, then we may need to recompile; otherwise the kernel won't fit on the SM.
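A minimal sketch of the "maybe recompile" check described here, assuming the only thing tracked is the largest block size the current binary was built for. `RecompilingKernelSketch` and its members are hypothetical and not the actual `recompileKernel` implementation; the real path would rebuild the CUDA binary, whereas this sketch only counts rebuilds.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch: the generated source is kept as-is, and only the
// binary is rebuilt when a launch needs more threads than it was built for.
class RecompilingKernelSketch {
 public:
  explicit RecompilingKernelSketch(std::string code)
      : code_(std::move(code)) {}

  // Makes sure the kernel is compiled for at least `threads_per_block`.
  // Returns true if a (re)compilation was triggered.
  bool maybeRecompile(int threads_per_block) {
    if (threads_per_block <= compiled_max_threads_) {
      return false; // the existing binary already covers this launch
    }
    // Stand-in for recompiling the same source with a new thread bound;
    // no re-scheduling, re-lowering or re-codegen happens here.
    compiled_max_threads_ = threads_per_block;
    ++num_compiles_;
    return true;
  }

  int numCompiles() const { return num_compiles_; }
  int compiledMaxThreads() const { return compiled_max_threads_; }
  const std::string& code() const { return code_; }

 private:
  std::string code_;
  int compiled_max_threads_ = 0; // 0 means "not compiled yet"
  int num_compiles_ = 0;
};
```

Called before every launch, the check is a cheap integer comparison in the common case and only pays for compilation when the input-dependent thread count grows.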
|
!test |
PR #3468 changed to using `CompiledKernel` and in the shuffle, we switched from using the incoming fusion before compilation to the lowered `kir::Kernel`. This PR just moves the printing to occur just before lowering, inside the constructor for `CompiledKernel`. I believe this is enough to restore the previous behavior. Fixes #3765
…3769) PR #3468 changed to using `CompiledKernel` and in the shuffle, we switched from using the incoming fusion before compilation to the lowered `kir::Kernel`. This PR just moves the printing to occur just before lowering, inside the constructor for `CompiledKernel`. I believe this is enough to restore the previous behavior. Fixes #3765
Generated CUDA files were previously named like `__tmp_kernel_32.cu` and `compare_codegen.sh` matched that pattern when copying kernels. This PR changes that pattern to `__tmp_nvfuser_*.cu` instead, since these filenames were changed in #3468. This should fix the problems seen recently in codediff CI jobs.
Generated CUDA files were previously named like `__tmp_kernel_32.cu` and `compare_codegen.sh` matched that pattern when copying kernels. That broke codediff since these filenames were changed in #3468. This PR fixes `compare_codegen.sh` to match `__tmp_*.cu` instead. It also fixes the outputs of printed PTX and cubin files so that they use the same base filenames: currently that is the previous naming scheme `__tmp_kernel_32.cu` (if using `NVFUSER_ENABLE=static_fusion_count`). This should fix the problems seen recently in codediff CI jobs.
Pull out kernel compilation from the `KernelExecutor`, trying to separate out the two concepts as we will move towards a world where the execution of a kernel is done through `HostIr`.
- `CompiledKernel` class to hold compilation information in `compiled_kernel.h/cpp`
- `runtime/executor.h` to `runtime/compiled_kernel.h`
- `runtime/executor.cpp` to `runtime/compiled_kernel.cpp`
- `runtime/executor_utils.cpp` to `runtime/compiled_kernel.cpp` (these are functions only used in `compiled_kernel`)
- `sys/utils.cpp` to `runtime/compiled_kernel.cpp` (these are functions only used in `compiled_kernel`)
- `executor::compileRTC` and `executor::runRTC` into its own class (`RtcKernel`). It shares compilation logic with `CompiledKernel` and is in `compiled_kernel.h/cpp`

Some improvements left for another time:
- `executor_utils::CudaExecutable` and `CompiledKernel`. I'm not sure if this is the right thing to do, partially just because `RTCKernel` and `CompiledKernel` both own an `executor_utils::CudaExecutable`