Refactoring Fusion Executor, pulling out compiled kernel by csarofeen · Pull Request #3468 · NVIDIA/Fuser

csarofeen · 2024-11-24T23:18:31Z

Pull out kernel compilation from the KernelExecutor, trying to separate out the two concepts as we will move towards a world where the execution of a kernel is done through HostIr.

Made a CompiledKernel class to hold compilation information in compiled_kernel.h/cpp.

Moved code:
- from runtime/executor.h to runtime/compiled_kernel.h
- from runtime/executor.cpp to runtime/compiled_kernel.cpp
- from runtime/executor_utils.cpp to runtime/compiled_kernel.cpp (these are functions only used in compiled_kernel)
- from sys_utils.cpp to runtime/compiled_kernel.cpp (these are functions only used in compiled_kernel)

Moved executor::compileRTC and executor::runRTC into their own class (RtcKernel). It shares compilation logic with CompiledKernel and lives in compiled_kernel.h/cpp.

Some improvements left for another time:

- Don't disable the parameter cache completely when the size of a tensor is a function of an input scalar. I don't think this is particularly common as Thunder is mostly static shapes, but it might be good to support for resize ops.
- Merge executor_utils::CudaExecutable and CompiledKernel. I'm not sure if this is the right thing to do, partly because RtcKernel and CompiledKernel both own an executor_utils::CudaExecutable.

csarofeen · 2024-12-15T19:53:12Z

!test

csarofeen

Quite a few TODOs in this PR; I might not take on all of them here.

csrc/fusion.cpp

csarofeen · 2024-12-22T21:24:21Z

csrc/runtime/compiled_kernel.cpp

+  buffer << cuda_src.rdbuf();
+  return buffer.str();
+}
+} // namespace


Everything above this is only code motion.

csrc/runtime/compiled_kernel.cpp

csrc/runtime/executor_utils.h

csarofeen · 2024-12-22T22:03:56Z

csrc/runtime/executor_utils.h

  int register_spills = -1;
 };

-// Returns executable function and the ptxas log from compilation


Moved to compiled_kernel.cpp

csarofeen · 2024-12-22T22:04:02Z

csrc/runtime/executor_utils.h

    kir::Kernel* kernel,
    ExpressionEvaluator& expr_eval);

-//! Query the target GPU version number NVRTC compiles CUDA kernels for


Moved to compiled_kernel.cpp

Actually these are likely just removed as they should now be contained in runtime/compiled_kernel.cpp

csarofeen · 2024-12-22T22:04:43Z

csrc/sys_utils.cpp

 #include <cstdlib>

 namespace nvfuser {
-


Moved to compiled_kernel.cpp

csarofeen · 2024-12-22T22:04:54Z

csrc/sys_utils.cpp


 namespace nvfuser {

-namespace executor_utils {


Moved to compiled_kernel.cpp

csarofeen · 2024-12-22T22:21:55Z

csarofeen · 2024-12-25T17:09:03Z

!test

csarofeen · 2024-12-31T15:36:17Z

47 successful checks
https://nv/e2E/130642066

csarofeen · 2025-01-11T21:27:25Z

!test

csarofeen · 2025-01-11T21:40:53Z

!test

csarofeen · 2025-01-12T01:21:33Z

Getting a bunch of Thunder failures: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/133353224 I was able to reproduce one of them on main, so I'm uncertain what's going on.

Clang build was the only other test to fail.

Follow up: Thunder failures reproduced on main at this point: #3698

naoyam

LGTM. The Thunder failures are unlikely to have anything to do with this PR.

csarofeen · 2025-01-13T16:35:50Z

!test

csarofeen · 2025-01-20T20:32:43Z

!test

github-actions · 2025-01-20T20:38:17Z

PR Reviewer Guide 🔍

(Review updated until commit `107123a`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Potential Bug
The code has a potential bug in the `KernelExecutor` class. The `compiled_kernel_` member variable is not checked for null before being used in the `compile` method. This could lead to a segmentation fault if `compiled_kernel_` is null.
  const LaunchParams& launch_constraints,

Code Duplication
The code has duplicated logic in the `KernelExecutor` class. The `compile` method has duplicated code for checking the `compile_params.index_type` and setting the `index_type` member variable. This duplicated code can be extracted into a separate method to improve maintainability.
  } else if (has_cp_async_bulk) {

Potential Performance Issue
The code has a potential performance issue in the `KernelExecutor` class. The `run` method uses a recursive function call to execute the kernel, which can lead to stack overflow for large inputs. Consider using an iterative approach instead.
  group_id_);

wujingyue · 2025-01-21T05:06:50Z

csrc/runtime/compiled_kernel.h

+    return lowered_->kernel();
+  }
+
+  Fusion* fusion() const {


I believe this isn't any more useful than kernel() so I removed this in #3725.

wujingyue · 2025-01-21T05:08:54Z

csrc/runtime/compiled_kernel.h

+
+  // function to query whether a `CompiledKernel` has a compiled kernel to
+  // execute
+  bool hasCompiledKernel() const {


https://github.com/NVIDIA/Fuser/pull/3725/files#diff-31a5ef26405804394f573b42c15512e1fb87f930fe7e5bfd95ad4034d867c30fR147 has merged isCompiled and hasCompiledKernel. Having both used to be important for a FusionExecutor which may or may not be a kernel executor -- no longer after your ExecutorAbstract work.

Changed to reflect these changes. (Will push in a minute)

wujingyue · 2025-01-21T05:26:52Z

csrc/runtime/executor.cpp

-  }
+  // Lowered is needed to compute launch parameters as it uses the CA map. We
+  // could modify that, but simply generating that part first.
+  compiled_kernel_ = std::make_unique<CompiledKernel>(


I'm trying to understand the intended split of responsibility between KernelExecutor::compile and CompiledKernel::compile. With this PR, KernelExecutor::compile appears to become a thin wrapper around CompiledKernel::compile that merely does profiling and overrides some compilation/launch parameters. Do you plan to get rid of KernelExecutor::compile so FusionKernelRuntime doesn't need to create a KernelExecutor to compile?

I was wondering the same thing, and I think ideally yes. If I recall correctly the challenge comes down to requiring runtime information to compile. It makes it a bit awkward in my opinion as it'd be nice to have a clearer boundary between compilation and runtime, but we need runtime information for compilation. What do you think?

Having a clear boundary between compilation and execution is useful for host IR. Currently, we have the LaunchKernel host IR instruction that runs a compiled KernelExecutor. It should really be CompiledKernel instead because KernelExecutor comes with a lot of code and state for compilation that is useless for execution.

Ideally, I'd like the following:

1. a KernelCompiler that takes a fusion segment and compilation parameters (which can be derived from runtime information), schedules the segment, device-lowers it, and generates a CompiledKernel. In this PR, the fact that CompiledKernel also "compiles" is kind of weird, but I'm sure that's due to some practical challenges as you just said.

2. a LaunchKernel host IR instruction that takes a CompiledKernel and execution parameters and runs it.

KernelExecutor can be the non-host-IR orchestrator that drives the process from KernelCompiler to CompiledKernel to run that kernel. After host IR becomes the only path, we can and probably should get rid of KernelExecutor.
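
As a rough illustration of the split proposed here (this is not code from the PR), the sketch below uses toy stand-ins named after KernelCompiler, CompiledKernel, LaunchKernel, and KernelExecutor. Every signature, field, and the string returned by `run` is an assumption made for illustration; none of it mirrors the real nvFuser classes, and the "compilation" step is a placeholder.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical parameter structs; the real nvFuser types carry far more state.
struct CompileParams {
  int max_registers = 255;
};

struct LaunchParams {
  int grid_size = 1;
  int block_size = 128;
};

// Result of compilation: holds only what execution needs.
class CompiledKernel {
 public:
  CompiledKernel(std::string name, CompileParams params)
      : name_(std::move(name)), params_(params) {}
  const std::string& name() const { return name_; }
  const CompileParams& compileParams() const { return params_; }

 private:
  std::string name_;
  CompileParams params_;
};

// Compilation side: segment + compile parameters in, CompiledKernel out.
class KernelCompiler {
 public:
  std::unique_ptr<CompiledKernel> compile(
      const std::string& segment,
      const CompileParams& params) const {
    // Placeholder: scheduling, lowering, and NVRTC would happen here.
    return std::make_unique<CompiledKernel>("kernel_" + segment, params);
  }
};

// Execution side: knows nothing about how the kernel was produced.
class LaunchKernel {
 public:
  explicit LaunchKernel(const CompiledKernel* kernel) : kernel_(kernel) {
    assert(kernel_ != nullptr);
  }
  // Returns a description of the launch instead of touching a GPU.
  std::string run(const LaunchParams& lp) const {
    return kernel_->name() + "<<<" + std::to_string(lp.grid_size) + ", " +
        std::to_string(lp.block_size) + ">>>";
  }

 private:
  const CompiledKernel* kernel_;
};

// Non-host-IR orchestrator that drives compiler -> compiled kernel -> launch.
class KernelExecutor {
 public:
  std::string compileAndRun(
      const std::string& segment,
      const CompileParams& cp,
      const LaunchParams& lp) {
    compiled_ = compiler_.compile(segment, cp);
    return LaunchKernel(compiled_.get()).run(lp);
  }

 private:
  KernelCompiler compiler_;
  std::unique_ptr<CompiledKernel> compiled_;
};
```

In this shape, a host IR path could hold a CompiledKernel and a LaunchKernel directly, and the toy KernelExecutor becomes removable once nothing else calls it.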

Sounds reasonable. How should we handle needed recompilation under dynamic workloads when there's a heuristic match?

https://github.com/NVIDIA/Fuser/pull/3468/files#diff-4e98fb6577fe11f6854ee230d2ed5952b441fa277061180b00fa1265cf98173eR1527-R1552

Should this also be represented in HostIR? Should we have a "maybe recompile" node?

TIL: I didn't know about KernelExecutor::recompileKernel. AFAICT, it recompiles CUDA with a new block size and/or a new max number of registers, but doesn't re-schedule, re-lower or re-codegen.

I'd keep this "mini-caching" in CompiledKernel. We could represent this recompilation in host IR, but I don't see a practical benefit -- having a "maybe recompile" node doesn't enable more host IR optimization or make the resulting host program faster. What do you think?

I think your plan is good. I think this PR would help as it's a step towards isolating compilation from execution. I think 1 is doable with your plan. I think you will still need a maybe recompile in the execution somewhere if launch parameters change. Right now it's being called in FusionExecutor::runFusion: https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L1269

The reason we do this is the same reason someone might use `__launch_bounds__` on a CUDA kernel. The first time we see a problem, we might compile with one register limit in mind, based on the number of threads we initially compute as needed. If that number goes up because it's input dependent, we may need to recompile; otherwise the kernel won't fit on the SM.
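
To make the register argument concrete, here is a minimal sketch of a "maybe recompile" check. It is not the logic in KernelExecutor::recompileKernel: the class name, the method names, and the two hardware constants (65536 registers per SM, 255 registers per thread) are assumptions chosen for illustration, and real limits depend on the target GPU and on register allocation granularity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed hardware limits for illustration only; query the device in practice.
constexpr int64_t kRegistersPerSm = 65536;
constexpr int64_t kMaxRegistersPerThread = 255;

// Largest per-thread register budget that still lets one block of
// `block_size` threads fit in the register file.
int64_t registerBudget(int64_t block_size) {
  assert(block_size > 0);
  return std::min(kMaxRegistersPerThread, kRegistersPerSm / block_size);
}

// Hypothetical stand-in for the "mini-caching" discussed above: remember the
// largest block size compiled for, and recompile only when a launch exceeds it.
class RecompilingKernel {
 public:
  // Returns true if this call had to (re)compile.
  bool ensureCompiledFor(int64_t block_size) {
    if (block_size <= compiled_block_size_) {
      // The existing binary was built for at least this many threads,
      // so its register limit is still safe.
      return false;
    }
    compiled_block_size_ = block_size;
    max_registers_ = registerBudget(block_size);
    ++compile_count_; // placeholder for the actual recompilation
    return true;
  }

  int64_t maxRegisters() const { return max_registers_; }
  int compileCount() const { return compile_count_; }

 private:
  int64_t compiled_block_size_ = 0;
  int64_t max_registers_ = 0;
  int compile_count_ = 0;
};
```

Under these assumed limits, a kernel first compiled for 128 threads gets the full 255-register cap, while a later launch with 1024 threads forces a recompile with a budget of 64 registers per thread.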

csarofeen · 2025-01-25T21:09:14Z

!test

PR #3468 changed to using `CompiledKernel` and in the shuffle, we switched from using the incoming fusion before compilation to the lowered `kir::Kernel`. This PR just moves the printing to occur just before lowering, inside the constructor for `CompiledKernel`. I believe this is enough to restore the previous behavior. Fixes #3765

Generated CUDA files were previously named like `__tmp_kernel_32.cu` and `compare_codegen.sh` matched that pattern when copying kernels. This PR changes that pattern to `__tmp_nvfuser_*.cu` instead, since these filenames were changed in #3468. This should fix the problems seen recently in codediff CI jobs.

Generated CUDA files were previously named like `__tmp_kernel_32.cu` and `compare_codegen.sh` matched that pattern when copying kernels. That broke codediff since these filenames were changed in #3468. This PR fixes `compare_codegen.sh` to match `__tmp_*.cu` instead. It also fixes the outputs of printed PTX and cubin files so that they use the same base filenames: currently that is the previous naming scheme `__tmp_kernel_32.cu` (if using `NVFUSER_ENABLE=static_fusion_count`). This should fix the problems seen recently in codediff CI jobs.

Redraft pulling compiled kernel out of kernel executor.

6bdd7f5

csarofeen mentioned this pull request Nov 24, 2024

Refactoring Fusion Executor, pulling out compiled kernel #3082

Closed

csarofeen added 14 commits November 30, 2024 08:05

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

a7fca2e

Cleanup and preparation to cleanup executor_utils.h

dbb0554

Move compilation logic out of executor_utils into compiled_kernel.

b7c9e7c

Remove compilation profiling from compiled kernel as it's still called in executor::compile.

b61ac3a

cleanup

6177db2

Fix input binding in executor.

0e3cdd9

Kernel executor doesn't instantiate compiled kernel unless compiled, need to run compileRTC and runRTC calls in tests with CompiledKernel instances directly. Also fix profiling calls in KernelExecutor.

3961181

Fix type consistency in st matrix testing.

f2b00bb

Fix build.

56dda45

Need to be consistent with types for fusion.manage.

5999480

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

5920d34

Repair serialization.

a7ad429

Fix check that disables parameter cache, the check was valid before lowering but is now checked after lowering.

d89d155

Merge branch 'main' into compiled_kernel_2

2933509

csarofeen added 2 commits December 18, 2024 07:53

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

b222303

Merge.

239d652

csarofeen commented Dec 22, 2024

View reviewed changes

csarofeen added 6 commits December 23, 2024 06:40

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

8137228

Merge conflicts.

a9733b0

Fix lowering hooks, rename compileFusion to compile.

da1452c

Fix param cache check with validation.

e1fcd7f

Remove refactor validation.

b66e6c5

Merge branch 'main' into compiled_kernel_2

7dc8d09

Remove CompileOptions, make scheduler_type, fusion_id, concrete_id, runtime_id, group_id, and device constant in CompiledKernel.

90a76e4

Merge conflicts.

7ac7121

Pass block size to codegen.

2836d60

Revert constexpr-const change in tests/cpp/utils.h, needed for clang build.

df789cb

csarofeen mentioned this pull request Jan 12, 2025

[DO NOT REVIEW] Testing main #3698

Closed

csarofeen requested a review from naoyam January 12, 2025 23:52

naoyam approved these changes Jan 13, 2025

View reviewed changes

Merge branch 'main' into compiled_kernel_2

ffb901f

csarofeen mentioned this pull request Jan 20, 2025

Remove KernelExecutor::fusion_ #3725

Merged

csarofeen added 2 commits January 20, 2025 12:24

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

c5939f0

Fix merge conflicts.

9163101

wujingyue approved these changes Jan 21, 2025

View reviewed changes

csarofeen added 2 commits January 25, 2025 12:19

Merge branch 'main' of https://github.com/NVIDIA/Fuser into compiled_kernel_2

177fd0e

PR Comments.

107123a

csarofeen merged commit ba5c878 into main Jan 26, 2025
51 of 52 checks passed

csarofeen deleted the compiled_kernel_2 branch January 26, 2025 13:21

This was referenced Jan 26, 2025

Build failure (incomplete type of 'nvfuser::KernelExecutor') #3761

Closed

NVFUSER_DUMP=fusion_ir dumps Kernel IR #3765

Closed

jacobhinkle mentioned this pull request Jan 28, 2025

Print scheduled fusion instead of kernel for NVFUSER_DUMP=fusion_ir #3769

Merged

jacobhinkle mentioned this pull request Jan 29, 2025

Accomodate name change of printed kernel files #3778

Merged
