Remove ExpressionEvaluator from KernelExecutor::run #3952
Conversation
Review updated until commit c23ca58
!test

…ason it's not synced correctly.

!test

!test

Besides clang build failures, all tests are passing here.
csrc/runtime/executor.cpp
Outdated
    buffer_info.shape_info.unsharded_logical_sizes.size()
        ? buffer_info.shape_info.unsharded_logical_sizes
        : buffer_info.shape_info.logical_sizes;
const auto& alloc_stride = buffer_info.shape_info.allocation_strides;

Turning args into binaries is why I added input global buffer infos.
csrc/runtime/executor.cpp
Outdated
// Initialize the executor entry if not initialized
if (!executor_entry->init) {
  // std::cout << "Initializing executor entry" << std::endl;
    "Expected blockDim.x >= 32 but found ",
    launch_params.bdimx());

std::vector<GlobalBufferInfo> input_info;

Construct input buffer infos as they're used to construct the binary args for the kernel.
// Reset the arguments that we'll pass to cuLaunchKernel. This needs to be
// invoked on every shape change.
void KernelExecutor::recomputeArgs(

computeArgs takes <1us, so this isn't a worthwhile optimization at this point.
csrc/host_ir/executor.cpp
Outdated
- std::vector<GlobalBufferInfo> output_info = getBufferInfos(
+ std::vector<GlobalBufferInfo> output_infos = getBufferInfos(
      expr_eval, PrimDataType::Int, host_ir_container_->outputs());
  output_args.resize(host_ir_container_->outputs().size());

Yes, thank you for catching this.
// Resolving TMA requires binding all values and evaluating the TMA
// arguments
args = resolveTMA(*executor_entry, args);
} else if (has_rng_) {

So we cannot use TMA and RNG in the same kernel?

resolveTMA will resolve RNG values too. resolveTMA actually resolves just about any argument that's missing values.

In the future I'd like to clean up the resolution of missing params to only bind the inputs necessary for their evaluation. That would help unify the different runtime resolutions necessary (dynamic alias, rng, tma, and anything else we come up with). I'd similarly like to do something like this for the expression evaluator executor/host IR.

This makes sense. For now, could you update the comment in code to say that this also resolves RNG? Otherwise this looks like a bug.
jacobhinkle left a comment:

Only superficial comments so far.
csrc/runtime/allocations.cpp
Outdated
  DataType index_dtype,
  const std::vector<Val*>& fusion_outputs) {
-   FUSER_PERF_SCOPE("fusion_executor::allocations::getOutbufferInfo");
+   FUSER_PERF_SCOPE("fusion_executor::allocations::getOutbufferInfos");

This function is actually called getBufferInfos.
// slice, and that slice has a symbolic integer it depends on, then this
// function returns true.
//
// This could happen for other examples, and this function will return true if,
// to evaluate the output that has an alias, other values besides the aliased
// input need to be bound to the expression evaluator to evaluate the output.

Nit: "this function returns true" should be "we set has_dynamic_alias_ to true".

Thanks; I pulled it out into a function, then reverted it and forgot to update the comment.
csrc/runtime/executor.cpp
Outdated
  }
} // namespace

// TODO: Reduce bindings to only those necessaary to resolve missing params.

- // TODO: Reduce bindings to only those necessaary to resolve missing params.
+ // TODO: Reduce bindings to only those necessary to resolve missing params.
csrc/runtime/executor.cpp
Outdated
    continue;
  }
}
if (out_tv->isConst()) {

I think isConst on a tensor is always false. What is this part supposed to do?

Good catch! This should be input->isConst(), not out_tv->isConst().

It should be in the loop above.
csrc/runtime/allocations.cpp
Outdated
auto alias_info = tv->fusion()->getOutputAlias(tv);
if (alias_info.type == AllocationType::Evaluate) {
  auto val = expr_eval.evaluate(tv);
  NVF_ERROR(val.hasValue() && val.is<at::Tensor>(), "Output is not a Tensor");

nit: the hasValue check is redundant.

- NVF_ERROR(val.hasValue() && val.is<at::Tensor>(), "Output is not a Tensor");
+ NVF_ERROR(val.is<at::Tensor>(), "Output is not a Tensor");

I didn't realize that, thanks.

Removed redundant check.
csrc/runtime/compiled_kernel.cpp
Outdated
NVF_ERROR(
    input.is<at::Tensor>() && input.as<at::Tensor>().is_cuda(),
    "Only CUDA tensors are supported for direct nvRTC launches at this time.");
if (input.is<at::Tensor>() && input.as<at::Tensor>().is_cuda()) {

nit: This check is redundant.
csrc/runtime/executor.cpp
Outdated
info.shape_info.allocation_sizes = sizes;
info.shape_info.allocation_strides = strides;
info.shape_info.logical_sizes = sizes;
info.shape_info.logical_strides = strides;

Instead of making a copy of sizes into both allocation_sizes and logical_sizes, could we just leave allocation_sizes empty and say that "if allocation size is empty, it is equal to logical size"? This way we save a copy.
If you believe this micro-optimization is unimportant, we can at least replace the last lines with:
info.shape_info.logical_sizes = std::move(sizes);
info.shape_info.logical_strides = std::move(strides);

I think it's a reasonable approach; that's what I did for the unsharded logical sizes. If it's quick enough I'll do it in this PR; if not, I'll leave it for the future.
  out_tensors[out_idx] = inp;
} else if (
    fusion->getOutputAlias(out_info.tv).type == AllocationType::Evaluate) {
  if (dynamic_evaluate) {

I understand this PR tries to avoid expression evaluation on the host path, which is great. But I couldn't understand what this flag gives us.
AFAICT from https://github.com/NVIDIA/Fuser/pull/3952/files#diff-fc385708d95b79cfddb58136479569b5d0aa608458bcb19759afedb788f308ddR1081, with this PR, the logic is:
if the fusion has dynamic aliases:
    skip evaluating dynamic aliases in allocateOutputs
    instead, evaluate them after allocateOutputs
else:
    allocate/evaluate all outputs in allocateOutputs
    # Because there are no dynamic aliases, we are sure this won't call ExpressionEvaluator
Can we simply set dynamic_alias always to true and do the if-then branch in the callers of allocateOutputs? This way, dynamic aliases (aka AllocationType::Evaluate) are always computed outside allocateOutputs.
Alternatively, can we simply set dynamic_alias to false and do the if-else branch in the callers? This way, dynamic aliases are always computed inside allocateOutputs, as before. I don't see much difference between evaluating dynamic aliases inside vs. outside: they both call ExpressionEvaluator to bind inputs and evaluate outputs.
Note: HostIrExecutor is for communication only, so the fusion it runs always has (or at least should have) AllocationType::New.
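The allocate-then-defer flow being discussed can be sketched with toy stand-ins for tensors and the evaluator. Everything here is illustrative, not nvFuser's actual API: `Output` and the integer "values" merely model whether an output has been materialized yet.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model: an output either is a dynamic alias or not, and may or may
// not have been materialized yet.
struct Output {
  bool is_dynamic_alias = false;
  std::optional<int> value;  // stands in for an allocated/evaluated tensor
};

// allocateOutputs either handles everything itself, or (when the fusion
// has dynamic aliases) leaves those outputs unset for the caller.
void allocateOutputs(std::vector<Output>& outs, bool has_dynamic_alias) {
  for (auto& o : outs) {
    if (o.is_dynamic_alias && has_dynamic_alias) {
      continue;  // deferred: caller will bind all inputs and evaluate
    }
    o.value = 1;  // cheap path: allocate, or bind just the aliased input
  }
}

int main() {
  std::vector<Output> outs{{false, {}}, {true, {}}};

  // Fusion with dynamic aliases: the aliased output is deferred...
  allocateOutputs(outs, /*has_dynamic_alias=*/true);
  assert(outs[0].value.has_value() && !outs[1].value.has_value());

  // ...then evaluated afterwards with the full expression evaluator.
  for (auto& o : outs) {
    if (!o.value) {
      o.value = 2;  // stands in for bind-all-inputs + evaluate
    }
  }
  assert(outs[1].value.has_value());
  return 0;
}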
Just to make sure I follow correctly, the logic in KernelExecutor::run is:
allocate outputs (dynamic alias):
    allocate new buffers
    alias reuse buffers
    if not dynamic alias:
        bind one input alias
        evaluate output alias
if dynamic alias:
    bind all inputs
    evaluate any missing outputs
I'm not sure if I understood correctly, so apologies if I'm missing something, but I think the reason we send dynamic alias to allocateOutputs is that if it's not true we can drastically simplify what needs to be bound to the expression evaluator to infer the outputs. With more work, the dynamic alias case could bind only what's necessary, but today when dynamic alias is true we bind everything. When non-dynamic alias, we still need to bind something, but I special-cased it to bind only the input, which I believe is far more efficient than binding all inputs and all sizes of each input.
This may be an uncommon special case, though I hope in the future we can improve the amount of bindings and evaluations for dynamic cases.

Are you suggesting we lift the if(!dynamic alias) case to executor.cpp? This would move all allocation::evaluate logic to be together; maybe that's what you're suggesting would help the logic.

If that's true for host ir executor we could remove the extra processing logic there. Do you think it will stay that way?
Thanks for clarifying! I realized I missed several important details:
- AllocationType::Evaluate includes more than "dynamic aliases". For example, ExprEvalScheduler marks matmul outputs as ::Evaluate, which alias to nothing.
- You believe that binding only input aliases is much faster than binding all inputs.
- Even when an output is ::Evaluate and has an aliased_io, computing it may require binding inputs other than aliased_io.
Based on these new understandings, I can't think of a simpler way than the current PR without significant changes.
One thing I might consider in the future is to set aliased_io only when binding just that is sufficient to compute the output. This way, we don't need a separate dynamic_alias parameter and the logic can be simplified to:
allocate outputs:
    allocate new buffers
    alias reuse buffers
    for each dynamic alias:
        if aliased_io is null:
            set output to monostate
        else:
            bind one input alias
            evaluate output alias
if any output is missing (i.e. monostate):
    bind all inputs
    evaluate any missing outputs
Anyhow, thanks for going through the convoluted existing logic here! Most of it came from me trying to support meta ops.
If that's true for host ir executor we could remove the extra processing logic there.
Yes, I believe https://github.com/NVIDIA/Fuser/pull/3952/files#diff-6f15ae1df14ce7b6b049b46305a6b7991731bdcd859fba15a09b63adb77ca670R151-R161 is unnecessary.
Do you think it will stay that way?
Yes. I don't plan to use HostIrExecutor more than launching communications. The ongoing host IR integration happens in HostIrEvaluator (not HostIrExecutor). I believe we will run into the same latency issue with ExpressionEvaluator when we try to turn on host IR by default. Yikes!
!test

!build
Removes as much expression evaluation as possible when matching inputs in KernelExecutor. Results for llama-based latency tests on H200-DGX are tracked in #3813. That PR also has CPU-based profiling results showing how much latency has been improved for KernelExecutor. This PR does not add any new functionality.
Graph 0:
Total time: 12ms -> 3.1ms
KernelExecutor::runFusion: 35.5us -> 10.5us
Graph 1:
Total time: 19.8ms -> 10.6ms
KernelExecutor::runFusion: 36.4us -> 13.1us
Graph 2:
Total time: 18.9ms -> 18.8ms
KernelExecutor::runFusion: 28.6us -> 11.1us
For Graph 2 we would need to improve ExprEvalExecutor, as it's taking up 60% of the runtime while KernelExecutor takes only 20%.