
Remove ExpressionEvaluator from KernelExecutor::run #3952

Merged

csarofeen merged 40 commits into main from executor_cleanup on Mar 5, 2025
Conversation

@csarofeen (Collaborator) commented Feb 24, 2025

Removes as much expression evaluation as possible on matching inputs to KernelExecutor. Results for Llama-based latency tests on H200-DGX are tracked in #3813. That PR also contains CPU-based profiling results showing how much latency has improved for KernelExecutor. This PR does not add any new functionality.

Graph 0:
Total time: 12ms -> 3.1ms
KernelExecutor::runFusion: 35.5us -> 10.5us

Graph 1:
Total time: 19.8ms -> 10.6ms
KernelExecutor::runFusion: 36.4us -> 13.1us

Graph 2:
Total time: 18.9ms -> 18.8ms
KernelExecutor::runFusion: 28.6us -> 11.1us

For Graph 2 we would need to improve ExprEvalExecutor, as it's taking up 60% of the runtime while KernelExecutor takes only 20%.

github-actions bot commented Feb 24, 2025

Review updated until commit c23ca58

Description

  • Removed expression evaluation for matching inputs in KernelExecutor::run.

  • Updated allocation logic to handle outputs with aliases and dynamic evaluation.

  • Refactored tensor argument handling and serialization in KernelExecutor.

  • Improved error handling for unsupported aliasing scenarios.


Changes walkthrough 📝

Relevant files

Enhancement (13 files)

  • matmul.cpp (+3/-6): Updated KernelExecutor::compile to use KernelArgumentHolder directly.
  • executor.cpp (+43/-23): Removed expression evaluation for outputs and improved output allocation.
  • allocations.cpp (+120/-158): Refactored allocation logic and removed intermediate allocation handling.
  • compiled_kernel.cpp (+9/-16): Updated argument handling for CUDA tensors.
  • executor.cpp (+414/-213): Enhanced argument binding and added support for dynamic aliases and TMA.
  • executor_kernel_arg.cpp (+55/-80): Removed toC10Array and updated tensor argument serialization.
  • executor_utils.cpp (+38/-0): Added function to get the output-alias-to-input mapping.
  • tensor_metadata.cpp (+3/-1): Updated metadata handling to use TensorShapeInfo.
  • allocations.h (+28/-18): Updated GlobalBufferInfo and added the TensorShapeInfo struct.
  • executor.h (+58/-44): Updated KernelExecutorEntry and added fields for alias handling.
  • executor_kernel_arg.h (+19/-8): Updated KernelArgumentHolder and added the tensorToBytes function.
  • executor_utils.h (+5/-0): Added function to get the output-alias-to-input mapping.
  • fusion_cache.fbs (+16/-7): Updated flatbuffer schemas for GlobalBufferInfo and KernelExecutorEntry.

Tests (1 file)

  • test_schedule_ops.py (+3/-0): Disabled test_matmul_auto_scheduler due to scheduler changes.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The resolveTMA function does not handle cases where the number of arguments does not match the number of inputs and outputs. This could lead to incorrect bindings or missing arguments.

KernelArgumentHolder KernelExecutor::resolveTMA(
    KernelExecutorEntry& entry,
    const KernelArgumentHolder& args) const {
  ExpressionEvaluator expr_eval;
  int64_t arg_idx = 0;
  NVF_ERROR(
      entry.inputs.size() == compiled_kernel_->kernel()->inputs().size(),
      "Input size mismatch");
  for (auto inp_idx : c10::irange(entry.inputs.size())) {
    expr_eval.bind(
        compiled_kernel_->kernel()->inputs()[inp_idx], args[arg_idx++]);
  }

  NVF_ERROR(
      entry.outputs.size() == compiled_kernel_->kernel()->outputs().size(),
      "Output size mismatch");
  for (auto out_idx : c10::irange(entry.outputs.size())) {
    expr_eval.bind(
        compiled_kernel_->kernel()->outputs()[out_idx], args[arg_idx++]);
  }

  for (const auto& intermediate_entry : entry.intermediates) {
    if (args[arg_idx].hasValue()) {
      expr_eval.bind(intermediate_entry.tv, args[arg_idx++]);
    }
  }

  KernelArgumentHolder resolved_args;
  for (auto param : compiled_kernel_->kernel()->parameters()) {
    resolved_args.push(expr_eval.evaluate(param));
  }
  return resolved_args;
}
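The missing-guard concern above can be sketched in isolation. The toy model below (bindAll and its parameters are illustrative, not nvFuser APIs) walks a flat argument list in the same input/output/intermediate order as resolveTMA and fails fast when the list cannot cover the required bindings; it also always advances past optional intermediate slots so a valueless slot cannot be re-checked forever.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy model of the binding walk in resolveTMA: arguments are consumed in
// input, output, then intermediate order. The up-front guard verifies that
// the flat argument list covers at least inputs + outputs, so a mismatched
// call fails loudly instead of binding the wrong slots.
// Returns how many arguments were consumed.
inline std::size_t bindAll(
    std::size_t n_inputs,
    std::size_t n_outputs,
    std::size_t n_intermediates,
    const std::vector<int>& args) {
  if (args.size() < n_inputs + n_outputs) {
    throw std::invalid_argument("arg count does not cover inputs + outputs");
  }
  std::size_t arg_idx = 0;
  for (std::size_t i = 0; i < n_inputs; ++i) {
    (void)args[arg_idx++];  // stand-in for expr_eval.bind(input[i], ...)
  }
  for (std::size_t i = 0; i < n_outputs; ++i) {
    (void)args[arg_idx++];  // stand-in for expr_eval.bind(output[i], ...)
  }
  // Intermediates are optional: consume only while arguments remain.
  for (std::size_t i = 0; i < n_intermediates && arg_idx < args.size(); ++i) {
    (void)args[arg_idx++];
  }
  return arg_idx;
}
```

With two inputs, one output, and one optional intermediate, a four-element list is fully consumed, a three-element list simply skips the intermediate, and a list shorter than inputs + outputs throws.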
Code Complexity

The allocateOutputs function has become quite complex with multiple conditional paths. It might be beneficial to refactor it into smaller, more manageable functions to improve readability and maintainability.

KernelArgumentHolder allocateOutputs(
    const Fusion* fusion,
    const std::vector<GlobalBufferInfo>& output_infos,
    const std::vector<int>& output_alias_to_input_map,
    const c10::Device& device,
    const KernelArgumentHolder& args,
    bool dynamic_evaluate) {
  FUSER_PERF_SCOPE("fusion_executor::allocations::allocateOutputs");

  KernelArgumentHolder out_tensors;
  out_tensors.resize(output_infos.size());
  for (auto out_idx : c10::irange(output_infos.size())) {
    auto out_info = output_infos.at(out_idx);
    if (output_alias_to_input_map.at(out_idx) == -1) {
      auto alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
      if (shouldFillAllocationWithNan()) {
        fillTensorWithNan(alloc_tensor);
      }
      out_tensors[out_idx] = alloc_tensor;
    } else if (
        fusion->getOutputAlias(out_info.tv).type ==
        AllocationType::ReuseBuffer) {
      const auto& inp = args[output_alias_to_input_map.at(out_idx)];
      NVF_ERROR(inp.is<at::Tensor>(), "Input is not a Tensor");
      out_tensors[out_idx] = inp;
    } else if (
        fusion->getOutputAlias(out_info.tv).type == AllocationType::Evaluate) {
      if (dynamic_evaluate) {
        out_tensors[out_idx] = std::monostate();
        continue;
      }

      ExpressionEvaluator ee;
      ee.bind(
          fusion->getOutputAlias(out_info.tv).aliased_io,
          args[output_alias_to_input_map.at(out_idx)]);
      out_tensors[out_idx] = ee.evaluate(out_info.tv);
    } else {
      NVF_THROW(
          "Unexpected allocation path, internal logic around allocations must be incorrect.");
    }
  }
  return out_tensors;
}
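The refactor suggested above could route each output through a small handler per allocation kind, leaving one dispatch point. This is a hedged sketch only: the enum values mirror the AllocationType cases in the excerpt, but the handler names in the comments (allocateNewOutput, reuseInputBuffer, evaluateAlias) are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch of splitting allocateOutputs into per-kind handlers. Each case in
// the original if/else chain becomes one small function; allocateOne is the
// only place that dispatches on the allocation kind.
enum class AllocKind { New, ReuseBuffer, Evaluate };

inline std::string allocateOne(AllocKind kind, bool dynamic_evaluate) {
  switch (kind) {
    case AllocKind::New:
      return "allocate";  // would call allocateNewOutput: empty_strided_cuda + NaN fill
    case AllocKind::ReuseBuffer:
      return "reuse";     // would call reuseInputBuffer: hand the aliased input back out
    case AllocKind::Evaluate:
      // Dynamic aliases are deferred (monostate) and resolved by the caller
      // after all inputs are bound; otherwise evaluateAlias runs here.
      return dynamic_evaluate ? "defer" : "evaluate";
  }
  throw std::logic_error("unexpected allocation kind");
}
```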
Deprecated Function

The polymorphicValueToBytes function is still present but marked for deprecation in favor of tensorToBytes. Ensure that all calls to polymorphicValueToBytes for CUDA tensors are replaced with tensorToBytes.

      for (const auto& field : dtype_.fields) {
        if (!field.used_in_kernel) {
          continue;
        }
        auto field_data = polymorphicValueToBytes(
            argument->*field.name, *field.type, index_type);
        buffer.insert(buffer.end(), field_data.begin(), field_data.end());
      }
      return buffer;
    }
  } else if (argument.is<Opaque>()) {
    return argument.as<Opaque>().bytes();
  } else {
    NVF_THROW(
        "Cannot convert ", argument.type().name(), " to kernel argument data.");
  }
}

std::vector<std::byte> tensorToBytes(
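The general pattern behind a bytes-based argument holder can be illustrated without nvFuser types: append the raw bytes of trivially copyable values (a data pointer, then sizes and strides in the kernel's index type) to one contiguous buffer. appendBytes and packTensorArg below are hypothetical helpers for illustration, not the actual tensorToBytes signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Append the object representation of a trivially copyable value to a
// byte buffer. Raw byte copies are only valid for trivially copyable types,
// which the static_assert enforces.
template <typename T>
void appendBytes(std::vector<std::byte>& buffer, const T& value) {
  static_assert(std::is_trivially_copyable_v<T>,
                "raw byte copies require trivially copyable types");
  const auto* begin = reinterpret_cast<const std::byte*>(&value);
  buffer.insert(buffer.end(), begin, begin + sizeof(T));
}

// Pack a fake 1-D tensor descriptor: data pointer, then size and stride
// as 64-bit indices, mimicking the shape of a kernel argument record.
inline std::vector<std::byte> packTensorArg(
    const void* data, std::int64_t size, std::int64_t stride) {
  std::vector<std::byte> buffer;
  appendBytes(buffer, data);
  appendBytes(buffer, size);
  appendBytes(buffer, stride);
  return buffer;
}
```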

@csarofeen (Collaborator Author)

!test

@csarofeen (Collaborator Author)

!test

@csarofeen (Collaborator Author)

!test

@csarofeen (Collaborator Author)

Besides clang build failures, all tests are passing here.

buffer_info.shape_info.unsharded_logical_sizes.size()
? buffer_info.shape_info.unsharded_logical_sizes
: buffer_info.shape_info.logical_sizes;
const auto& alloc_stride = buffer_info.shape_info.allocation_strides;
Collaborator Author: Turning args into binaries is why I added the input global buffer infos.


// Initialize the executor entry if not initlized
if (!executor_entry->init) {
// std::cout << "Initializing executor entry" << std::endl;
Collaborator Author: cleanup

"Expected blockDim.x >= 32 but found ",
launch_params.bdimx());

std::vector<GlobalBufferInfo> input_info;
Collaborator Author: Construct input buffer infos, as they're used to construct the binary args for the kernel.


// Reset the arguments that we'll pass to cuLaunchKernel. This needs to be
// invoked on every shape change.
void KernelExecutor::recomputeArgs(
Collaborator Author: Computing args is <1us; this isn't a worthwhile optimization at this point.

@csarofeen marked this pull request as ready for review March 3, 2025 16:55
std::vector<GlobalBufferInfo> output_info = getBufferInfos(
std::vector<GlobalBufferInfo> output_infos = getBufferInfos(
expr_eval, PrimDataType::Int, host_ir_container_->outputs());
output_args.resize(host_ir_container_->outputs().size());
Collaborator: Is this line a no-op?

Collaborator Author: Yes, thank you for catching that.

Collaborator Author: Removed.

// Resolving TMA requires binding all values and evaluating the TMA
// arguments
args = resolveTMA(*executor_entry, args);
} else if (has_rng_) {
@zasdfgbnm (Collaborator) commented Mar 3, 2025: So we cannot use TMA and RNG in the same kernel?

Collaborator Author: resolveTMA will resolve RNG values too; it actually resolves just about any argument that's missing a value.

Collaborator Author: In the future I'd like to clean up the resolution of missing params to bind only the inputs necessary for their evaluation. That would help unify the different runtime resolutions we need (dynamic alias, RNG, TMA, and anything else we come up with). I'd similarly like to do something like this for the expression evaluator executor/host IR.

Collaborator: This makes sense. For now, could you update the comment in the code to say that this also resolves RNG? Otherwise it looks like a bug.

Collaborator Author: Will do.

Collaborator Author: Added.

@jacobhinkle (Collaborator) left a comment: Only superficial comments so far.

DataType index_dtype,
const std::vector<Val*>& fusion_outputs) {
FUSER_PERF_SCOPE("fusion_executor::allocations::getOutbufferInfo");
FUSER_PERF_SCOPE("fusion_executor::allocations::getOutbufferInfos");
Collaborator: This function is actually called getBufferInfos.

Collaborator Author: Renamed.

Comment on lines +299 to +304
// slice, and that slice has a symbolic integer it depends on, then this
// function returns true.
//
// This could happen for other examples and this function will return true if
// to evaluate the output that has an alias, other values besides the aliased
// input need to be bound to the expression evaluator to evaluate the output.
Collaborator: Nit: "this function returns true" should be "we set has_dynamic_alias_ to true".

Collaborator Author: Thanks. I moved it out to a function, then reverted that and didn't remember to update the comment.

Collaborator Author: Fixed.

}
} // namespace

// TODO: Reduce bindings to only those necessaary to resolve missing params.
Collaborator: Suggested change:

// TODO: Reduce bindings to only those necessaary to resolve missing params.
// TODO: Reduce bindings to only those necessary to resolve missing params.

Collaborator Author: Fixed.

continue;
}
}
if (out_tv->isConst()) {
Collaborator: I think isConst on a tensor is always false. What is this part supposed to do?

Collaborator Author: Good catch! This should be input->isConst(), not out_tv->isConst().

Collaborator Author: It should be in the loop above.

Collaborator Author: Fixed.

auto alias_info = tv->fusion()->getOutputAlias(tv);
if (alias_info.type == AllocationType::Evaluate) {
auto val = expr_eval.evaluate(tv);
NVF_ERROR(val.hasValue() && val.is<at::Tensor>(), "Output is not a Tensor");
Collaborator: Nit: the hasValue check is redundant. Suggested change:

NVF_ERROR(val.hasValue() && val.is<at::Tensor>(), "Output is not a Tensor");
NVF_ERROR(val.is<at::Tensor>(), "Output is not a Tensor");

Collaborator Author: I didn't realize that, thanks.

Collaborator Author: Removed the redundant check.

NVF_ERROR(
input.is<at::Tensor>() && input.as<at::Tensor>().is_cuda(),
"Only CUDA tensors are supported for direct nvRTC launches at this time.");
if (input.is<at::Tensor>() && input.as<at::Tensor>().is_cuda()) {
Collaborator: Nit: this check is redundant.

Collaborator Author: Removed.

Comment on lines +525 to +528
info.shape_info.allocation_sizes = sizes;
info.shape_info.allocation_strides = strides;
info.shape_info.logical_sizes = sizes;
info.shape_info.logical_strides = strides;
Collaborator: Instead of copying sizes into both allocation_sizes and logical_sizes, could we leave allocation_sizes empty and say that "if allocation size is empty, it is equal to logical size"? That way we save a copy.

If you believe this micro-optimization is unimportant, we can at least replace the last two lines with:

    info.shape_info.logical_sizes = std::move(sizes);
    info.shape_info.logical_strides = std::move(strides);
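The copy-vs-move distinction behind this suggestion can be shown in a self-contained snippet; ShapeInfo and fillShapeInfo here are stand-ins for the real structs, not nvFuser code. The first consumer of sizes must copy because the value is still needed, but the last consumer can move and steal the buffer instead of duplicating it.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the real shape-info struct: two vectors filled from the
// same source sizes.
struct ShapeInfo {
  std::vector<std::int64_t> allocation_sizes;
  std::vector<std::int64_t> logical_sizes;
};

inline ShapeInfo fillShapeInfo(std::vector<std::int64_t> sizes) {
  ShapeInfo info;
  info.allocation_sizes = sizes;          // copy: `sizes` is still used below
  info.logical_sizes = std::move(sizes);  // last use: move, no second copy
  return info;
}
```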

Collaborator Author: I think it's a reasonable approach; it's what I did for the unsharded logical sizes. If it's quick enough I'll do it in this PR; if not, I'll leave it for the future.

Collaborator Author: TODO

Collaborator Author: Done

out_tensors[out_idx] = inp;
} else if (
fusion->getOutputAlias(out_info.tv).type == AllocationType::Evaluate) {
if (dynamic_evaluate) {
Collaborator: I understand this PR tries to avoid expression evaluation on the host path, which is great. But I couldn't understand what this flag gives us.

AFAICT from https://github.com/NVIDIA/Fuser/pull/3952/files#diff-fc385708d95b79cfddb58136479569b5d0aa608458bcb19759afedb788f308ddR1081, with this PR the logic is:

if the fusion has dynamic aliases:
  Skip evaluating dynamic aliases in allocateOutputs
  Instead, evaluate them after allocateOutputs
else:
  Allocate/evaluate all outputs in allocateOutputs
  # Because there are no dynamic aliases, we are sure this won't call ExpressionEvaluator

Can we simply set dynamic_alias to true always and do the if-then branch in the callers of allocateOutputs? That way, dynamic aliases (aka AllocationType::Evaluate) are always computed outside allocateOutputs.

Alternatively, can we simply set dynamic_alias to false and do the if-else branch in the callers? That way, dynamic aliases are always computed inside allocateOutputs, as before. I don't see much difference between evaluating dynamic aliases inside vs. outside; both call ExpressionEvaluator to bind inputs and evaluate outputs.

Note: HostIrExecutor is for communication only, so the fusions it runs always have (or at least should have) AllocationType::New.

Collaborator: Great speedup, by the way!

Collaborator Author: Just to make sure I follow correctly, the logic in KernelExecutor::run is:

allocate outputs (dynamic alias)
  Allocate new buffers
  Alias reuse buffers
  if (!dynamic alias)
    bind one input alias
    evaluate output alias
if dynamic alias
  bind all inputs
  evaluate any missing outputs

I'm not sure I understood correctly, so apologies if I'm missing something, but I think the reason we send dynamic alias to allocate outputs is that when it's false we can drastically simplify what needs to be bound to the expression evaluator to infer the outputs. With more work, the dynamic-alias path could bind only what's necessary, but today it binds everything. In the non-dynamic-alias case we still need to bind something, but I special-cased it to bind only the aliased input, which I believe is far more efficient than binding all inputs and all sizes of each input.

This may be an uncommon special case, though I hope in the future we can reduce the amount of binding and evaluation for dynamic cases.

Collaborator Author: Are you suggesting we lift the if (!dynamic alias) case into executor.cpp? That would move all AllocationType::Evaluate logic together; maybe that's what you're suggesting would help clarify the logic.

Collaborator Author: If that's true for HostIrExecutor, we could remove the extra processing logic there. Do you think it will stay that way?

Collaborator: Thanks for clarifying! I realized I missed several important details:

  1. AllocationType::Evaluate includes more than "dynamic aliases". For example, ExprEvalScheduler marks matmul outputs as ::Evaluate even though they alias nothing.

  2. You believe that binding only the input aliases is much faster than binding all inputs.

  3. Even when an output is ::Evaluate and has an aliased_io, computing it may require binding inputs other than aliased_io.

Based on these new understandings, I can't think of a simpler way than the current PR without significant changes.

One thing I might consider in the future is to set aliased_io only when binding just that input is sufficient to compute the output. That way, we don't need a separate dynamic_alias parameter and the logic can be simplified to:

allocate outputs
  Allocate new buffers
  Alias reuse buffers
  For each dynamic alias
    if aliased_io is null:
      set output to monostate
    else:
      bind one input alias
      evaluate output alias
If any output is missing (i.e. monostate):
  bind all inputs
  evaluate any missing outputs

Anyhow, thanks for going through the convoluted existing logic here! Most of it came from me trying to support meta ops.

Collaborator:

"If that's true for HostIrExecutor, we could remove the extra processing logic there."

Yes, I believe https://github.com/NVIDIA/Fuser/pull/3952/files#diff-6f15ae1df14ce7b6b049b46305a6b7991731bdcd859fba15a09b63adb77ca670R151-R161 is unnecessary.

"Do you think it will stay that way?"

Yes. I don't plan to use HostIrExecutor for more than launching communications. The ongoing host IR integration happens in HostIrEvaluator (not HostIrExecutor). I believe we will run into the same latency issue with ExpressionEvaluator when we try to turn host IR on by default. Yikes!

@csarofeen (Collaborator Author)

!test

@wujingyue (Collaborator) left a comment: LGTM

@csarofeen (Collaborator Author)

!build

@csarofeen merged commit 2e90df8 into main on Mar 5, 2025 (15 checks passed)
@csarofeen deleted the executor_cleanup branch March 5, 2025 02:14