[DO NOT MERGE] Add some llama latency scripts for testing#3813

Closed
csarofeen wants to merge 18 commits into main from llama_testing
Conversation

@csarofeen
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 3, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Overhead

The addition of FUSER_PERF_SCOPE in multiple functions may introduce performance overhead. Ensure that the performance impact is minimal and that the profiling scopes are necessary for the intended performance analysis.

    FUSER_PERF_SCOPE("Val::evaluate");
  }

  ExpressionEvaluator ee;
  auto evaluated_val = ee.evaluate(this);
  NVF_ERROR(
      evaluated_val.hasValue(),
      "Detected a const value but failed to infer its value: ",
      toInlineString());
  return evaluated_val;
}

std::vector<PolymorphicValue> Expr::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("Expr::evaluate");
  NVF_THROW(
      "`evaluate` method for expression ",
      getOpString(),
      " is not defined. ",
      "Please override the evaluate method");
}

std::vector<PolymorphicValue> Expr::evaluate(
    const ExpressionEvaluator& ee,
    std::unordered_map<const Val*, PolymorphicValue>& known_values) const {
  FUSER_PERF_SCOPE("Expr::evaluate");
  std::vector<PolymorphicValue> expr_inputs;
  expr_inputs.reserve(inputs().size());
  for (auto inp : inputs()) {
    const auto& eval_i = ee.evaluate(inp, known_values);
    if (!eval_i.hasValue()) {
      return {std::monostate{}};
    }
    expr_inputs.emplace_back(eval_i);
  }
  return this->evaluate(ee, expr_inputs);
}

void Expr::addDataAttribute(PolymorphicValue attr) {
  addAttribute(IrBuilder::createInContainer<Val>(container(), std::move(attr)));
}
std::vector<PolymorphicValue> FullOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("FullOp::evaluate");
  std::vector<int64_t> shape;
  for (auto i : c10::irange(inputs.size() - 1)) {
    shape.push_back(inputs.at(i).as<int64_t>());
  }
  DataType dtype = getFillValue()->getDataType().value();
  const auto options =
      at::TensorOptions().device(at::kCUDA).dtype(data_type_to_aten(dtype));
  using namespace PolymorphicValue_functions;
  return {at::full(shape, toScalar(inputs.back()), options)};
}

std::vector<PolymorphicValue> SelectOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("SelectOp::evaluate");
  const auto& in = inputs.at(0).as<at::Tensor>();
  int64_t dimension = dim();
  int64_t index = (int64_t)inputs.at(1);
  return {in.select(dimension, index)};
}

std::vector<PolymorphicValue> IndexSelectOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("IndexSelectOp::evaluate");
  const auto& in = inputs.at(0).as<at::Tensor>();
  int64_t dimension = dim();
  const auto& indices = inputs.at(1).as<at::Tensor>().squeeze();
  return {at::index_select(in, dimension, indices)};
}

std::vector<PolymorphicValue> TorchGatherOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("TorchGatherOp::evaluate");
  const auto& input = inputs.at(0).as<at::Tensor>();
  const auto& index = inputs.at(1).as<at::Tensor>();
  auto dimension = dim();
  if (exactSizes()) {
    return {at::take_along_dim(input, index, dimension)};

Profiling Code

The profiling code is commented out. Ensure that the profiling code is either removed or properly integrated if it is intended for performance analysis.

# for _ in range(3):
#     fd.execute(inputs)

# torch.cuda.synchronize()
# start = time.time()
# # Mark the profiling region
# torch.cuda.cudart().cudaProfilerStart()

# for _ in range(100):
#     fd.execute(inputs)

# torch.cuda.cudart().cudaProfilerStop()
# torch.cuda.synchronize()
# end = time.time()
# print(end-start)
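The commented-out harness above follows a common CUDA timing pattern: a few warmup iterations, a device synchronize, then a fixed number of timed iterations bracketed by cudaProfilerStart/Stop so nsys can capture just that region. A minimal, reusable sketch of the same pattern is below; `execute` is a hypothetical stand-in for `fd.execute(inputs)` from these scripts, and the `sync`/`profiler_start`/`profiler_stop` hooks are where `torch.cuda.synchronize` and `torch.cuda.cudart().cudaProfilerStart`/`...Stop` would be passed on a GPU machine:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_region(sync=None, profiler_start=None, profiler_stop=None):
    """Time a block of work in milliseconds.

    sync: optional callable run before and after timing
          (e.g. torch.cuda.synchronize, so pending GPU work is included).
    profiler_start / profiler_stop: optional callables bracketing the
          region (e.g. torch.cuda.cudart().cudaProfilerStart / ...Stop,
          matching nsys's `-c cudaProfilerApi` capture mode).
    """
    if sync is not None:
        sync()
    if profiler_start is not None:
        profiler_start()
    result = {}
    t0 = time.perf_counter()
    try:
        yield result
    finally:
        if profiler_stop is not None:
            profiler_stop()
        if sync is not None:
            sync()
        result["ms"] = (time.perf_counter() - t0) * 1000.0

def execute(inputs):
    # Hypothetical stand-in for fd.execute(inputs) from the llama scripts.
    return [sum(inputs)]

inputs = [1.0, 2.0, 3.0]
for _ in range(3):          # warmup, as in the commented-out code
    execute(inputs)

with timed_region() as t:   # on GPU: sync=torch.cuda.synchronize, etc.
    for _ in range(100):
        execute(inputs)
print(t["ms"], "ms")
```

Factoring the harness out this way keeps the per-graph scripts free of copy-pasted timing boilerplate, which is the main hazard the review comment is flagging.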

@csarofeen csarofeen changed the base branch from expr_eval_exec_devel to main February 4, 2025 18:25
@csarofeen csarofeen changed the base branch from main to expr_eval_exec_devel February 4, 2025 18:40
@github-actions

github-actions bot commented Mar 2, 2025

Review updated until commit 3622f20

Description

  • Add latency scripts for testing LLaMA inference.

  • Implement fast execution for permute operations.

  • Support dynamic reshape operations in expression evaluator.

  • Optimize input binding and reduce unnecessary computations.


Changes walkthrough 📝

Relevant files — Enhancement

graph_0.py (tests/python/llama_inf_tests/graph_0.py, +131/-0)
Add latency test for graph 0
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

graph_1.py (tests/python/llama_inf_tests/graph_1.py, +208/-0)
Add latency test for graph 1
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

graph_2.py (tests/python/llama_inf_tests/graph_2.py, +158/-0)
Add latency test for graph 2
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Gap

The gap between "Before" and "After" is large: total time drops from 12.0 ms to 3.1 ms, nearly a 4x improvement. Verify that the metrics are measured correctly and that the changes are expected to have such a large impact.

print((end-start)*1000, " ms")

# Before:
# 12.0 ms
# After:
# 3.1 ms
Performance Gap

The gap between "Before" and "After" is large: total time drops from 19.8 ms to 10.6 ms, roughly a 2x improvement. Verify that the metrics are measured correctly and that the changes are expected to have such a large impact.

print((end-start)*1000, " ms")

# Before:
# 19.8 ms
# After:
# 10.6 ms
Performance Gap

The gap between "Before" and "After" is minimal: 18.9 ms vs 18.8 ms. Verify whether the changes were expected to improve this graph, or whether another component dominates its runtime.

# Before:
# 18.9 ms
# After:
# 18.8 ms

# rm report*
# nsys profile -c cudaProfilerApi python tests/python/llama_inf_tests/graph_2.py

@csarofeen csarofeen changed the base branch from expr_eval_exec_devel to main March 2, 2025 21:49

csarofeen added a commit that referenced this pull request Mar 5, 2025

Removes as much expression evaluation as possible on matching inputs to KernelExecutor. Results for llama-based latency tests on H200-DGX are tracked in #3813. That PR also contains CPU-based profiling results showing how much latency has been improved for KernelExecutor. This PR does not add any new functionality.

Graph 0:
Total time: 12 ms -> 3.1 ms
KernelExecutor::runFusion: 35.5 us -> 10.5 us

Graph 1:
Total time: 19.8 ms -> 10.6 ms
KernelExecutor::runFusion: 36.4 us -> 13.1 us

Graph 2:
Total time: 18.9 ms -> 18.8 ms
KernelExecutor::runFusion: 28.6 us -> 11.1 us

For Graph 2 we would need to improve ExprEvalExecutor, as it takes up 60% of the runtime while KernelExecutor takes only 20%.
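For reference, the before/after figures quoted in the commit message translate into these speedup factors (plain arithmetic over the numbers above, no new measurements):

```python
# Before/after figures copied from the commit message above:
# total latency in ms, KernelExecutor::runFusion time in us.
results = {
    "graph_0": {"total_ms": (12.0, 3.1), "runFusion_us": (35.5, 10.5)},
    "graph_1": {"total_ms": (19.8, 10.6), "runFusion_us": (36.4, 13.1)},
    "graph_2": {"total_ms": (18.9, 18.8), "runFusion_us": (28.6, 11.1)},
}

for name, metrics in results.items():
    for metric, (before, after) in metrics.items():
        print(f"{name} {metric}: {before} -> {after} ({before / after:.2f}x)")
```

Note how Graph 2's runFusion time improves by roughly 2.6x while its total latency barely moves (about 1.01x), which is consistent with the observation that ExprEvalExecutor, not KernelExecutor, dominates that graph's runtime.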
@csarofeen csarofeen closed this Jan 12, 2026