[DO NOT MERGE] Add some llama latency scripts for testing#3813

Closed
csarofeen wants to merge 18 commits into main from llama_testing
Conversation

@csarofeen
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 3, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Overhead

The addition of FUSER_PERF_SCOPE in multiple functions may introduce performance overhead. Ensure that the performance impact is minimal and that the profiling scopes are necessary for the intended performance analysis.

    FUSER_PERF_SCOPE("Val::evaluate");
  }

  ExpressionEvaluator ee;
  auto evaluated_val = ee.evaluate(this);
  NVF_ERROR(
      evaluated_val.hasValue(),
      "Detected a const value but failed to infer its value: ",
      toInlineString());
  return evaluated_val;
}

std::vector<PolymorphicValue> Expr::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("Expr::evaluate");
  NVF_THROW(
      "`evaluate` method for expression ",
      getOpString(),
      " is not defined. ",
      "Please override the evaluate method");
}

std::vector<PolymorphicValue> Expr::evaluate(
    const ExpressionEvaluator& ee,
    std::unordered_map<const Val*, PolymorphicValue>& known_values) const {
  FUSER_PERF_SCOPE("Expr::evaluate");
  std::vector<PolymorphicValue> expr_inputs;
  expr_inputs.reserve(inputs().size());
  for (auto inp : inputs()) {
    const auto& eval_i = ee.evaluate(inp, known_values);
    if (!eval_i.hasValue()) {
      return {std::monostate{}};
    }
    expr_inputs.emplace_back(eval_i);
  }
  return this->evaluate(ee, expr_inputs);
}

void Expr::addDataAttribute(PolymorphicValue attr) {
  addAttribute(IrBuilder::createInContainer<Val>(container(), std::move(attr)));
}
std::vector<PolymorphicValue> FullOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("FullOp::evaluate");
  std::vector<int64_t> shape;
  for (auto i : c10::irange(inputs.size() - 1)) {
    shape.push_back(inputs.at(i).as<int64_t>());
  }
  DataType dtype = getFillValue()->getDataType().value();
  const auto options =
      at::TensorOptions().device(at::kCUDA).dtype(data_type_to_aten(dtype));
  using namespace PolymorphicValue_functions;
  return {at::full(shape, toScalar(inputs.back()), options)};
}

std::vector<PolymorphicValue> SelectOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("SelectOp::evaluate");
  const auto& in = inputs.at(0).as<at::Tensor>();
  int64_t dimension = dim();
  int64_t index = (int64_t)inputs.at(1);
  return {in.select(dimension, index)};
}

std::vector<PolymorphicValue> IndexSelectOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("IndexSelectOp::evaluate");
  const auto& in = inputs.at(0).as<at::Tensor>();
  int64_t dimension = dim();
  const auto& indices = inputs.at(1).as<at::Tensor>().squeeze();
  return {at::index_select(in, dimension, indices)};
}

std::vector<PolymorphicValue> TorchGatherOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  FUSER_PERF_SCOPE("TorchGatherOp::evaluate");
  const auto& input = inputs.at(0).as<at::Tensor>();
  const auto& index = inputs.at(1).as<at::Tensor>();
  auto dimension = dim();
  if (exactSizes()) {
    return {at::take_along_dim(input, index, dimension)};

Profiling Code

The profiling code is commented out. Ensure that the profiling code is either removed or properly integrated if it is intended for performance analysis.

# for _ in range(3):
#     fd.execute(inputs)

# torch.cuda.synchronize()
# start = time.time()
# # Mark the profiling region
# torch.cuda.cudart().cudaProfilerStart()

# for _ in range(100):
#     fd.execute(inputs)

# torch.cuda.cudart().cudaProfilerStop()
# torch.cuda.synchronize()
# end = time.time()
# print(end-start)
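The commented-out harness above follows a common CUDA timing pattern: a few warmup iterations, a device synchronize, then a fixed number of timed iterations bracketed by cudaProfilerStart/Stop so nsys can capture just that region. A minimal, reusable sketch of the same pattern is below; `execute` is a hypothetical stand-in for `fd.execute(inputs)` from these scripts, and the `sync`/`profiler_start`/`profiler_stop` hooks are where `torch.cuda.synchronize` and `torch.cuda.cudart().cudaProfilerStart`/`...Stop` would be passed on a GPU machine:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_region(sync=None, profiler_start=None, profiler_stop=None):
    """Time a block of work in milliseconds.

    sync: optional callable run before and after timing
          (e.g. torch.cuda.synchronize, so pending GPU work is included).
    profiler_start / profiler_stop: optional callables bracketing the
          region (e.g. torch.cuda.cudart().cudaProfilerStart / ...Stop,
          matching nsys's `-c cudaProfilerApi` capture mode).
    """
    if sync is not None:
        sync()
    if profiler_start is not None:
        profiler_start()
    result = {}
    t0 = time.perf_counter()
    try:
        yield result
    finally:
        if profiler_stop is not None:
            profiler_stop()
        if sync is not None:
            sync()
        result["ms"] = (time.perf_counter() - t0) * 1000.0

def execute(inputs):
    # Hypothetical stand-in for fd.execute(inputs) from the llama scripts.
    return [sum(inputs)]

inputs = [1.0, 2.0, 3.0]
for _ in range(3):          # warmup, as in the commented-out code
    execute(inputs)

with timed_region() as t:   # on GPU: sync=torch.cuda.synchronize, etc.
    for _ in range(100):
        execute(inputs)
print(t["ms"], "ms")
```

Factoring the harness out this way keeps the per-graph scripts free of copy-pasted timing boilerplate, which is the main hazard the review comment is flagging.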

@csarofeen csarofeen changed the base branch from expr_eval_exec_devel to main February 4, 2025 18:25
@csarofeen csarofeen changed the base branch from main to expr_eval_exec_devel February 4, 2025 18:40
@github-actions

github-actions bot commented Mar 2, 2025

Review updated until commit 3622f20

Description

  • Add latency scripts for testing LLaMA inference.

  • Implement fast execution for permute operations.

  • Support dynamic reshape operations in expression evaluator.

  • Optimize input binding and reduce unnecessary computations.


Changes walkthrough 📝

Relevant files — Enhancement

graph_0.py (tests/python/llama_inf_tests/graph_0.py, +131/-0)
Add latency test for graph 0
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

graph_1.py (tests/python/llama_inf_tests/graph_1.py, +208/-0)
Add latency test for graph 1
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

graph_2.py (tests/python/llama_inf_tests/graph_2.py, +158/-0)
Add latency test for graph 2
  • Define and execute a complex fusion graph for LLaMA inference.
  • Include timing utilities to measure execution latency.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Gap

The gap between "Before" and "After" is large: total time drops from 12.0 ms to 3.1 ms, nearly a 4x improvement. Verify that the metrics are measured correctly and that the changes are expected to have such a large impact.

print((end-start)*1000, " ms")

# Before:
# 12.0 ms
# After:
# 3.1 ms
Performance Gap

The gap between "Before" and "After" is large: total time drops from 19.8 ms to 10.6 ms, roughly a 2x improvement. Verify that the metrics are measured correctly and that the changes are expected to have such a large impact.

print((end-start)*1000, " ms")

# Before:
# 19.8 ms
# After:
# 10.6 ms
Performance Gap

The gap between "Before" and "After" is minimal: 18.9 ms vs 18.8 ms. Verify whether the changes were expected to improve this graph, or whether another component dominates its runtime.

# Before:
# 18.9 ms
# After:
# 18.8 ms

# rm report*
# nsys profile -c cudaProfilerApi python tests/python/llama_inf_tests/graph_2.py

@csarofeen csarofeen changed the base branch from expr_eval_exec_devel to main March 2, 2025 21:49

csarofeen added a commit that referenced this pull request Mar 5, 2025

Removes as much expression evaluation as possible on matching inputs to KernelExecutor. Results for llama-based latency tests on H200-DGX are tracked in #3813. That PR also contains CPU-based profiling results showing how much latency has been improved for KernelExecutor. This PR does not add any new functionality.

Graph 0:
Total time: 12 ms -> 3.1 ms
KernelExecutor::runFusion: 35.5 us -> 10.5 us

Graph 1:
Total time: 19.8 ms -> 10.6 ms
KernelExecutor::runFusion: 36.4 us -> 13.1 us

Graph 2:
Total time: 18.9 ms -> 18.8 ms
KernelExecutor::runFusion: 28.6 us -> 11.1 us

For Graph 2 we would need to improve ExprEvalExecutor, as it takes up 60% of the runtime while KernelExecutor takes only 20%.
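For reference, the before/after figures quoted in the commit message translate into these speedup factors (plain arithmetic over the numbers above, no new measurements):

```python
# Before/after figures copied from the commit message above:
# total latency in ms, KernelExecutor::runFusion time in us.
results = {
    "graph_0": {"total_ms": (12.0, 3.1), "runFusion_us": (35.5, 10.5)},
    "graph_1": {"total_ms": (19.8, 10.6), "runFusion_us": (36.4, 13.1)},
    "graph_2": {"total_ms": (18.9, 18.8), "runFusion_us": (28.6, 11.1)},
}

for name, metrics in results.items():
    for metric, (before, after) in metrics.items():
        print(f"{name} {metric}: {before} -> {after} ({before / after:.2f}x)")
```

Note how Graph 2's runFusion time improves by roughly 2.6x while its total latency barely moves (about 1.01x), which is consistent with the observation that ExprEvalExecutor, not KernelExecutor, dominates that graph's runtime.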
@csarofeen csarofeen closed this Jan 12, 2026