Merged
7 changes: 6 additions & 1 deletion csrc/host_ir/container.cpp
@@ -26,7 +26,7 @@ HostIrContainer::HostIrContainer(int64_t num_kernel_executors)
HostIrContainer::~HostIrContainer() = default;

Stream* HostIrContainer::getDefaultStream() {
if (!default_stream_) {
if (default_stream_ == nullptr) {
default_stream_ = IrBuilder::createInContainer<Stream>(this);
}
return default_stream_;
@@ -35,6 +35,11 @@ Stream* HostIrContainer::getDefaultStream() {
std::ostream& HostIrContainer::print(std::ostream& os) const {
IrMathPrinter op_exprs(os);
op_exprs.handle(this);
os << "Aliases:{";
for (const auto& alias : alias_) {
os << "\n " << alias.first << " -> " << alias.second;
}
os << "\n}\n";
return os;
}

12 changes: 12 additions & 0 deletions csrc/host_ir/container.h
@@ -55,10 +55,22 @@ class HostIrContainer final : public Fusion {

Stream* getDefaultStream();

void markAlias(TensorView* original, const TensorView* new_alias) {
while (alias_.count(original)) {
original = alias_[original]->as<TensorView>();
}
alias_[new_alias] = original;
}

const auto& alias() const {
return alias_;
}

private:
std::vector<Expr*> top_level_exprs_;
std::vector<std::unique_ptr<KernelExecutor>> kernel_executors_;
Stream* default_stream_ = nullptr;
std::unordered_map<const Val*, Val*> alias_;
Collaborator:
Why not TensorView* since markAlias takes TensorView?

Collaborator (author):
It is more convenient, IMO, to keep a Val* here so that Scalars and TensorViews get uniform treatment when we get or set through the method getAlias. If we changed the signature here to TensorView*, getAlias would need to branch: downcast the Val* to a TensorView*, get the alias, and then upcast the obtained TensorView* back to a Val*.

This is doable but, IMO, slightly less natural. Besides, we restrict aliasing to TensorViews only for simplicity, not for a structural reason, which is why I made this choice.

Let me know what you prefer.

Collaborator:

> but not for a structural reason

What does it mean for a scalar to alias another scalar?

Collaborator (author), @samnordmann, Apr 15, 2025:

> What does it mean for a scalar to alias another scalar?

The same as for a tensor: a different name pointing to the same data (in other words, seeing the scalar as a 0-dim tensor). But there is no real motivation to support that for now.

Collaborator:
FYI, a scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

I'm fine with this code as is. Personally, I prefer std::unordered_map<TensorView*, TensorView*> because it gives the right impression that only TensorViews can alias. std::unordered_map<Val*, TensorView*> is OK as well if the key is very often a Val* and you want to avoid a lot of typing of ->as<TensorView>().

Collaborator (author), @samnordmann, Apr 16, 2025:

> FYI, a scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

OK, I'll leave it as is for now. For the sake of the discussion, though, and to make sure I am not missing something:

A scalar can be viewed as a 0-dim tensor, mathematically. It is then an implementation detail whether the scalar type owns a pointer or a value. I understand that PyTorch chooses to have at::Scalar represent the value and not the pointer, contrary to at::Tensor. However, in our context we add one level of indirection (through hir-aliasing and the expression evaluator), so aliasing is always possible, even for scalars. More precisely: the symbolic object (scalar or tensor) is a Val*, mapped through (hir-alias + ExprEvaluator) to a PolymorphicValue (e.g. an at::Tensor or an at::Scalar). Two Val* can always be mapped to the same concrete value; in other words, aliasing is always possible.
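To illustrate the chain-following behavior of markAlias discussed in this thread, here is a minimal standalone sketch. It is hypothetical illustration code, not the actual nvFuser types: std::string stands in for Val*/TensorView*, and the class and method names mirror the PR only loosely.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy stand-in for HostIrContainer's alias map. markAlias follows any
// existing chain at insertion time, so every alias entry maps directly
// to its root and lookups never need to loop.
class AliasMap {
 public:
  void markAlias(std::string original, const std::string& new_alias) {
    // If `original` is itself an alias, walk to the root it points at.
    while (alias_.count(original) > 0) {
      original = alias_.at(original);
    }
    alias_[new_alias] = original;
  }

  // Resolve a name to its root (returns the name itself if it aliases
  // nothing), mirroring what a getAlias-style accessor would do.
  std::string resolve(const std::string& name) const {
    auto it = alias_.find(name);
    return it == alias_.end() ? name : it->second;
  }

 private:
  std::unordered_map<std::string, std::string> alias_;
};
```

Because the chain is flattened when the alias is recorded, a later lookup is a single map access even after repeated re-aliasing.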

};

} // namespace hir
218 changes: 110 additions & 108 deletions csrc/host_ir/executor.cpp
@@ -191,32 +191,6 @@ KernelArgumentHolder HostIrExecutor::run(

namespace hir {

namespace {

at::Tensor getKnownTensorOrUndefined(
Val* val,
const ExpressionEvaluator& expr_evaluator) {
return expr_evaluator.isKnown(val)
? expr_evaluator.evaluate(val).as<at::Tensor>()
: at::Tensor();
}

KernelArgumentHolder getKnownTensorOrUndefined(
const std::vector<Val*>& vals,
const ExpressionEvaluator& expr_evaluator) {
std::vector<at::Tensor> tensors(vals.size());
std::transform(
vals.begin(),
vals.end(),
tensors.begin(),
[&expr_evaluator](Val* val) -> at::Tensor {
return getKnownTensorOrUndefined(val, expr_evaluator);
});
return KernelArgumentHolder(tensors);
}

} // namespace

HostIrEvaluator::HostIrEvaluator(
std::unique_ptr<HostIrContainer> container,
Communicator* communicator,
@@ -238,28 +212,39 @@ HostIrEvaluator::HostIrEvaluator(
{container_->getDefaultStream(),
c10::cuda::getDefaultCUDAStream(
static_cast<c10::DeviceIndex>(device_index))});
expr_evaluator_.bind("numberOfStreams", params_.number_of_streams);
NVF_ERROR(
std::all_of(
container_->inputs().begin(),
container_->inputs().end(),
[this](Val* input) { return !container_->alias().count(input); }),
"Inputs cannot be aliased");
}

KernelArgumentHolder HostIrEvaluator::dispatchAndCollectOutputs() {
KernelArgumentHolder HostIrEvaluator::runWithInput(
const std::unordered_map<Val*, PolymorphicValue>& val_to_PValue) {
expr_evaluator_ = ExpressionEvaluator();
expr_evaluator_.bind("numberOfStreams", params_.number_of_streams);
// process input values, converting IValue to PolymorphicValue
for (const auto& [val, pvalue] : val_to_PValue) {
bind(val, pvalue);
}

// Interpret each instruction in an "eager" way by iterate over the Host Ir
// Container's top level expression list
for (auto expr : container_->topLevelExprs()) {
dispatch(expr);
}

// Collect global outputs
return getKnownTensorOrUndefined(container_->outputs(), expr_evaluator_);
}

KernelArgumentHolder HostIrEvaluator::runWithInput(
const std::unordered_map<Val*, PolymorphicValue>& val_to_PValue) {
// process input values, converting IValue to PolymorphicValue
for (const auto& [val, pvalue] : val_to_PValue) {
expr_evaluator_.bind(val, pvalue);
}

return dispatchAndCollectOutputs();
std::vector<at::Tensor> outputs(container_->outputs().size());
std::transform(
container_->outputs().begin(),
container_->outputs().end(),
outputs.begin(),
[this](Val* val) -> at::Tensor {
return this->getKnownTensorOrUndefined(val);
});
return KernelArgumentHolder(outputs);
}

std::string HostIrEvaluator::canRun() const {
@@ -342,13 +327,7 @@ void HostIrEvaluator::handle(Synchronize* synchronize) {
void HostIrEvaluator::handle(LaunchKernel* launch_kernel) {
KernelArgumentHolder args;
for (auto& input : launch_kernel->inputs()) {
NVF_ERROR(
expr_evaluator_.isKnown(input),
"No buffer associated with Val ",
input,
" for handling ",
launch_kernel->toString());
args.push(expr_evaluator_.evaluate(input));
args.push(getKnownConcreteValue(input));
}
args.setDeviceIndex();

@@ -363,25 +342,35 @@ void HostIrEvaluator::handle(LaunchKernel* launch_kernel) {

// Store the outputs in the context
for (auto output_idx : arange(outputs.size())) {
expr_evaluator_.bind(
launch_kernel->outputs().at(output_idx), outputs[output_idx]);
bind(launch_kernel->outputs().at(output_idx), outputs[output_idx]);
}
}

void HostIrEvaluator::handle(PostOnStream* post_ir) {
KernelArgumentHolder input_args;
for (auto& input : post_ir->inputs()) {
NVF_ERROR(
expr_evaluator_.isKnown(input),
"No buffer associated with Val ",
input,
" for handling ",
post_ir->toString());
input_args.push(expr_evaluator_.evaluate(input));
input_args.push(getKnownConcreteValue(input));
}
input_args.setDeviceIndex();
// placeholder for storing the outputs
KernelArgumentHolder outputs;
bool use_preallocated_outputs = std::all_of(
post_ir->outputs().begin(),
post_ir->outputs().end(),
[this](Val* output) { return this->isKnown(output); });
NVF_ERROR(
use_preallocated_outputs ||
std::all_of(
post_ir->outputs().begin(),
post_ir->outputs().end(),
[this](Val* output) { return !this->isKnown(output); }),
"outputs must be all or none preallocated in expr ",
post_ir);
if (use_preallocated_outputs) {
for (auto output : post_ir->outputs()) {
outputs.push(getKnownConcreteValue(output));
}
}

NVF_ERROR(
post_ir->hostOpToPost()->isA<HostUnit>(),
Expand All @@ -398,16 +387,23 @@ void HostIrEvaluator::handle(PostOnStream* post_ir) {
/*fusion_id=*/0,
!params_.skip_auto_scheduling);
}
outputs = fec_.at(hu).runFusionWithInputs(input_args);
if (use_preallocated_outputs) {
TORCH_WARN(
"FusionExecutorCache does not support preallocated outputs, so we are copying the outputs in expr ",
post_ir);
auto tmp_outputs = fec_.at(hu).runFusionWithInputs(input_args);
for (auto output_idx : c10::irange(tmp_outputs.size())) {
outputs[output_idx].as<at::Tensor>().copy_(
tmp_outputs[output_idx].as<at::Tensor>());
}
} else {
outputs = fec_.at(hu).runFusionWithInputs(input_args);
}
} else {
// This path should generally be avoided as it will likely send the fusion
// held in HostUnit directly to KernelExecutor which means it will try to
// compile and run a device kernel with a single thread.
if (auto it = executors_.find(hu); it != executors_.end()) {
ExecutorAbstract* ea = it->second.get();
outputs = ExecutorDispatch::run(ea, input_args);

} else {
if (auto it = executors_.find(hu); it == executors_.end()) {
DynamicTransform::concretizeFusion(hu->fusion_to_execute(), input_args);
auto it2 = executors_.insert(
{hu,
Expand All @@ -424,14 +420,20 @@ void HostIrEvaluator::handle(PostOnStream* post_ir) {
} else {
ExecutorDispatch::compile(ea, hu->fusion_to_execute());
}
}
ExecutorAbstract* ea = executors_[hu].get();
if (use_preallocated_outputs) {
ExecutorDispatch::run(ea, input_args, outputs);
} else {
outputs = ExecutorDispatch::run(ea, input_args);
}
}

// Store the outputs in the context
for (auto output_idx : arange(outputs.size())) {
expr_evaluator_.bind(
post_ir->outputs().at(output_idx), outputs[output_idx]);
if (!use_preallocated_outputs) {
// Store the outputs in the context
for (auto output_idx : arange(outputs.size())) {
bind(post_ir->outputs().at(output_idx), outputs[output_idx]);
}
}
}

@@ -444,10 +446,9 @@ void HostIrEvaluator::handle(Communication* communication) {
communicator_ != nullptr && communicator_->is_available(),
"A valid communicator must be provided");

at::Tensor input_tensor =
getKnownTensorOrUndefined(communication->input(0), expr_evaluator_);
at::Tensor input_tensor = getKnownTensorOrUndefined(communication->input(0));
at::Tensor output_tensor =
getKnownTensorOrUndefined(communication->output(0), expr_evaluator_);
getKnownTensorOrUndefined(communication->output(0));

CommunicatorBackend backend_type = communication->backend();
c10d::Backend* backend =
@@ -471,10 +472,9 @@ void HostIrEvaluator::handle(P2PCommunication* communication) {
communicator_ != nullptr && communicator_->is_available(),
"A valid communicator must be provided");

CommunicatorBackend backend_type = communication->backend();
at::Tensor buffer =
getKnownTensorOrUndefined(communication->buffer(), expr_evaluator_);
at::Tensor buffer = getKnownTensorOrUndefined(communication->buffer());

CommunicatorBackend backend_type = communication->backend();
if (backend_type == CommunicatorBackend::kCuda) {
const P2pIpcHandle& p2p_ipc_handle = ipc_handle_cache_.get(communication);
const auto current_stream = static_cast<CUstream>(
@@ -556,11 +556,11 @@ void HostIrEvaluator::handle(ForLoop* for_loop) {

for (auto i = start; i < stop; i += step) {
// invalidate i and its consumers before binding
expr_evaluator_.invalidate(for_loop->index());
invalidate(for_loop->index());
for (auto consumer : allConsumerValsOf(for_loop->index())) {
expr_evaluator_.invalidate(consumer);
invalidate(consumer);
}
expr_evaluator_.bind(for_loop->index(), i);
bind(for_loop->index(), i);
for (Expr* expr : for_loop->body().exprs()) {
dispatch(expr);
}
@@ -597,15 +597,11 @@ void HostIrEvaluator::handle(MatmulOp* matmul) {
TensorView* a = matmul->inA();
TensorView* b = matmul->inB();
TensorView* out = matmul->out();
NVF_ERROR(
expr_evaluator_.isKnown(a) && expr_evaluator_.isKnown(b),
"Inputs of the matmul ",
matmul->toString(),
"must be precomputed before being retrieved");
if (expr_evaluator_.isKnown(out)) {
auto t_a = expr_evaluator_.evaluate(a).as<at::Tensor>();
auto t_b = expr_evaluator_.evaluate(b).as<at::Tensor>();
auto t_out = expr_evaluator_.evaluate(out).as<at::Tensor>();

if (isKnown(out)) {
auto t_a = getKnownConcreteValue(a).as<at::Tensor>();
auto t_b = getKnownConcreteValue(b).as<at::Tensor>();
auto t_out = getKnownConcreteValue(out).as<at::Tensor>();
at::matmul_out(t_out, t_a, t_b);
} else {
unhandled(matmul);
@@ -617,24 +613,18 @@ void HostIrEvaluator::handle(LinearOp* linear) {
TensorView* weight = linear->inB()->as<TensorView>();
TensorView* bias = linear->bias()->as<TensorView>();
TensorView* out = linear->out()->as<TensorView>();
NVF_ERROR(
expr_evaluator_.isKnown(in) && expr_evaluator_.isKnown(weight) &&
(!linear->has_bias() || expr_evaluator_.isKnown(bias)),
"Inputs of the Linear Op ",
linear->toString(),
"must be precomputed before being retrieved");

if (!expr_evaluator_.isKnown(out)) {
if (!isKnown(out)) {
unhandled(linear);
return;
}

auto in_at = expr_evaluator_.evaluate(in).as<at::Tensor>();
auto weight_at = expr_evaluator_.evaluate(weight).as<at::Tensor>();
auto out_at = expr_evaluator_.evaluate(out).as<at::Tensor>();
auto in_at = getKnownConcreteValue(in).as<at::Tensor>();
auto weight_at = getKnownConcreteValue(weight).as<at::Tensor>();
auto out_at = getKnownConcreteValue(out).as<at::Tensor>();

if (linear->has_bias()) {
auto bias_at = expr_evaluator_.evaluate(bias).as<at::Tensor>();
auto bias_at = getKnownConcreteValue(bias).as<at::Tensor>();
at::linear_out(out_at, in_at, weight_at.squeeze(), bias_at.squeeze());
} else {
at::linear_out(out_at, in_at, weight_at.squeeze());
@@ -661,25 +651,37 @@ void HostIrEvaluator::handle(kir::Allocate* allocate) {
c10::nullopt,
device,
c10::nullopt);

expr_evaluator_.bind(tv, tensor);
bind(tv, tensor);
}

void HostIrEvaluator::unhandled(Statement* stmt) {
NVF_ERROR(stmt->isA<Expr>(), stmt, " must be an Expr");
auto* expr = stmt->as<Expr>();
for (auto input : ir_utils::filterByType<TensorView>(expr->inputs())) {
NVF_ERROR(
expr_evaluator_.isKnown(input),
"input ",
input->toString(),
" of the expression ",
expr->toString(),
"must be precomputed before being retrieved");
}
for (auto output : expr->outputs()) {
expr_evaluator_.bind(
output, expr_evaluator_.evaluate(output), /*evaluate_validate=*/true);
std::vector<PolymorphicValue> inputs;
for (auto input : expr->inputs()) {
if (input->isA<TensorView>()) {
// Tensor inputs must be already computed at this point
inputs.push_back(getKnownConcreteValue(input));
} else {
inputs.push_back(expr_evaluator_.evaluate(input));
}
}

// Check that there is no pre-allocated output
NVF_ERROR(
std::all_of(
expr->outputs().begin(),
expr->outputs().end(),
[this](Val* output) {
return !this->expr_evaluator_.isKnown(output);
}),
"Do not support pre-allocated outputs for the op ",
expr);
// using ExpressionEvaluator::evaluate to evaluate the output is not valid
// here if the output or one of its producer is an alias
auto concrete_outputs = expr->evaluate(expr_evaluator_, inputs);
for (int64_t i : c10::irange(expr->outputs().size())) {
bind(expr->output(i), concrete_outputs.at(i));
}
}
