Skip to content

Conversation

@tqchen
Copy link
Member

@tqchen tqchen commented Mar 22, 2024

Reverts #16569

@slyubomirsky
Copy link
Contributor

Could we perhaps just revert the driver changes if that's what's causing the MLC failure? They are unrelated to the main functionality of the PR and I would rather have the checking for future tests

Copy link
Contributor

@Lunderberg Lunderberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tqchen Can this PR also add a test case for the behavior that should be preserved? From your comment here, MLC is currently relying on untested functionality in TVM, and I don't think we should revert based solely on untested functionality. Doing so would mean that there would neither be a path forward for re-applying the breaking change, nor any protection for the same breakage occurring in other PRs

I think it makes sense to revert commits that break important workflows, so long as there's a reproducible example of the failure. Otherwise, the responsibility for reproducing that failure mode falls on a developer who wasn't actually seeing the failure.

Edit: Not requesting a perfect unit test that exercises the specific bug, just a high-level test that would trigger the bug in CI.

@tqchen
Copy link
Member Author

tqchen commented Mar 22, 2024

@slyubomirsky reverting the driver change also works, just wonder if that will break some of the existing tests

@slyubomirsky
Copy link
Contributor

The test should have already been broken, oddly! The error was not triggered by anything I added.

@tqchen
Copy link
Member Author

tqchen commented Mar 22, 2024

agree that we should have a carved out testcases here, will followup in comment soon, as well as in the original thread

I believe it was along the original lines of comment in the changed order, which i assume we say that we cannot merge dyn smem beyond the kernel boundary,

  // MergeSharedMemoryAllocations must be applied after SplitHostDevice
  // because the merged allocation site is at the beginning of each device function
  mixed_pass_list.push_back(tir::transform::MergeSharedMemoryAllocations());

@slyubomirsky
Copy link
Contributor

slyubomirsky commented Mar 22, 2024

tests/python/tir-transform/test_tir_transform_inject_ptx_async_copy.py::test_vectorize_cp_async_in_if_then_else was the test that prompted me to change the ordering and it had been failing for several weeks before my PR, oddly (according to a git bisect). For some reason it was not triggering in CI though.

@tqchen
Copy link
Member Author

tqchen commented Mar 22, 2024

OK, given that case we should perhaps first revert the driver order only and disable(then look into the failed tests).

We will still get a UT for the ordering case, which I assume would might relates to getting that dlight function compiled, and then reduce to a smaller case

int dev_type = target->GetTargetDeviceType();
if (!(dev_type == kDLCUDA || dev_type == kDLMetal || dev_type == kDLROCM ||
dev_type == kDLWebGPU)) {
if (dev_type != kDLCUDA) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was causing a test failure. I will find the link to the discussion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -535 to -540
if (op->thread_binding.defined()) {
auto old_thread_binding = op->thread_binding.value();
auto* ptr = old_thread_binding.CopyOnWrite();
ptr->var = old_thread_binding->var.copy_with_dtype(new_loop_var.dtype());
n->thread_binding = std::move(Optional<IterVar>(std::move(old_thread_binding)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to address #16634 (comment)

@slyubomirsky
Copy link
Contributor

#16770 Smaller-scale reversion

@Lunderberg
Copy link
Contributor

Thank you, @tqchen , for the follow-up, and for helping to make the more targeted reversion!

@tqchen tqchen closed this Mar 22, 2024
@tqchen
Copy link
Member Author

tqchen commented Mar 22, 2024

closing in favor of #16770. Thanks @slyubomirsky @Lunderberg

@tqchen tqchen deleted the revert-16569-parser-check-well-formed branch March 29, 2024 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants