Skip to content

Enable the newly added OuterPersistent scheduler#916

Merged
liqiangxl merged 8 commits intomainfrom
llu/enable_outer_scheduler
Sep 22, 2023
Merged

Enable the newly added OuterPersistent scheduler#916
liqiangxl merged 8 commits intomainfrom
llu/enable_outer_scheduler

Conversation

@liqiangxl
Copy link
Collaborator

Enable the newly added OuterPersistent scheduler. It requires outer reduction tvs without inner reduction tvs.
Its priority is higher than the original Persistent scheduler which can process inner, outer, and inner_outer.

@liqiangxl
Copy link
Collaborator Author

!build

}

bool PersistentKernelScheduler::canScheduleCompileTime(Fusion* fusion) {
// This scheduler is being divided into three separate schedulers and should
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disable this deprecated scheduler.

@liqiangxl
Copy link
Collaborator Author

!build

@liqiangxl
Copy link
Collaborator Author

!build

@liqiangxl
Copy link
Collaborator Author

!build

@liqiangxl liqiangxl marked this pull request as ready for review September 21, 2023 21:13
@liqiangxl liqiangxl requested a review from naoyam September 21, 2023 21:13
@liqiangxl
Copy link
Collaborator Author

!build

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@naoyam
Copy link
Collaborator

naoyam commented Sep 22, 2023

The codegen diff tests seem to have failed due to a build error:

00:01:57 /usr/bin/ld: libnvfuser_codegen.so: undefined reference to `cuTensorMapEncodeTiled'

This seems to be related to the recent TMA PR by Xiang, but I don't know why the error only shows up now. Any idea? @zasdfgbnm, @xwang233

@zasdfgbnm
Copy link
Collaborator

@xwang233 Are we using ccache in our build, and would it help to clean all the cache? I used to use ccache and it used to cause weird error like this and that's why I stopped using it anymore. But this experience was 10+ years ago.

@xwang233
Copy link
Collaborator

xwang233 commented Sep 22, 2023

@zasdfgbnm , hey, I guess the cause of those build issues in codegen diff jobs is that, the base image we used to build your PR was from last night where the TMA PR was not merged to main yet. Any new PR today, after the TMA PR got merged, would be built on top of the "old base image", where the old CMake cache may cause weird conflicts with the new PR.

I suggest that we wait until tomorrow when the new base image, built from main, contains the TMA PR. Those cuTensor related dependency issue should be resolved by then. I'll take a look if CI jobs still have this issue tomorrow and I have some rough idea on where could go wrong.

Ccache is used in our test_container build but not in codegen_diff jobs or manylinux_wheel builds. 🙂

@zasdfgbnm
Copy link
Collaborator

@zasdfgbnm , hey, I guess the cause of those build issues in codegen diff jobs is that, the base image we used to build your PR was from last night where the TMA PR was not merged to main yet. Any new PR today, after the TMA PR got merged, would be built on top of the "old base image", where the old CMake cache may cause weird conflicts with the new PR.

I suggest that we wait until tomorrow when the new base image, built from main, contains the TMA PR. Those cuTensor related dependency issue should be resolved by then. I'll take a look if CI jobs still have this issue tomorrow and I have some rough idea on where could go wrong.

Ccache is used in our test_container build but not in codegen_diff jobs or manylinux_wheel builds. 🙂

Sounds like a valid theory. Thanks!

@liqiangxl liqiangxl merged commit b96d569 into main Sep 22, 2023
@liqiangxl liqiangxl deleted the llu/enable_outer_scheduler branch September 22, 2023 11:05
liqiangxl added a commit that referenced this pull request Sep 22, 2023
@liqiangxl liqiangxl restored the llu/enable_outer_scheduler branch September 22, 2023 13:26
@xwang233
Copy link
Collaborator

I think the missing dependency issue in compilation is due to too many concurrency, e.g., 255 jobs there, in the build. The concurrency algorithm (MAX_JOBS) might be fine for local build, but doesn't work that well in servers with large memory and multiple containers running simultaneously. I've reduced the building concurrency in CI codegen diff jobs. Hope this can fix the build issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants