Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@TaoLv
Copy link
Member

@TaoLv TaoLv commented Jul 28, 2020

Description

(Brief description on what this PR is about)

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot
Copy link

Hey @TaoLv , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, sanity, unix-gpu, unix-cpu, website, windows-cpu, centos-gpu, clang, edge, miscellaneous, windows-gpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@TaoLv
Copy link
Member Author

TaoLv commented Aug 1, 2020

The new library doesn't compile with GCC 7.3.1 used in the CI of master branch. But it does compile in PR #18822 to v1.x branch where the CI uses GCC 8. @ChaiBapchya @leezu

@ChaiBapchya
Copy link
Contributor

@leezu any guidance on unblocking the centos pipelines which fail coz of the GCC version
Curious why

  1. master centos has older GCC than v1.x?
  2. GCC version mismatch b/w centos vs unix pipelines? Is it intentional

@leezu
Copy link
Contributor

leezu commented Aug 4, 2020

1.x branch tests gcc 8 as "some newer supported gcc version", whereas master branch tests gcc 7 as "oldest supported gcc version". As long as we support gcc7, we should ensure that we can compile with gcc7

@leezu
Copy link
Contributor

leezu commented Aug 4, 2020

@mxnet-bot run ci [centos-cpu, centos-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [centos-cpu, centos-gpu]

Copy link
Contributor

@pengzhao-intel pengzhao-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pengzhao-intel
Copy link
Contributor

1.x branch tests gcc 8 as "some newer supported gcc version", whereas master branch tests gcc 7 as "oldest supported gcc version". As long as we support gcc7, we should ensure that we can compile with gcc7

@bgawrych could you try gcc7?

@TaoLv
Copy link
Member Author

TaoLv commented Aug 5, 2020

The error log is:

[2020-08-04T20:48:43.355Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:111:13:   required from 'void dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]'

[2020-08-04T20:48:43.355Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:192:17:   required from here

[2020-08-04T20:48:43.355Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:151:28: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1705
[2020-08-04T20:48:43.355Z]                      size_t dst_offset = (is_1d) ? dst_d.blk_off(mb, oc, ow)
[2020-08-04T20:48:43.355Z]                             ^~~~~~~~~~
[2020-08-04T20:48:43.355Z] Please submit a full bug report,
[2020-08-04T20:48:43.355Z] with preprocessed source if appropriate.
[2020-08-04T20:48:43.355Z] See <http://bugzilla.redhat.com/bugzilla> for instructions.
[2020-08-04T20:48:43.355Z] Preprocessed source stored into /tmp/ccn2o87g.out file, please attach this to your bugreport.

It's a known issue in GCC 7: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83204 and was addressed in GCC 8.

@bgawrych
Copy link
Contributor

bgawrych commented Aug 5, 2020

@pengzhao-intel

1.x branch tests gcc 8 as "some newer supported gcc version", whereas master branch tests gcc 7 as "oldest supported gcc version". As long as we support gcc7, we should ensure that we can compile with gcc7

@bgawrych could you try gcc7?

On Ubuntu 18.04.4 LTS with gcc 7.5 MXNet is succesfully build.
CI is using: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5))
I've tried to install gcc 7.3 on my ubuntu machine, but couldn't done that (there is no such version in apt repository). Built from sources gcc doesn't work with cmake (couldn't compile simple test program)

@leezu
Copy link
Contributor

leezu commented Aug 5, 2020

@bgawrych Ubuntu bumped GCC 7 version to 7.4 in April 2019. You can still install the older version from the archives to verify, as the older files are still present in the apt repository (see http://security.ubuntu.com/ubuntu/pool/main/g/gcc-defaults/). Just run apt or aptitude and install the older version.

Are you sure this is related to bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83204? That bug is, according to the bug tracker, only fixed in GCC 8 but not in GCC 7. It's not listed as part of the fixes for 7.4 and 7.5 (https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&list_id=276490&resolution=FIXED&target_milestone=7.4 and https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&list_id=276490&resolution=FIXED&target_milestone=7.5)

Did your "On Ubuntu 18.04.4 LTS with gcc 7.5 MXNet is succesfully build." run with the master branch or the 1.x branch?

In any case, CentOS CI uses the GCC 7 from https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/ We can report a bug and ask them to update to GCC 7.5 instead of 7.3. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1866477

@bgawrych
Copy link
Contributor

bgawrych commented Aug 6, 2020

@TaoLv @leezu
I've built GCC from sources:
GCC 7.3.0 shows following error in multiple files:

[55/782] Building CXX object 3rdparty/mkldnn/src/cpu/CMakeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o
FAILED: 3rdparty/mkldnn/src/cpu/CMakeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o 
/home/wihajster/Desktop/install_gcc/bin/g++  -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX14 -DDNNL_ENABLE_CONCURRENT_EXEC -DDNNL_ENABLE_MAX_CPU_ISA -DDNNL_X64=1 -DMSHADOW_IN_CXX11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I../3rdparty/mkldnn/include -I3rdparty/mkldnn/include -I../3rdparty/mkldnn/src -D_GLIBCXX_ASSERTIONS  -Wall -Wno-sign-compare -O0 -g -std=c++11 -fopenmp -fvisibility-inlines-hidden  -Wall -Wno-unknown-pragmas -fvisibility=internal  -fPIC -Wformat -Wformat-security -fstack-protector-strong  -Wmissing-field-initializers  -Wno-strict-overflow  -g -fPIC   -std=gnu++1z -MD -MT 3rdparty/mkldnn/src/cpu/CMakeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o -MF 3rdparty/mkldnn/src/cpu/CMakeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o.d -o 3rdparty/mkldnn/src/cpu/CMakeFiles/dnnl_cpu.dir/cpu_reorder.cpp.o -c ../3rdparty/mkldnn/src/cpu/cpu_reorder.cpp
In file included from ../3rdparty/mkldnn/src/cpu/cpu_reorder.cpp:27:0:
../3rdparty/mkldnn/src/cpu/rnn/rnn_reorders.hpp: In lambda function:
../3rdparty/mkldnn/src/cpu/rnn/rnn_reorders.hpp:124:9: sorry, unimplemented: unexpected AST of kind omp_simd
         });
         ^

GCC 7.4.0 and 7.5.0 (from sources) shows:

[229/782] Building CXX object 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o
FAILED: 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o 
/home/wihajster/Desktop/install_gcc/bin/g++  -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX14 -DDNNL_ENABLE_CONCURRENT_EXEC -DDNNL_ENABLE_JIT_PROFILING=0 -DDNNL_ENABLE_MAX_CPU_ISA -DDNNL_X64=1 -DMSHADOW_IN_CXX11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I../3rdparty/mkldnn/include -I3rdparty/mkldnn/include -I../3rdparty/mkldnn/src -D_GLIBCXX_ASSERTIONS  -Wall -Wno-sign-compare -O0 -g -std=c++11 -fopenmp -fvisibility-inlines-hidden  -Wall -Wno-unknown-pragmas -fvisibility=internal  -fPIC -Wformat -Wformat-security -fstack-protector-strong  -Wmissing-field-initializers  -Wno-strict-overflow  -g -fPIC   -std=gnu++1z -MD -MT 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o -MF 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o.d -o 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o -c ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp
../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp: In instantiation of 'dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const::<lambda(int, int)> [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]':
../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:159:29:   required from 'struct dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]::<lambda(int, int)>'
../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:111:13:   required from 'void dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]'
../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:192:17:   required from here
../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:151:28: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1739
                     size_t dst_offset = (is_1d) ? dst_d.blk_off(mb, oc, ow)
                            ^~~~~~~~~~

But as I've mentioned earlier - GCC 7.5.0 provided by ubuntu doesn't trigger this error.
GCC 8.1.0 built from sources also compile sucessfully.

I've checked also if standalone oneDNN is compiling with GCC 7.4.0 and it is.
This compiler error is probably triggered when oneDNN is compiled with c++14 standard. With c++11 everything works fine

@TaoLv
Copy link
Member Author

TaoLv commented Aug 6, 2020

MXNet is building with C++ 17 standard: https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L10.

@TaoLv
Copy link
Member Author

TaoLv commented Aug 6, 2020

@vpirogov Could you please share more information about the compiler issue?

@vpirogov
Copy link

vpirogov commented Aug 6, 2020

@TaoLv, I do not have anything to add to information in GCC Bugzilla.

@leezu
Copy link
Contributor

leezu commented Aug 6, 2020

But as I've mentioned earlier - GCC 7.5.0 provided by ubuntu doesn't trigger this error.

Thank you for confirming this @bgawrych. Actually Ubuntu backported the fix to their GCC 7 version:

gcc-7 (7.3.0-3ubuntu1) bionic; urgency=medium

  * Update to SVN 20180209 (r257526) from the gcc-7-branch.
    - S/390: Disable branch prediction for indirect branches.
    - Fix PR tree-optimization/84233 (closes: #889724),
      PR target/PR84295 (s390), PR target/84113 (PPC).
  * Fix PR c++/83204, taken from the trunk. Closes: #882855.

One option is to require GCC8 on non-Ubuntu machines. WDYT? For the CI, we just need to replace all occurrences of devtoolset-7 with devtoolset-8 in https://github.com/apache/incubator-mxnet/blob/95fa63f765f33b13c29c7b1c0822d2502eb16721/ci/docker/runtime_functions.sh and https://github.com/apache/incubator-mxnet/blob/a3eabf0d29eaeb03766dd52ef403650a4804f369/ci/docker/Dockerfile.build.centos7

But before that, @bgawrych @vpirogov could you please try adding the target_compile_features(dnnl PUBLIC cxx_std_11) line here:

https://github.com/apache/incubator-mxnet/blob/a3eabf0d29eaeb03766dd52ef403650a4804f369/CMakeLists.txt#L278-L279

Thereby we use CXX11 standard to compile DNNL, which should avoid the bug in gcc7.

@TaoLv TaoLv requested review from leezu and szha as code owners August 7, 2020 02:23
@TaoLv
Copy link
Member Author

TaoLv commented Aug 7, 2020

Not sure why it has both -std=c++11 and -std=gnu++1z in the build line.

[2020-08-07T02:34:38.577Z] /usr/local/bin/ccache /opt/rh/devtoolset-7/root/usr/bin/g++  -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX14 -DDNNL_ENABLE_CONCURRENT_EXEC -DDNNL_ENABLE_JIT_PROFILING=0 -DDNNL_ENABLE_MAX_CPU_ISA -DDNNL_X64=1 -DMSHADOW_IN_CXX11 -DNDEBUG=1 -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I../3rdparty/mkldnn/include -I3rdparty/mkldnn/include -I../3rdparty/mkldnn/src -fPIC -mno-avx -fPIC -mno-avx -Wall -Wno-sign-compare -O3 -std=c++11 -fopenmp -fvisibility-inlines-hidden  -Wall -Wno-unknown-pragmas -fvisibility=internal  -fPIC -Wformat -Wformat-security -fstack-protector-strong  -Wmissing-field-initializers  -Wno-strict-overflow  -fPIC   -std=gnu++1z -MD -MT 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o -MF 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o.d -o 3rdparty/mkldnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_amx_1x1_convolution.cpp.o -c ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp
[2020-08-07T02:34:38.577Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp: In instantiation of 'dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const::<lambda(int, int)> [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]':
[2020-08-07T02:34:38.577Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:159:29:   required from 'struct dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]::<lambda(int, int)>'
[2020-08-07T02:34:38.577Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:111:13:   required from 'void dnnl::impl::cpu::x64::jit_avx512_core_amx_1x1_convolution_fwd_t<src_type, wei_type, dst_type>::execute_forward(const dnnl::impl::exec_ctx_t&) const [with dnnl_data_type_t src_type = (dnnl_data_type_t)5; dnnl_data_type_t wei_type = (dnnl_data_type_t)5; dnnl_data_type_t dst_type = (dnnl_data_type_t)6]'
[2020-08-07T02:34:38.577Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:192:17:   required from here
[2020-08-07T02:34:38.577Z] ../3rdparty/mkldnn/src/cpu/x64/jit_avx512_core_amx_1x1_convolution.cpp:151:28: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1705
[2020-08-07T02:34:38.577Z]                      size_t dst_offset = (is_1d) ? dst_d.blk_off(mb, oc, ow)
[2020-08-07T02:34:38.577Z]                             ^~~~~~~~~~
[2020-08-07T02:34:38.577Z] Please submit a full bug report,
[2020-08-07T02:34:38.577Z] with preprocessed source if appropriate.
[2020-08-07T02:34:38.577Z] See <http://bugzilla.redhat.com/bugzilla> for instructions.

@leezu
Copy link
Contributor

leezu commented Aug 7, 2020

One idea is to remove the global C++17 standard setting and instead set the C++17 only on the mxnet (and related, eg unittest) target. You may need to try a bit locally.

@kpuatamazon
Copy link
Contributor

From the release notes of https://github.com/oneapi-src/oneDNN/releases/tag/v1.6.1 it looks like v1.6.1 is mostly bugfixes of v.1.6. Would you mind bumping to v1.6.1?

@pengzhao-intel
Copy link
Contributor

From the release notes of https://github.com/oneapi-src/oneDNN/releases/tag/v1.6.1 it looks like v1.6.1 is mostly bugfixes of v.1.6. Would you mind bumping to v1.6.1?

Yes, we need to change to 1.6.1 @TaoLv

@pengzhao-intel
Copy link
Contributor

@anko-intel please create a new PR to update oneDNN. I am closing this one.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants