-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[BUGFIX][1.8.x] Temporary fix for RNN with oneDNN seg faults/core dumps #19308
Conversation
|
Hey @bgawrych , Thanks for submitting the PR
CI supported jobs: [centos-gpu, centos-cpu, windows-cpu, sanity, website, windows-gpu, clang, unix-gpu, unix-cpu, miscellaneous, edge] Note: |
|
@mxnet-bot run ci [windows-cpu, unix-cpu] |
|
Jenkins CI successfully triggered : [windows-cpu, unix-cpu] |
|
@mxnet-bot run ci [windows-cpu, unix-gpu] |
|
Jenkins CI successfully triggered : [windows-cpu, unix-gpu] |
|
@mxnet-bot run ci [unix-gpu] |
|
Jenkins CI successfully triggered : [unix-gpu] |
|
Hey @szha @samskalicky Do you know what's wrong with unix-gpu CI on 1.8 branch ? |
|
@mxnet-bot run ci [unix-gpu] |
|
Jenkins CI successfully triggered : [unix-gpu] |
|
LGTM |
|
@ciyongch @zixuanweeei please help take a review. |
|
@anko-intel what's the timeline of 1.8? Is it possible to release a new patch version for 1.7? |
According to https://cwiki.apache.org/confluence/display/MXNET/1.8.0+Release+Plan+and+Status November the 1st will be the date of the official release. I think we are unable to prepare 1.7 patch version earlier. |
src/operator/nn/mkldnn/mkldnn_rnn.cc
Outdated
| } | ||
| } | ||
|
|
||
| // Bug in oneDNN >= 1.6 in memory descriptor comparision operators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? oneDNN <= 1.6 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, done
src/operator/nn/mkldnn/mkldnn_rnn.cc
Outdated
| // for specific dims and strides in descriptors == operator can return `true` | ||
| // but get_size() function will return different size | ||
| // TODO(bgawrych): Remove with oneDNN 1.7 upgrade | ||
| bool CheckMemDescEquality(const mkldnn::memory::desc &left, const mkldnn::memory::desc &right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to make this function as static inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done :)
|
@ciyongch @pengzhao-intel @samskalicky Can we merge this change? |
pengzhao-intel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and merging now.
|
Please file an issue to track the oneDNN upgrade and remove the temp solution. @bgawrych |
|
Thanks @bgawrych, can you also backport this PR to the v1.x branch so it stays in sync? |
Description
This fix is workaround for problem with oneDNN memory descriptors comparison (size calculations) which causes segmentation faults in operator destructor - reported issue #19022.
Fix for this issue will be delivered with oneDNN 1.7 - however we would like to fix this issue in 1.8 version as well
Checklist
Essentials
Changes