Deadlock with ThreadedEngine #18090
Description
There is currently some odd behaviour with unix-gpu CI jobs: the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs.
Until today, I thought this was caused by a limited number of available GPU executors, with jobs being aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).
However, I've noticed a few consistent oddities about these aborted jobs, so I want to make sure the current behaviour is intentional.
- Of the cases I've seen `unix-gpu` getting aborted, it is almost always a situation where all the build steps and all the other tests have completed, but a single `Python 3: GPU` or `Python 3: GPU (TVM_OP OFF)` test step was aborted.
- Normally, these steps seem to take around 1 hour to complete, but in the aborted cases the abort happened only after 3 hours. Additionally, there is a strange jump in the logs between the last log message from the test and the first message about shutting down due to the interrupt signal. These are consecutive log messages, but you can see a huge time skip between
Details
```
[2020-04-16T14:57:56.542Z] test_operator_gpu.test_np_diag ... ok (2.9642s)
[2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
[2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
[2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
[2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
[2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
[2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
[2020-04-16T17:33:48.723Z] Sending interrupt signal to process
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
[2020-04-16T17:33:57.549Z] script returned exit code 1
```
`test_operator_gpu.test_np_empty ... ok` at 14:58 and `Sending interrupt signal to process` at 17:33.
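For clarity, here is a tiny, purely illustrative script that computes the size of that gap from the Jenkins timestamps in the log above:

```python
from datetime import datetime

# Jenkins prefixes each line with an ISO 8601 UTC timestamp; parse the last
# test result line and the interrupt line and compute the silent gap.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

last_test = datetime.strptime("2020-04-16T14:58:17.295+0000", FMT)
interrupt = datetime.strptime("2020-04-16T17:33:48.723+0000", FMT)

print(interrupt - last_test)  # 2:35:31.428000 -> ~2.5 hours with no output at all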
Occurrences
Here are two examples:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline
and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline
I am aware that we can just restart the job via mxnet-bot, but this is annoying since the job takes a long time to complete even without this issue. Can somebody clarify:
- whether `unix-gpu` CI jobs getting aborted is intentional (and what the current policy on aborting CI jobs is), and
- if it is intentional, whether there is something we can do to at the very least abort the tests faster, or maybe not even fail these jobs but automatically reschedule them (preferably rescheduling just the aborted step, not the whole pipeline)? One possible way to abort a hung test faster is sketched below.
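On aborting the tests faster: as a purely illustrative sketch (not existing CI tooling), a per-test watchdog could turn a silent multi-hour hang into a quick failure. This assumes a Unix worker and that the hang still lets the interpreter run signal handlers (i.e. the blocked call releases the GIL); the `fail_after` name and the 600-second limit are made up for illustration.

```python
import functools
import signal


def fail_after(seconds):
    """Illustrative per-test watchdog: raise instead of hanging forever."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")

            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)  # SIGALRM fires if the test runs too long
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the alarm and restore the old handler
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator


@fail_after(600)  # fail this one test after 10 minutes instead of stalling the whole job
def test_np_einsum():
    ...
```

Whether something like this would actually catch a deadlock inside the ThreadedEngine depends on where the hang happens, so it is only a starting point, not a proposal for the exact mechanism.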

