Deadlock with ThreadedEngine #18090
Description
There is currently some odd behaviour with unix-gpu CI jobs: the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs.
Until today, I thought this was caused by a limited number of available GPU executors, with jobs being aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).
However, I've noticed a few consistent oddities about these aborted jobs, so I want to make sure the current behaviour is intentional.
- Of the cases I've seen `unix-gpu` getting aborted, it is almost always a situation where all the build steps and all the other tests have completed, but a single `Python 3: GPU` or `Python 3: GPU (TVM_OP OFF)` test step was aborted.
- Normally, these steps seem to take around 1 hour to complete, but in the aborted cases the abort happened only after 3 hours. Additionally, there is a strange jump in the logs between the last log message from the test and the first message about shutting down due to the interrupt signal. These are consecutive log messages, but you can see a huge time skip between
Details
```
[2020-04-16T14:57:56.542Z] test_operator_gpu.test_np_diag ... ok (2.9642s)
[2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
[2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
[2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
[2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
[2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
[2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
[2020-04-16T17:33:48.723Z] Sending interrupt signal to process
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
[2020-04-16T17:33:57.549Z] script returned exit code 1
```
`test_operator_gpu.test_np_empty ... ok` at 14:58 and `Sending interrupt signal to process` at 17:33.
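For clarity, here is a tiny, purely illustrative script that computes the size of that gap from the Jenkins timestamps in the log above:

```python
from datetime import datetime

# Jenkins prefixes each line with an ISO 8601 UTC timestamp; parse the last
# test result line and the interrupt line and compute the silent gap.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

last_test = datetime.strptime("2020-04-16T14:58:17.295+0000", FMT)
interrupt = datetime.strptime("2020-04-16T17:33:48.723+0000", FMT)

print(interrupt - last_test)  # 2:35:31.428000 -> ~2.5 hours with no output at all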
Occurrences
Here are two examples:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline
and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline
I am aware that we can just restart the job via mxnet-bot, but this is annoying since the job takes a long time to complete even without this issue. Can somebody clarify:
- whether `unix-gpu` CI jobs getting aborted is intentional (and what the current policy on aborting CI jobs is), and
- if it is intentional, whether there is something we can do to at the very least abort the tests faster, or maybe not even fail these jobs but automatically reschedule them (preferably rescheduling just the aborted step, not the whole pipeline)? One possible way to abort a hung test faster is sketched below.
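On aborting the tests faster: as a purely illustrative sketch (not existing CI tooling), a per-test watchdog could turn a silent multi-hour hang into a quick failure. This assumes a Unix worker and that the hang still lets the interpreter run signal handlers (i.e. the blocked call releases the GIL); the `fail_after` name and the 600-second limit are made up for illustration.

```python
import functools
import signal


def fail_after(seconds):
    """Illustrative per-test watchdog: raise instead of hanging forever."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")

            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)  # SIGALRM fires if the test runs too long
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the alarm and restore the old handler
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator


@fail_after(600)  # fail this one test after 10 minutes instead of stalling the whole job
def test_np_einsum():
    ...
```

Whether something like this would actually catch a deadlock inside the ThreadedEngine depends on where the hang happens, so it is only a starting point, not a proposal for the exact mechanism.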

