This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Deadlock with ThreadedEngine #18090

@ruro

Description

There is currently some odd behaviour with unix-gpu CI jobs: the build sometimes gets aborted and other times completes fine. I've seen this happen multiple times on different PRs.

Until today, I thought that this was caused by the limited number of available GPU executors, and that the jobs were being aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).

However, I've noticed a few consistent oddities about these aborted jobs, so I wanted to make sure that the current behaviour is intentional.

  1. Of the cases I've seen where unix-gpu was aborted, it is almost always a situation where all the build steps and all the other tests completed, but a single Python 3: GPU or Python 3: GPU (TVM_OP OFF) test step was aborted.
    (Screenshots omitted; the corresponding pipeline runs are linked under Occurrences below.)
  2. Normally, these steps take around 1 hour to complete, but in the cases where they were aborted, the abort came only after 3 hours. Additionally, there is a strange jump in the logs between the last log message from the test and the first message about shutting down due to the interrupt signal.
    Log excerpt:
    [2020-04-16T14:57:56.542Z] test_operator_gpu.test_np_diag ... ok (2.9642s)
    [2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
    [2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
    [2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
    [2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
    [2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
    [2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
    [2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
    [2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
    [2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
    [2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
    [2020-04-16T17:33:48.723Z] Sending interrupt signal to process
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
    [2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
    [2020-04-16T17:33:57.549Z] script returned exit code 1
    These are consecutive log messages, but there is a huge time skip between test_operator_gpu.test_np_empty ... ok at 14:58 and Sending interrupt signal to process at 17:33. In other words, the test process went completely silent for about two and a half hours until Jenkins sent SIGTERM (signal 15), which is exactly what a hung test run would look like; a stack-dump sketch for pinning down where such a hang sits follows right after this list.
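If the hang really is a deadlock involving the ThreadedEngine, the simplest way to make it visible in the Jenkins log would be a periodic stack dump from inside the test process. Below is a minimal sketch using only the standard-library faulthandler module; the wrapper name run_with_hang_dump and the 30-minute threshold are made up for illustration and are not part of the current test harness. Note that faulthandler only shows Python frames, so it would reveal which Python call (e.g. asnumpy() or mx.nd.waitall()) is blocked, not the native engine stack.

```python
# Hypothetical debugging aid, not part of the MXNet test suite.
import faulthandler
import sys

import mxnet as mx


def run_with_hang_dump(fn, hang_seconds=1800):
    """Run fn(); if the process is still busy after hang_seconds,
    dump every Python thread's stack to stderr (repeatedly), so a
    silent hang leaves a traceback in the CI log instead of nothing."""
    faulthandler.dump_traceback_later(hang_seconds, repeat=True, file=sys.stderr)
    try:
        fn()
        mx.nd.waitall()  # flush any work still queued on the engine
    finally:
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Trivial usage example; a real run would wrap the nose test entry point.
    run_with_hang_dump(lambda: mx.nd.ones((2, 2)).asnumpy())
```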

Occurrences

Here are the two examples from the screenshots mentioned above:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline
and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline

I am aware that we can just restart the job via mxnet-bot, but this is annoying, since the job takes a long time to complete even without this issue. Can somebody clarify:

  1. whether unix-gpu CI jobs getting aborted is intentional (and what the current policy on aborting CI jobs is, etc.), and
  2. if it is intentional, whether there is something we can do to at least abort the hung tests faster (a possible watchdog sketch follows this list), or to not fail these jobs at all but automatically reschedule them (preferably rescheduling just the aborted step, not the whole pipeline)?
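On the "abort faster" point: since the failure mode is a step that goes silent for hours, one low-tech option would be an output watchdog around the test command. This is only a sketch of the idea; run_with_idle_timeout, the 30-minute idle limit, and the nosetests invocation are illustrative assumptions, not current CI code.

```python
# Hypothetical output watchdog, not taken from the existing CI scripts.
import subprocess
import sys
import threading
import time


def run_with_idle_timeout(cmd, idle_limit=1800, poll=10):
    """Run cmd, mirroring its output; if it prints nothing for
    idle_limit seconds, assume it is hung and kill it."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.time()]

    def pump():
        # Ends automatically when the child closes its stdout (i.e. exits).
        for line in proc.stdout:
            sys.stdout.write(line)
            last_output[0] = time.time()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    while proc.poll() is None:
        if time.time() - last_output[0] > idle_limit:
            proc.kill()          # presumed deadlock: fail fast
            reader.join(timeout=5)
            return 124           # same convention as GNU timeout(1)
        time.sleep(poll)

    reader.join()
    return proc.returncode


if __name__ == "__main__":
    # Illustrative command only; the real step runs a much longer nose suite.
    sys.exit(run_with_idle_timeout(["nosetests", "--verbose",
                                    "tests/python/unittest"]))
```

With something like this, the hung step above would have failed roughly 30 minutes after test_np_empty instead of being killed by Jenkins at the 3-hour mark.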
