This repository was archived by the owner on Nov 17, 2023. It is now read-only.

segfault on mx1.8-cu110 with python3.7 #19556

@waytrue17

Description

Running the MXNet Horovod example incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py on MXNet 1.8 with CUDA 11.0 and Python 3.7 ends in a segmentation fault. The segfault occurs after the example script has finished.
The same script runs without error on MXNet 1.8 with CUDA 10.2 and Python 3.7, and on MXNet 1.8 with CUDA 11.0 and Python 3.6.

To Reproduce

Steps to reproduce

  1. Launch an EC2 p3.8xlarge GPU instance with the DLAMI ami-02440419a5afe47ab
  2. Build mx1.8-cu110 from source
  3. Install Horovod: python3 -m pip install horovod
  4. Run LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64:$LD_LIBRARY_PATH python3 incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py to reproduce the error
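
Because the crash happens only after the script has finished, it can be easy to miss in the training output. A minimal sketch of a wrapper that runs a command and reports whether it was killed by a signal, using only the Python standard library. The helper names describe_exit and run_and_report are hypothetical and not part of MXNet or Horovod; the sketch assumes a POSIX system, where subprocess reports a signal-terminated child as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Describe a subprocess return code in words (hypothetical helper)."""
    if returncode < 0:
        # On POSIX, a negative return code -N means the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by {} ({})".format(name, -returncode)
    return "exited with status {}".format(returncode)


def run_and_report(cmd):
    """Run cmd with inherited environment and output, then describe how it ended."""
    completed = subprocess.run(cmd)
    return completed.returncode, describe_exit(completed.returncode)
```

With LD_LIBRARY_PATH exported as in step 4 (the child process inherits the environment), calling run_and_report([sys.executable, "incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py"]) would be expected to report a SIGSEGV on the affected build and a normal exit on the working ones.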

What have you tried to solve it?

  1. Backporting "Remove cleanup on side threads" (#19378) to v1.8.x resolved the issue
