[RFC] Moving MXNet-AMP to core #18896
MXNet already has experimental AMP (Automatic Mixed Precision) support, exposed in the mxnet.contrib package. It automatically casts models to float16 or bfloat16. This RFC covers moving it into core and making it a first-class feature, as well as further development.
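For context, typical usage of the existing contrib API looks roughly like the sketch below (float16 training with dynamic loss scaling). The model, optimizer settings, and dummy data are placeholders, and a GPU is assumed:

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

amp.init(target_dtype='float16')      # patch ops before the model is built

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v1()
net.initialize(ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
amp.init_trainer(trainer)             # enable dynamic loss scaling for this trainer

data = mx.nd.random.uniform(shape=(8, 3, 224, 224), ctx=ctx)
label = mx.nd.zeros((8,), ctx=ctx)

with autograd.record():
    loss = loss_fn(net(data), label)
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(8)                       # gradient unscaling is handled internally
```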
Here's a rough task breakdown for the initial move:
- Need to ensure AMP works with numpy ops, i.e. every op appears in one of the lists - done in AMP support for Numpy ops #19036
- API change: make loss scale public (Make loss scale public in AMP #17507) - done in AMP support for Numpy ops #19036
- Transparent / lazy AMP initialization? (Got "kFlag == type_flag_: TBlob.get_with_shape: data type do not match specified type. Expected: 0 v.s. given 2" when training with amp. #18902 (comment)) - a warning is now emitted when amp.init() is called and a model already exists, added in AMP support for Numpy ops #19036
- A number of issues have to be resolved to improve the user experience:
  - Cannot load trainer with AMP (Cannot load trainer with AMP #16858) - fixed in Get rid of monkey patching in LossScaler overflow handling #18959
  - CUDA crash (IMA) in amp_multicast that happens on some models (Yolo3) - fixed in Fix possible IMA in amp_multicast fusion #19318
  - AMP not reusing weights on recursive networks (AMP not reusing weights on recursive networks #19019)
- Actually moving the code around and updating import paths (a possible compatibility shim is sketched below)
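One possible shape for the import-path change, purely as a sketch (the final package layout is not decided here): the implementation would live under mxnet.amp, and mxnet.contrib.amp would become a thin deprecation shim along these lines:

```python
# Hypothetical contents of mxnet/contrib/amp/__init__.py after the move.
import warnings

from mxnet.amp import *  # noqa: F401,F403 -- re-export the relocated implementation

warnings.warn(
    'mxnet.contrib.amp has moved to mxnet.amp; please update your imports. '
    'The contrib alias is kept only for backward compatibility.',
    DeprecationWarning,
)
```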
Post move:
- Layout optimization - upstreaming a feature that already exists in the NVIDIA NGC container. It improves convolution performance by automatically casting between NCHW and NHWC layouts.
- Explore alternatives to monkey-patching front-end ops (AMP for mx2 #18697); a simplified sketch of the current patching approach follows this list.
- Add a way for the user to turn AMP off, and to control AMP settings via a context manager (see the hypothetical sketch below).
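For reference on the monkey-patching item, today's approach boils down to something like the following simplified illustration (not the actual AMP code): module-level ops are replaced with wrappers that cast eligible float32 inputs to float16 before dispatching.

```python
# Simplified illustration of front-end op monkey-patching, not the real AMP code.
import numpy as np
import mxnet as mx

_original_convolution = mx.nd.Convolution

def _fp16_convolution(*args, **kwargs):
    # Cast float32 NDArray arguments to float16; leave everything else untouched.
    def cast(x):
        if isinstance(x, mx.nd.NDArray) and x.dtype == np.float32:
            return x.astype('float16')
        return x
    args = [cast(a) for a in args]
    kwargs = {k: cast(v) for k, v in kwargs.items()}
    return _original_convolution(*args, **kwargs)

mx.nd.Convolution = _fp16_convolution  # all later calls go through the wrapper
```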
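For the off switch / context manager item, here is a minimal sketch of how such a control could be implemented; off() and the _amp_enabled flag are hypothetical names used only for illustration, not an agreed-upon API:

```python
# Hypothetical sketch of an AMP off-switch context manager.
from contextlib import contextmanager

_amp_enabled = True   # hypothetical global flag consulted by the casting logic

@contextmanager
def off():
    """Temporarily disable AMP casting inside a with-block."""
    global _amp_enabled
    previous, _amp_enabled = _amp_enabled, False
    try:
        yield
    finally:
        _amp_enabled = previous
```

Patched ops (or a backend casting pass) would consult _amp_enabled and skip the float16 cast while it is False, so users could write `with amp.off(): out = net(x)` around precision-sensitive code.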