Skip to content

Conversation

@tohtana
Copy link
Collaborator

@tohtana tohtana commented Mar 10, 2023

This PR aims to fix the following issues described in #2811 and #2812.

  • Nested zero.Init() causes infinite recursion
  • zero.Init() does not properly partition parameters of a model class that is dynamically defined inside the context

According to the discussion on #2812, this change enables zero.Init() to detect dynamically defined classes and properly sets wrappers to the constructors of the classes. This also manages the hierarchical contexts of zero.Init().

@stas00
Copy link
Collaborator

stas00 commented Mar 25, 2023

Thank you for your patience - this looks good, @tohtana - I did some testing and it works well for the cases I tried.

Now something this sensitive of a change can't be merged w/o adding comprehensive tests. In both issues I have supplied ready to use the tests already, so it should be trivial to integrate those. But also think if perhaps some of my tests don't cover something cases, so please add those as well.

I think these tests would go the best here:
tests/unit/runtime/zero/test_zero_context_ancestry.py
as they cover similar functionality. or perhaps in some other related test file.

Thank you again, this is a really important PR.

@stas00
Copy link
Collaborator

stas00 commented Apr 12, 2023

I don't think rebasing is going to help if the tests added in this PR fail

As the error is:

   File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/tests/unit/runtime/zero/test_zero_context_ancestry.py", line 141, in test_new_class_declared_inside_init
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 728, in __init__
    init_distributed()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 576, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 595, in mpi_discovery
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'

it usually means the test isn't set up properly.

If help is needed please let me know.

@tohtana tohtana enabled auto-merge (squash) April 13, 2023 16:12
@tohtana tohtana merged commit 717c302 into master Apr 14, 2023
@tohtana tohtana deleted the tohtana/nested_zero_init branch April 18, 2023 16:07
@tjruwase
Copy link
Contributor

Fix #2812 and #2811

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants