
[gemini] gemini supports lazy init #3379

Merged
FrankLeeeee merged 21 commits into hpcaitech:main from ver217:feature/gemini-lazyinit
Apr 12, 2023

Conversation

@ver217 ver217 (Contributor) commented Mar 31, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number


Closes #3529

📝 What does this PR do?


  • Fix the NVMe optimizer.
  • Refactor Gemini so that it supports ColoInitContext, LazyInitContext, and naive initialization (see the sketch after this list).
  • Update the Gemini plugin.
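
A minimal sketch of how these initialization paths look side by side; the import paths follow the traceback later in this thread, while the GeminiDDP arguments are illustrative assumptions, not verbatim API:

```python
import torch
import torch.nn as nn

# colossalai.utils.model.experimental holds LazyTensor at the time of
# this PR (see the import trace below); LazyInitContext alongside it is
# an assumption for illustration.
from colossalai.utils.model.experimental import LazyInitContext
from colossalai.zero import GeminiDDP

# Naive initialization: parameters are materialized eagerly.
model = nn.Linear(1024, 1024)

# Lazy initialization: parameters are created as meta-backed lazy
# tensors and only materialized when Gemini shards and places them.
with LazyInitContext():
    model = nn.Linear(1024, 1024)

# Gemini wraps the model either way; hypothetical arguments shown.
model = GeminiDDP(model, device=torch.device('cuda'), placement_policy='auto')
```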

As the CI environment is not compatible with MetaTensor, the lazy init test is skipped in CI. I tested it on my local machine:

[screenshot: local lazy init test results]
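
A plausible way to express that skip with pytest (a sketch; the actual guard in the PR may key on a different condition):

```python
import pytest
import torch
from packaging import version

# The exact version bound is an assumption for illustration; the point
# is that environments without MetaTensor support skip the test.
@pytest.mark.skipif(
    version.parse(torch.__version__) < version.parse('1.12.0'),
    reason='this environment does not support MetaTensor',
)
def test_gemini_lazy_init():
    ...
```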

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented


@ver217 ver217 added the Run Build and Test, gemini (related to the gemini feature), and lazyinit (Lazy initialization) labels Apr 11, 2023
@ver217 ver217 marked this pull request as ready for review April 11, 2023 10:12
@ver217 ver217 requested review from 1SAA and FrankLeeeee April 11, 2023 11:40
@FrankLeeeee FrankLeeeee merged commit 152239b into hpcaitech:main Apr 12, 2023
@FrankLeeeee FrankLeeeee deleted the feature/gemini-lazyinit branch April 12, 2023 08:03
@kurisusnowdeng (Contributor) commented

There is an issue when importing colossalai:

Python 3.9.16 (main, Mar  8 2023, 14:00:05) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import colossalai
Traceback (most recent call last):
  File "/home/.conda/envs/dev/lib/python3.9/site-packages/torch/_ops.py", line 565, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator aten::prelu_backward

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ColossalAI/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/home/ColossalAI/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/home/ColossalAI/colossalai/amp/__init__.py", line 11, in <module>
    from .apex_amp import convert_to_apex_amp
  File "/home/ColossalAI/colossalai/amp/apex_amp/__init__.py", line 4, in <module>
    from .apex_amp import ApexAMPOptimizer
  File "/home/ColossalAI/colossalai/amp/apex_amp/apex_amp.py", line 13, in <module>
    from colossalai.nn.optimizer import ColossalaiOptimizer
  File "/home/ColossalAI/colossalai/nn/__init__.py", line 1, in <module>
    from ._ops import *
  File "/home/ColossalAI/colossalai/nn/_ops/__init__.py", line 1, in <module>
    from .addmm import colo_addmm
  File "/home/ColossalAI/colossalai/nn/_ops/addmm.py", line 6, in <module>
    from ._utils import GeneralTensor, Number, convert_to_colo_tensor, reduce_grad, reduce_input
  File "/home/ColossalAI/colossalai/nn/_ops/_utils.py", line 7, in <module>
    from colossalai.nn.layer.utils import divide
  File "/home/ColossalAI/colossalai/nn/layer/__init__.py", line 7, in <module>
    from .moe import *
  File "/home/ColossalAI/colossalai/nn/layer/moe/__init__.py", line 1, in <module>
    from .checkpoint import load_moe_model, save_moe_model
  File "/home/ColossalAI/colossalai/nn/layer/moe/checkpoint.py", line 5, in <module>
    from .experts import MoeExperts
  File "/home/ColossalAI/colossalai/nn/layer/moe/experts.py", line 12, in <module>
    from colossalai.zero.legacy.init_ctx import no_shard_zero_decrator
  File "/home/ColossalAI/colossalai/zero/__init__.py", line 1, in <module>
    from .gemini import (
  File "/home/ColossalAI/colossalai/zero/gemini/__init__.py", line 3, in <module>
    from .gemini_ddp import GeminiDDP, ZeroDDP
  File "/home/ColossalAI/colossalai/zero/gemini/gemini_ddp.py", line 17, in <module>
    from colossalai.utils.model.experimental import LazyTensor
  File "/home/ColossalAI/colossalai/utils/model/experimental.py", line 10, in <module>
    from colossalai._analyzer._subclasses import MetaTensor
  File "/home/ColossalAI/colossalai/_analyzer/_subclasses/__init__.py", line 1, in <module>
    from ._meta_registration import *
  File "/home/ColossalAI/colossalai/_analyzer/_subclasses/_meta_registration.py", line 277, in <module>
    aten.prelu_backward.default,
  File "/home/.conda/envs/dev/lib/python3.9/site-packages/torch/_ops.py", line 569, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'aten' object has no attribute 'prelu_backward'

I'm using PyTorch 2.0 and CUDA 11.8.
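
For context, aten::prelu_backward was removed in PyTorch 2.0, so looking it up during meta-kernel registration fails at import time. A defensive registration pattern might look like the sketch below (illustrative names, not the actual _meta_registration.py code):

```python
import torch

aten = torch.ops.aten
meta_table = {}

def register_meta(op_name):
    def wrapper(fn):
        # _OpNamespace.__getattr__ raises AttributeError for operators
        # missing from the installed torch build (e.g. prelu_backward
        # in PyTorch 2.0), so getattr with a default skips them instead
        # of crashing `import colossalai`.
        op = getattr(aten, op_name, None)
        if op is not None:
            meta_table[op.default] = fn
        return fn
    return wrapper

@register_meta('prelu_backward')
def meta_prelu_backward(grad_output, self, weight):
    # prelu_backward returns gradients w.r.t. the input and the weight.
    return torch.empty_like(self), torch.empty_like(weight)
```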

yhna940 added a commit to EleutherAI/oslo that referenced this pull request Apr 19, 2023
Improve Zero3 Implementation: Search Utility, Consolidation, and In-Place Dist Tensor Conversion (#178)

## Title

- Improve Zero3 Implementation: Search Utility, Consolidation, and
In-Place Dist Tensor Conversion

## Description

This PR aims to improve the zero3 implementation with the following
major changes:

1. Added a search utility for configuring chunk structures.
2. Consolidated zero-related implementations into a single directory
(Motivated by this
[commit](hpcaitech/ColossalAI#3424)).
3. Added a process for converting to custom tensors in-place (Motivated
by this [commit](hpcaitech/ColossalAI#3379); see the sketch after this
list).
4. Added unit tests.

Minor changes include:

1. Instantiation of chunk manager and hetero memory manager within fsdp.
2. Several small bug fixes.
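
A minimal sketch of what such an in-place conversion can look like; the helper name and the conversion function are hypothetical illustrations, not the oslo API:

```python
import torch.nn as nn

def convert_params_inplace(module: nn.Module, convert_fn):
    # Swap each parameter for convert_fn(param) without rebuilding the
    # module tree, so existing references to the module stay valid.
    for submodule in module.modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            setattr(submodule, name, nn.Parameter(convert_fn(param)))

# Usage: convert every parameter of a model to fp16 in place.
model = nn.Linear(8, 8)
convert_params_inplace(model, lambda p: p.detach().half())
```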

## Linked Issues

- N/A

---------

Co-authored-by: Junhwa Song <ethan9867@gmail.com>