Merged
54 commits
f1d4167
[moe] removed openmoe-coupled code and rectify mixstral code (#5471)
FrankLeeeee Mar 19, 2024
df6826d
[Feauture] MoE refractor; Intergration with Mixtral (#5682)
Edenzzzz May 29, 2024
d49fd63
add mixtral auto policy & move pipeline forward code to modeling folder
Hz188 May 31, 2024
d2e07fc
[moe refactor] modify kernel test without Route Class
Hz188 Jun 4, 2024
7556b8f
[moe refactor] add moe tensor test path environment variable to githu…
Hz188 Jun 4, 2024
16329d5
fix typos
Hz188 Jun 4, 2024
b934437
fix moe test bug due to the code rebase
Hz188 Jun 5, 2024
a792e83
[moe refactor] fix moe zero test, and little bug in low level zero
Hz188 Jun 6, 2024
d203ba8
fix typo
Hz188 Jun 6, 2024
55c7416
add moe tensor path to github workflow
Hz188 Jun 6, 2024
8915e9d
remove some useless code
Hz188 Jun 6, 2024
7963fb0
fix typo & unify global variable XX_AXIS logic without using -1
Hz188 Jun 7, 2024
32ced74
fix typo & prettifier the code
Hz188 Jun 7, 2024
3100c1b
remove print code & support zero 2 test
Hz188 Jun 7, 2024
928ee39
remove useless code
Hz188 Jun 7, 2024
6dc0cfc
reanme function
Hz188 Jun 7, 2024
4417840
fix typo
Hz188 Jun 7, 2024
eb35655
fix typo
Hz188 Jun 7, 2024
d1d446b
Further improve the test code
Hz188 Jun 7, 2024
09a5188
remove print code
Hz188 Jun 7, 2024
4c6ea42
[moe refactor] change test model from fake moe model to mixtral moe l…
Hz188 Jun 11, 2024
80b6586
[moe refactor] skip some unit test which will be refactored later
Hz188 Jun 11, 2024
7d06220
[moe refactor] fix unit import error
Hz188 Jun 11, 2024
fb41f42
[moe refactor] fix circular import issues
Hz188 Jun 11, 2024
e99b69c
[moe refactor] remove debug code
Hz188 Jun 11, 2024
af9ade6
[moe refactor] update github workflow
Hz188 Jun 12, 2024
49d74f3
Merge pull request #5775 from Hz188/feature/moe
botbw Jun 12, 2024
d71ab10
[moe/zero] refactor low level optimizer (#5767)
botbw Jun 12, 2024
88f318a
[Feature] MoE refactor with newest version of ZeRO (#5801)
Hz188 Jun 12, 2024
b2ac7e5
[zero] remove redundant members in BucketStore (#5802)
botbw Jun 12, 2024
346a0df
[zero] align api with previous version
botbw Jun 13, 2024
a3a7d7d
Merge pull request #5811 from botbw/moe
botbw Jun 14, 2024
ba0115a
[Moe/Zero] Update MoeHybridParallelPlugin with refactored ZeRO and Fi…
Hz188 Jun 14, 2024
a10802e
[hotfix]Solve the compatibility issue of zero refactor (#5823)
Hz188 Jun 17, 2024
4cd4a1f
[zero] fix missing hook removal (#5824)
botbw Jun 17, 2024
729388e
[MoE] Resolve .github conflict (#5829)
Hz188 Jun 19, 2024
d9ea6d4
[zero] fix hook bug
Hz188 Jun 19, 2024
b04e99c
Merge branch 'main' into feature/moe
Hz188 Jun 19, 2024
62cd25d
[zero] add low level optimizer back (#5839)
botbw Jun 20, 2024
204d25c
[zero] comments and naming (#5840)
botbw Jun 20, 2024
efdfa06
[zero] modify api (#5843)
botbw Jun 20, 2024
44aeccc
[test] fix (#5857)
botbw Jun 26, 2024
9398484
[CI] skip openmoe CI check
Hz188 Jun 26, 2024
5e551f8
[CI] fox pre-commit
Hz188 Jun 26, 2024
2ff332c
[zero] remove redundant memebr init (#5862)
botbw Jun 27, 2024
75be843
[misc] remove useless code, modify the pg mesh implementation
Hz188 Jun 27, 2024
1855442
Merge branch 'hpcaitech:feature/moe' into feature/moe
Hz188 Jun 27, 2024
3a25166
[misc] remove useless code, modify the pg mesh implementation
Hz188 Jun 27, 2024
502e514
[misc] use tempfile
Hz188 Jun 27, 2024
494b8a2
resolve conflict with main branch
Hz188 Jun 27, 2024
961e96f
resolve conflict with main branch
Hz188 Jun 27, 2024
95c4c0b
[misc] use tempfile in test_moe_checkpoint.py
Hz188 Jun 27, 2024
9e966b9
[misc] remove useless code, add assertion about sequence parallel, mo…
Hz188 Jun 28, 2024
165e894
[misc] remove useless code
Hz188 Jun 28, 2024
3 changes: 2 additions & 1 deletion .github/workflows/build_on_pr.yml
@@ -90,7 +90,7 @@ jobs:
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
options: --gpus all --rm -v /dev/shm -v /data/scratch:/data/scratch
timeout-minutes: 90
defaults:
run:
@@ -165,6 +165,7 @@ jobs:
env:
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny
MOE_TENSOR_PATH: /data/scratch/moe_tensors

- name: Collate artifact
env:
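The MOE_TENSOR_PATH variable added above (and repeated in each workflow below) points the MoE tensor tests at a fixture directory on the self-hosted runner. A minimal sketch of how a test could consume it, assuming a pytest-style check; the default path and the skip behaviour are illustrative assumptions, not part of this PR:

import os

import pytest

# Mirrors the environment variable exported in the CI workflows above; the default
# value and the skip behaviour below are assumptions for illustration only.
MOE_TENSOR_PATH = os.environ.get("MOE_TENSOR_PATH", "/data/scratch/moe_tensors")


@pytest.mark.skipif(not os.path.isdir(MOE_TENSOR_PATH), reason="MoE tensor fixtures not found")
def test_moe_tensor_fixtures_exist():
    # Placeholder check: the real MoE tests would load reference tensors from this directory.
    assert os.listdir(MOE_TENSOR_PATH)
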
3 changes: 2 additions & 1 deletion .github/workflows/build_on_schedule.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
options: --gpus all --rm -v /dev/shm -v /data/scratch/:/data/scratch/
timeout-minutes: 90
steps:
- name: Check GPU Availability # ensure all GPUs have enough memory
@@ -69,6 +69,7 @@ jobs:
env:
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny
MOE_TENSOR_PATH: /data/scratch/moe_tensors

- name: Notify Lark
id: message-preparation
3 changes: 2 additions & 1 deletion .github/workflows/compatiblity_test_on_dispatch.yml
@@ -50,7 +50,7 @@ jobs:
matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
container:
image: ${{ matrix.container }}
options: --gpus all --rm -v /dev/shm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
options: --gpus all --rm -v /dev/shm -v /data/scratch/:/data/scratch/
timeout-minutes: 200
steps:
- name: Install dependencies
@@ -92,3 +92,4 @@ jobs:
DATA: /data/scratch/cifar-10
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny
MOE_TENSOR_PATH: /data/scratch/moe_tensors
3 changes: 2 additions & 1 deletion .github/workflows/compatiblity_test_on_pr.yml
@@ -41,7 +41,7 @@ jobs:
matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
container:
image: ${{ matrix.container }}
options: --gpus all --rm -v /dev/shm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
options: --gpus all --rm -v /dev/shm -v /data/scratch/:/data/scratch/
timeout-minutes: 200
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-run-test-${{ matrix.container }}
@@ -87,3 +87,4 @@ jobs:
DATA: /data/scratch/cifar-10
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny
MOE_TENSOR_PATH: /data/scratch/moe_tensors
3 changes: 2 additions & 1 deletion .github/workflows/compatiblity_test_on_schedule.yml
@@ -38,7 +38,7 @@ jobs:
matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
container:
image: ${{ matrix.container }}
options: --gpus all --rm -v /dev/shm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
options: --gpus all --rm -v /dev/shm -v /data/scratch/:/data/scratch/
timeout-minutes: 200
steps:
- name: Install dependencies
@@ -85,6 +85,7 @@ jobs:
DATA: /data/scratch/cifar-10
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny
MOE_TENSOR_PATH: /data/scratch/moe_tensors

- name: Notify Lark
id: message-preparation
Empty file.
Empty file.
92 changes: 0 additions & 92 deletions applications/ColossalMoE/colossal_moe/models/mixtral_layer.py

This file was deleted.

4 changes: 0 additions & 4 deletions applications/ColossalMoE/infer.py
@@ -2,8 +2,6 @@

import torch
import torch.distributed as dist
from colossal_moe.models.mixtral_checkpoint import MixtralMoEHybridParallelCheckpointIO
from colossal_moe.models.mixtral_policy import MixtralForCausalLMPolicy
from transformers import AutoTokenizer
from transformers.models.mixtral import MixtralConfig, MixtralForCausalLM

@@ -70,8 +68,6 @@ def main():
ep_size=ep_size,
zero_stage=1,
precision=args.precision,
custom_policy=MixtralForCausalLMPolicy(),
checkpoint_io=MixtralMoEHybridParallelCheckpointIO,
enable_fused_normalization=args.use_layernorm_kernel,
enable_jit_fused=args.use_kernel,
)
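The net effect of the infer.py hunk above is that a Mixtral-specific policy and checkpoint IO class no longer have to be passed in; after this PR the plugin handles them automatically (commit d49fd63, "add mixtral auto policy & move pipeline forward code to modeling folder"). A minimal, hedged sketch of the resulting plugin construction follows; the import path, the tp_size/pp_size values, and the CLI defaults are assumptions for illustration and are not taken from this diff:

import argparse

# Import path is an assumption and may differ across ColossalAI versions.
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin

# Hypothetical stand-ins for infer.py's CLI arguments, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--precision", default="bf16")
parser.add_argument("--use_layernorm_kernel", action="store_true")
parser.add_argument("--use_kernel", action="store_true")
args = parser.parse_args([])
ep_size = 2  # expert parallel size; read from the command line in the real script

# Assumes the distributed environment has already been initialized
# (colossalai.launch_from_torch in infer.py) before the plugin is built.
plugin = MoeHybridParallelPlugin(
    tp_size=1,  # assumed; set outside the hunk shown above
    pp_size=1,  # assumed; set outside the hunk shown above
    ep_size=ep_size,
    zero_stage=1,
    precision=args.precision,
    enable_fused_normalization=args.use_layernorm_kernel,
    enable_jit_fused=args.use_kernel,
    # custom_policy and checkpoint_io are no longer required after this PR.
)
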
3 changes: 2 additions & 1 deletion applications/ColossalMoE/infer.sh
@@ -1,5 +1,6 @@
NUM_GPU=2
MODEL="mistralai/Mixtral-8x7B-v0.1"
# MODEL="mistralai/Mixtral-8x7B-v0.1"
MODEL="mistralai/Mixtral-8x7B-Instruct-v0.1"

# ep
torchrun --standalone --nproc_per_node $NUM_GPU infer.py \
146 changes: 0 additions & 146 deletions applications/ColossalMoE/tests/test_moe_checkpoint.py

This file was deleted.

6 changes: 1 addition & 5 deletions applications/ColossalMoE/train.py
@@ -2,13 +2,11 @@

import torch
import torch.distributed as dist
from colossal_moe.models.mixtral_checkpoint import MixtralMoEHybridParallelCheckpointIO
from colossal_moe.models.mixtral_policy import MixtralForCausalLMPolicy
from colossal_moe.utils import load_checkpoint, move_to_cuda, save_checkpoint
from torch.utils.data import Dataset
from tqdm import tqdm
from transformers import AutoTokenizer
from transformers.models.mixtral import MixtralForCausalLM
from utils import load_checkpoint, move_to_cuda, save_checkpoint

import colossalai
from colossalai.booster import Booster
@@ -155,12 +153,10 @@ def main():
pp_size=args.pp_size,
ep_size=args.ep_size,
microbatch_size=args.microbatch_size,
custom_policy=MixtralForCausalLMPolicy(),
enable_fused_normalization=args.use_layernorm_kernel,
enable_jit_fused=args.use_kernel,
precision=args.precision,
zero_stage=args.zero_stage,
checkpoint_io=MixtralMoEHybridParallelCheckpointIO,
)

else:
@@ -20,6 +20,7 @@
print(resp) # super-heavyweight awesome-natured yawning Australian creature!

"""

import json
from typing import Any, Mapping
