[BUG]: Unstable GPU utilization when testing pipeline parallelism with HybridParallelPlugin #4747

@imgaojun

Description

🐛 Describe the bug

When testing pipeline parallelism with HybridParallelPlugin, GPU utilization is unstable and frequently drops below 50%, whereas Megatron usually stays above 90%. Is the pipeline communication path not yet fully optimized?

The code is based on https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/llama2/benchmark.py, with only the data-loading part modified.

My parameters:

```python
plugin = HybridParallelPlugin(tp_size=1,
                              pp_size=2,
                              enable_flash_attention=True,
                              enable_fused_normalization=True,
                              enable_jit_fused=True,
                              microbatch_size=2,
                              precision='bf16',
                              zero_stage=1)
```

batch_size = 32
context_length = 4096
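As a sanity check on these settings: with `pp_size=2`, `batch_size=32`, and `microbatch_size=2` there are 16 microbatches per step, so the ideal 1F1B pipeline bubble (idle fraction `(p-1)/(m+p-1)`, using the standard schedule analysis from the Megatron-LM papers) is small and cannot by itself explain utilization below 50%. A minimal sketch of that estimate (the helper function name is my own, not from the benchmark script):

```python
# Ideal 1F1B pipeline bubble fraction: (p - 1) / (m + p - 1),
# where p = number of pipeline stages and m = number of microbatches.
def pipeline_bubble_fraction(pp_size: int, batch_size: int, microbatch_size: int) -> float:
    num_microbatches = batch_size // microbatch_size
    return (pp_size - 1) / (num_microbatches + pp_size - 1)

# With the settings above: p = 2, m = 32 / 2 = 16 -> 1 / 17 ~= 5.9% idle time.
frac = pipeline_bubble_fraction(pp_size=2, batch_size=32, microbatch_size=2)
print(f"{frac:.1%}")
```

If the scheduling bubble only accounts for roughly 6% idle time, the observed dips below 50% likely come from elsewhere (e.g. data loading or communication stalls), which is consistent with the question above.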

Throughput is also low, at only 4.8 samples/sec (see the attached throughput screenshot).

A screenshot of the GPU utilization is attached below.

Environment

NCCL 2.18.1 + CUDA 12.1, ColossalAI 3.2, PyTorch 2.1
