Description
Following the instructions in your repository, I cloned the code and ran the w2g64.sh script for the Llama‑2‑7B model. I used 4096 RedPajama samples with a context length of 2048, a batch size of 2, and two epochs, as specified in Section 4.1. According to Table 7 of the paper, the 7B model should complete the Block‑AP phase in about 3.3 hours with roughly 8.5 GB of memory.
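To make the comparison with your reported setup easier, here is a small snippet I can run to report my environment (standard PyTorch and stdlib calls only, nothing repo‑specific):

```python
# Environment summary for comparison with the paper's setup.
import os
import torch

print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))
print("cpu cores visible:", os.cpu_count())
```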
However, on my system (a single NVIDIA H100 94 GB GPU and 32 CPU cores with 4 GB of memory each), training is significantly slower. The first epoch of the first block (layer 0) is already taking more than 4 minutes, and overall GPU utilization fluctuates between 0% and 90%. I suspect a CPU–GPU bottleneck in the epoch loop, even though I installed all packages at the exact versions listed in your repository.
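To check this, I plan to wrap one block's epoch in the PyTorch profiler. This is my own diagnostic sketch, not code from the repository; `run_one_epoch` stands in for whatever callable trains a single block for one epoch:

```python
# Diagnostic sketch: profile one block's epoch to see whether time is spent in
# CUDA kernels or in CPU-side work / host-to-device copies while the GPU idles.
import torch
from torch.profiler import profile, ProfilerActivity

def profile_one_block_epoch(run_one_epoch):
    # run_one_epoch: any callable that trains a single block for one epoch
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        run_one_epoch()
        torch.cuda.synchronize()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If the table shows large CPU‑only operators with comparatively little CUDA time, that would support the transfer/data‑preparation bottleneck hypothesis.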
Could you advise whether any additional optimizations or configuration changes are needed to match your reported performance? I would also appreciate guidance on anything I might have missed when replicating the Block‑AP training procedure.
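If it helps narrow things down, this is the kind of change I could test on my side. It is only a sketch, assuming the block inputs are cached as CPU tensors between epochs; or does the Block‑AP loop already perform an equivalent asynchronous transfer?

```python
# My own sketch (not the repository's code): keep the cached block inputs in
# pinned host memory and copy them to the GPU asynchronously, in case the epoch
# loop is currently waiting on pageable host-to-device transfers.
import torch

def to_gpu_async(cached_inputs, device="cuda"):
    """cached_inputs: list of CPU tensors captured for one block."""
    pinned = [t.pin_memory() for t in cached_inputs]          # page-locked host memory
    return [t.to(device, non_blocking=True) for t in pinned]  # overlapped H2D copies
```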
@ChenMnZ