Block‑AP Training Performance on H100 #34

@bimalmagar10

Description

Following the instructions in your repository, I cloned the code and executed the w2g64.sh script for the Llama‑2‑7B model. I used 4096 RedPajama samples with a context length of 2048, a batch size of 2, and two epochs, as specified in Section 4.1. According to Table 7 of the paper, the 7B model should complete the Block‑AP phase in about 3.3 hours with ~8.5 GB of memory.

However, on my system (one NVIDIA H100 94 GB GPU and 32 CPU cores with 4 GB of memory each), training is significantly slower: the first epoch for the layer‑0 block alone takes more than 4 minutes, and overall GPU utilization fluctuates between 0% and 90%. I suspect a CPU–GPU bottleneck during the epoch loop, even though I installed all packages at the exact versions listed in your repository.
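For reference, this is the kind of isolated check I can run to separate host-side stalls from GPU time. It profiles a stand-in MLP block with the same batch size and sequence length as my run; it is not the repository's training loop, and the layer sizes are only assumed from Llama‑2‑7B:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda"

# Stand-in for a single transformer-block MLP at assumed Llama-2-7B sizes
# (hidden 4096, intermediate 11008); NOT the repository's own module.
block = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.SiLU(),
    nn.Linear(11008, 4096),
).to(device)
opt = torch.optim.AdamW(block.parameters(), lr=1e-4)

# Batch size 2, context length 2048, as in my run.
x = torch.randn(2, 2048, 4096, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        opt.zero_grad(set_to_none=True)
        loss = block(x).pow(2).mean()
        loss.backward()
        opt.step()
    torch.cuda.synchronize()

# If most wall time sits in CPU-side ops (or in gaps between kernels),
# the stall is on the host rather than on the H100 itself.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```

On this kind of synthetic block the GPU stays busy, which is why I suspect the slowdown in the real run comes from the CPU side of the epoch loop rather than the block computation itself.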

Could you advise whether there are additional optimizations or configurations needed to match your reported performance? I would also appreciate any guidance on what I might have missed in replicating the Block‑AP training procedure.
@ChenMnZ
