Description
Following the instructions in your repository, I cloned the code and ran the w2g64.sh script for the Llama‑2‑7B model. I used 4096 RedPajama samples with a context length of 2048, a batch size of 2, and two epochs, as specified in Section 4.1. According to Table 7 of the paper, the 7B model should complete the Block‑AP phase in about 3.3 hours with roughly 8.5 GB of memory.
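To make the comparison with your reported setup easier, here is a small snippet I can run to report my environment (standard PyTorch and stdlib calls only, nothing repo‑specific):

```python
# Environment summary for comparison with the paper's setup.
import os
import torch

print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))
print("cpu cores visible:", os.cpu_count())
```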
However, on my system (a single NVIDIA H100 94 GB GPU and 32 CPU cores with 4 GB of memory each), training is significantly slower. The first epoch of the first block (layer 0) is already taking more than 4 minutes, and overall GPU utilization fluctuates between 0% and 90%. I suspect a CPU–GPU bottleneck in the epoch loop, even though I installed all packages at the exact versions listed in your repository.
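To check this, I plan to wrap one block's epoch in the PyTorch profiler. This is my own diagnostic sketch, not code from the repository; `run_one_epoch` stands in for whatever callable trains a single block for one epoch:

```python
# Diagnostic sketch: profile one block's epoch to see whether time is spent in
# CUDA kernels or in CPU-side work / host-to-device copies while the GPU idles.
import torch
from torch.profiler import profile, ProfilerActivity

def profile_one_block_epoch(run_one_epoch):
    # run_one_epoch: any callable that trains a single block for one epoch
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        run_one_epoch()
        torch.cuda.synchronize()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If the table shows large CPU‑only operators with comparatively little CUDA time, that would support the transfer/data‑preparation bottleneck hypothesis.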
Could you advise whether any additional optimizations or configuration changes are needed to match your reported performance? I would also appreciate guidance on anything I might have missed when replicating the Block‑AP training procedure.
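If it helps narrow things down, this is the kind of change I could test on my side. It is only a sketch, assuming the block inputs are cached as CPU tensors between epochs; or does the Block‑AP loop already perform an equivalent asynchronous transfer?

```python
# My own sketch (not the repository's code): keep the cached block inputs in
# pinned host memory and copy them to the GPU asynchronously, in case the epoch
# loop is currently waiting on pageable host-to-device transfers.
import torch

def to_gpu_async(cached_inputs, device="cuda"):
    """cached_inputs: list of CPU tensors captured for one block."""
    pinned = [t.pin_memory() for t in cached_inputs]          # page-locked host memory
    return [t.to(device, non_blocking=True) for t in pinned]  # overlapped H2D copies
```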
@ChenMnZ