I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". It uses an algorithm similar to terashuf: contiguous records from the source file are shuffled in segments, and the final shuffled file is assembled by repeatedly picking the leading record from a randomly chosen shuffled segment.
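For illustration, here is a minimal single-process sketch of that segment-shuffle-and-merge idea. The function name, segment count, and in-memory representation are my own choices for the sketch and are not taken from distshuf.py.

```python
import random

def terashuf_style_shuffle(lines, num_segments=4, seed=101):
    """Shuffle records by shuffling contiguous segments, then merging them."""
    if not lines:
        return []
    rng = random.Random(seed)

    # Split the records into contiguous segments and shuffle each segment.
    seg_size = (len(lines) + num_segments - 1) // num_segments
    segments = [lines[i:i + seg_size] for i in range(0, len(lines), seg_size)]
    for seg in segments:
        rng.shuffle(seg)

    # Assemble the output by repeatedly taking the leading record from a
    # randomly chosen segment. Weighting the choice by the number of
    # remaining records keeps the overall permutation unbiased.
    shuffled = []
    while segments:
        seg = rng.choices(segments, weights=[len(s) for s in segments])[0]
        shuffled.append(seg.pop(0))  # pop(0) is O(n); fine for a sketch
        segments = [s for s in segments if s]  # drop segments that ran out
    return shuffled
```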
This prototype currently stores the shuffled segments in memory (rather than in files), so the full file must fit in distributed memory. Each rank reads a portion of the source file into memory and shuffles that section, and then the ranks exchange lines with each other so the output file can be written in contiguous chunks.
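Here is a rough mpi4py sketch of the per-rank read-and-shuffle step, assuming one record per line. The function name, the byte-range convention, and the per-rank seeding scheme are illustrative assumptions; the actual bookkeeping and line exchange in distshuf.py are more involved.

```python
import os
import random
from mpi4py import MPI

def read_and_shuffle(path, seed=101):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Assign each rank a roughly equal byte range [start, end) of the file.
    file_size = os.path.getsize(path)
    start = rank * file_size // size
    end = (rank + 1) * file_size // size

    lines = []
    with open(path, "rb") as f:
        if start == 0:
            f.seek(0)
        else:
            # Position at the first line that starts at or after `start`;
            # the partial line before it belongs to the previous rank.
            f.seek(start - 1)
            f.readline()
        # A rank owns every line whose first byte falls inside its range.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)

    # Shuffle this rank's contiguous segment locally; ranks would then
    # exchange lines so the output can be written in contiguous chunks.
    random.Random(seed + rank).shuffle(lines)
    return lines
```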
It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.
```
2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526

real 6m25.041s
```
Just posting this notice in case others need to shuffle a large JSON file in a hurry.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py
It currently requires mpi4py and an mpi4py-enabled DistData class.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py
I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see whether a torch.distributed equivalent would be straightforward.
For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.
Example command:

```
srun -n 80 -N 8 python3 tools/distshuf.py \
    --input /gpfs/path/to/oscar.jsonl \
    --output /gpfs/path/to/oscarshuf.jsonl \
    --seed 101
```