Skip to content

Distributed terashuf #91

@adammoody

Description

@adammoody

I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". This uses an algorithm similar to terashuf, where contiguous records from the source file are shuffled in segments, and then one randomly picks leading records from the shuffled segments to put together the final shuffled file.

This prototype currently stores the shuffled segments in memory (rather than files), and so it requires one to be able to load the full file into distributed memory. Currently each rank reads a portion of the source file into memory, shuffles that section, and then ranks exchange lines with each other in order to write out the file in contiguous chunks.

It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.

2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526

real	6m25.041s

Just posting this notice in case others need to shuffle a large JSON file in a hurry.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py

It currently requires mpi4py and an mpi4py enabled DistData class.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py

I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see if a torch.dist equivalent is easy.

For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.

Example command:

srun -n 80 -N 8 python3 tools/distshuf.py \
       --input /gpfs/path/to/oscar.jsonl \
       --output /gpfs/path/to/oscarshuf.jsonl \
       --seed 101

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions