I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". It uses an algorithm similar to terashuf: contiguous records from the source file are shuffled in segments, and the final shuffled file is assembled by repeatedly picking the leading record from a randomly chosen shuffled segment.
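For illustration, here is a minimal single-process sketch of that segment-shuffle-and-merge idea. The function name, segment count, and in-memory representation are my own choices for the sketch and are not taken from distshuf.py.

```python
import random

def terashuf_style_shuffle(lines, num_segments=4, seed=101):
    """Shuffle records by shuffling contiguous segments, then merging them."""
    if not lines:
        return []
    rng = random.Random(seed)

    # Split the records into contiguous segments and shuffle each segment.
    seg_size = (len(lines) + num_segments - 1) // num_segments
    segments = [lines[i:i + seg_size] for i in range(0, len(lines), seg_size)]
    for seg in segments:
        rng.shuffle(seg)

    # Assemble the output by repeatedly taking the leading record from a
    # randomly chosen segment. Weighting the choice by the number of
    # remaining records keeps the overall permutation unbiased.
    shuffled = []
    while segments:
        seg = rng.choices(segments, weights=[len(s) for s in segments])[0]
        shuffled.append(seg.pop(0))  # pop(0) is O(n); fine for a sketch
        segments = [s for s in segments if s]  # drop segments that ran out
    return shuffled
```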
This prototype currently stores the shuffled segments in memory (rather than in files), so the full file must fit in distributed memory. Each rank reads a portion of the source file into memory and shuffles that section, and then the ranks exchange lines with each other so the output file can be written in contiguous chunks.
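Here is a rough mpi4py sketch of the per-rank read-and-shuffle step, assuming one record per line. The function name, the byte-range convention, and the per-rank seeding scheme are illustrative assumptions; the actual bookkeeping and line exchange in distshuf.py are more involved.

```python
import os
import random
from mpi4py import MPI

def read_and_shuffle(path, seed=101):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Assign each rank a roughly equal byte range [start, end) of the file.
    file_size = os.path.getsize(path)
    start = rank * file_size // size
    end = (rank + 1) * file_size // size

    lines = []
    with open(path, "rb") as f:
        if start == 0:
            f.seek(0)
        else:
            # Position at the first line that starts at or after `start`;
            # the partial line before it belongs to the previous rank.
            f.seek(start - 1)
            f.readline()
        # A rank owns every line whose first byte falls inside its range.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)

    # Shuffle this rank's contiguous segment locally; ranks would then
    # exchange lines so the output can be written in contiguous chunks.
    random.Random(seed + rank).shuffle(lines)
    return lines
```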
It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.
```
2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526

real 6m25.041s
```
Just posting this notice in case others need to shuffle a large JSON file in a hurry.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py
It currently requires mpi4py and an mpi4py-enabled DistData class.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py
I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see whether a torch.distributed equivalent would be straightforward.
For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.
Example command:

```
srun -n 80 -N 8 python3 tools/distshuf.py \
    --input /gpfs/path/to/oscar.jsonl \
    --output /gpfs/path/to/oscarshuf.jsonl \
    --seed 101
```