WIP: distributed terashuf #92
Open
adammoody wants to merge 81 commits into bigscience-workshop:main from adammoody:distshuf
Conversation
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request on Jun 21, 2023.
This is solid enough that I'll go ahead and post a WIP PR. It's based on #60, so this will look noisy until that PR is merged. Most of the changes are in the distshuf file, which is linked below.
I don't know how much time I'll have to polish this up, but I have a prototype MPI-enabled "terashuf". It uses an algorithm similar to terashuf: contiguous records from the source file are shuffled in segments, and the final shuffled file is assembled by randomly picking leading records from those shuffled segments.
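For reference, here is a compact sketch of that terashuf-style scheme. It is written independently of distshuf.py, so the function and parameter names are illustrative rather than the actual ones in the tool:

```python
import random

def terashuf_like(records, segment_size, seed=0):
    """Shuffle records by shuffling contiguous segments, then repeatedly
    drawing the next record from a randomly chosen segment."""
    rng = random.Random(seed)

    # Split the input into contiguous segments and shuffle each one.
    segments = [records[i:i + segment_size]
                for i in range(0, len(records), segment_size)]
    for seg in segments:
        rng.shuffle(seg)

    # Pick a segment with probability proportional to its remaining size
    # (this keeps the overall permutation uniform) and take its next record.
    output = []
    while segments:
        weights = [len(seg) for seg in segments]
        i = rng.choices(range(len(segments)), weights=weights)[0]
        output.append(segments[i].pop())  # segments are already shuffled, so
        if not segments[i]:               # popping from the end is equivalent
            del segments[i]               # to taking the leading record
    return output

print(terashuf_like(list(range(10)), segment_size=4))
```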
This prototype currently stores the shuffled segments in memory rather than in files, so the full file must fit in distributed memory. Each rank reads a portion of the source file into memory and shuffles that section, and then the ranks exchange lines with one another so that the output file can be written in contiguous chunks.
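As a rough illustration of that read-shuffle-exchange pattern, here is a minimal mpi4py sketch; the real distshuf.py works against the actual file with parallel I/O rather than the toy data below, and none of these names come from the tool itself:

```python
import random
from mpi4py import MPI

def distributed_shuffle(local_lines, comm, seed=0):
    """Globally shuffle lines that are spread across MPI ranks, in memory."""
    rank = comm.Get_rank()
    nranks = comm.Get_size()
    rng = random.Random(seed + rank)

    # 1. Shuffle this rank's contiguous segment of the file.
    rng.shuffle(local_lines)

    # 2. Assign each line a random destination rank and exchange, so every
    #    rank ends up holding a random sample of the whole file.
    outgoing = [[] for _ in range(nranks)]
    for line in local_lines:
        outgoing[rng.randrange(nranks)].append(line)
    incoming = comm.alltoall(outgoing)

    # 3. Shuffle the received lines so their local order does not encode
    #    which rank they came from. The real tool then writes each rank's
    #    lines to its contiguous chunk of the output file.
    merged = [line for bucket in incoming for line in bucket]
    rng.shuffle(merged)
    return merged

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # Toy stand-in for the rank's portion of the source file.
    local = [f"rank{comm.Get_rank()}-line{i}" for i in range(4)]
    print(comm.Get_rank(), distributed_shuffle(local, comm))
```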
It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.
Just posting this notice in case others need to shuffle a large JSON file in a hurry.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py
It currently requires `mpi4py` and an mpi4py-enabled `DistData` class: https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py
I first attempted a `torch.distributed` version, but hit some problems. I haven't yet gone back to see if a `torch.distributed` equivalent is easy. For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.
Example command:
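An invocation would look roughly like this; the launcher and flag names are illustrative placeholders, not the script's actual arguments (see tools/distshuf.py for those):

```shell
# Illustrative only: the real argument names are defined in tools/distshuf.py.
mpiexec -np 80 python tools/distshuf.py \
    --input oscar.jsonl \
    --output oscar-shuffled.jsonl
```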
Update 2021-09-02:
I took a pass at using numpy to optimize performance a bit more. The tool prints a timing breakdown of its major operations as it goes; the per-step times in seconds for one phase, picked at random, look like this:
In each step of this particular run, each rank gathers 100_000 samples, each about 5000 bytes on average, using 320 procs on 8 nodes. So the total data processed in each step is about `100_000 * 5000 * 320` bytes ≈ 149 GiB. The data movement portions are pack, exchange, and write. Converting those component times in seconds to bandwidths:

Based on system hardware speeds, there should be room for improvement in all of those (pack would be bottlenecked by memory bandwidth, exchange by network bandwidth, and write by file system write bandwidth). That might be worth doing for larger input files, but I'm pretty content with the current performance.
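For reference, the per-step volume and the time-to-bandwidth conversion behind those figures work out as follows; the 10-second value below is only a placeholder, not a measured time:

```python
# Per-step data volume from the figures above, and the conversion from a
# phase's elapsed time to bandwidth. The 10.0 s example is a placeholder.
samples_per_rank = 100_000
bytes_per_sample = 5_000
ranks = 320

step_bytes = samples_per_rank * bytes_per_sample * ranks
print(f"{step_bytes / 2**30:.1f} GiB per step")   # -> 149.0 GiB

def gib_per_second(phase_seconds):
    return step_bytes / phase_seconds / 2**30

print(f"{gib_per_second(10.0):.1f} GiB/s")        # e.g. a 10 s phase -> ~14.9 GiB/s
```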