WIP: distributed terashuf #92
Open
adammoody wants to merge 81 commits into bigscience-workshop:main from adammoody:distshuf
Conversation
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request on Jun 21, 2023.
This is solid enough that I'll go ahead and post a WIP PR. It's based on #60, so this will look noisy until that PR is merged. Most of the changes are in the distshuf file, which is linked below.
I don't know how much time I'll have to polish this up, but I have a prototype MPI-enabled "terashuf". It uses an algorithm similar to terashuf: contiguous records from the source file are shuffled in segments, and the final shuffled file is assembled by randomly picking leading records from those shuffled segments.
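For reference, here is a compact sketch of that terashuf-style scheme. It is written independently of distshuf.py, so the function and parameter names are illustrative rather than the actual ones in the tool:

```python
import random

def terashuf_like(records, segment_size, seed=0):
    """Shuffle records by shuffling contiguous segments, then repeatedly
    drawing the next record from a randomly chosen segment."""
    rng = random.Random(seed)

    # Split the input into contiguous segments and shuffle each one.
    segments = [records[i:i + segment_size]
                for i in range(0, len(records), segment_size)]
    for seg in segments:
        rng.shuffle(seg)

    # Pick a segment with probability proportional to its remaining size
    # (this keeps the overall permutation uniform) and take its next record.
    output = []
    while segments:
        weights = [len(seg) for seg in segments]
        i = rng.choices(range(len(segments)), weights=weights)[0]
        output.append(segments[i].pop())  # segments are already shuffled, so
        if not segments[i]:               # popping from the end is equivalent
            del segments[i]               # to taking the leading record
    return output

print(terashuf_like(list(range(10)), segment_size=4))
```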
This prototype currently stores the shuffled segments in memory rather than in files, so the full file must fit in distributed memory. Each rank reads a portion of the source file into memory and shuffles that section, and then the ranks exchange lines with one another so that the output file can be written in contiguous chunks.
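As a rough illustration of that read-shuffle-exchange pattern, here is a minimal mpi4py sketch; the real distshuf.py works against the actual file with parallel I/O rather than the toy data below, and none of these names come from the tool itself:

```python
import random
from mpi4py import MPI

def distributed_shuffle(local_lines, comm, seed=0):
    """Globally shuffle lines that are spread across MPI ranks, in memory."""
    rank = comm.Get_rank()
    nranks = comm.Get_size()
    rng = random.Random(seed + rank)

    # 1. Shuffle this rank's contiguous segment of the file.
    rng.shuffle(local_lines)

    # 2. Assign each line a random destination rank and exchange, so every
    #    rank ends up holding a random sample of the whole file.
    outgoing = [[] for _ in range(nranks)]
    for line in local_lines:
        outgoing[rng.randrange(nranks)].append(line)
    incoming = comm.alltoall(outgoing)

    # 3. Shuffle the received lines so their local order does not encode
    #    which rank they came from. The real tool then writes each rank's
    #    lines to its contiguous chunk of the output file.
    merged = [line for bucket in incoming for line in bucket]
    rng.shuffle(merged)
    return merged

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # Toy stand-in for the rank's portion of the source file.
    local = [f"rank{comm.Get_rank()}-line{i}" for i in range(4)]
    print(comm.Get_rank(), distributed_shuffle(local, comm))
```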
It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.
Just posting this notice in case others need to shuffle a large JSON file in a hurry.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py
It currently requires `mpi4py` and an mpi4py-enabled `DistData` class: https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py
I first attempted a `torch.distributed` version, but hit some problems. I haven't yet gone back to see if a `torch.distributed` equivalent is easy. For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.
Example command:
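An invocation would look roughly like this; the launcher and flag names are illustrative placeholders, not the script's actual arguments (see tools/distshuf.py for those):

```shell
# Illustrative only: the real argument names are defined in tools/distshuf.py.
mpiexec -np 80 python tools/distshuf.py \
    --input oscar.jsonl \
    --output oscar-shuffled.jsonl
```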
Update 2021-09-02:
I took a pass at using numpy to optimize performance a bit more. The tool prints a timing breakdown of its major operations as it goes; the per-step times in seconds for one phase, picked at random, look like this:
In each step of this particular run, each rank gathers 100_000 samples, each about 5000 bytes on average, using 320 procs on 8 nodes. So the total data processed in each step is about `100_000 * 5000 * 320` bytes ≈ 149 GiB. The data movement portions are pack, exchange, and write. Converting those component times in seconds to bandwidths:

Based on system hardware speeds, there should be room for improvement in all of those (pack would be bottlenecked by memory bandwidth, exchange by network bandwidth, and write by file system write bandwidth). That might be worth doing for larger input files, but I'm pretty content with the current performance.
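For reference, the per-step volume and the time-to-bandwidth conversion behind those figures work out as follows; the 10-second value below is only a placeholder, not a measured time:

```python
# Per-step data volume from the figures above, and the conversion from a
# phase's elapsed time to bandwidth. The 10.0 s example is a placeholder.
samples_per_rank = 100_000
bytes_per_sample = 5_000
ranks = 320

step_bytes = samples_per_rank * bytes_per_sample * ranks
print(f"{step_bytes / 2**30:.1f} GiB per step")   # -> 149.0 GiB

def gib_per_second(phase_seconds):
    return step_bytes / phase_seconds / 2**30

print(f"{gib_per_second(10.0):.1f} GiB/s")        # e.g. a 10 s phase -> ~14.9 GiB/s
```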