WIP: add oscar slurm script for preprocess_data_dist by adammoody · Pull Request #4 · bigscience-workshop/bigscience

adammoody · 2021-08-19T22:42:50Z

I can't run on JZ, but for a concrete example, I think a script like the one in this PR could be used with the new preprocess_data_dist.py script. This requires the JSON support added in PR bigscience-workshop/Megatron-DeepSpeed#60

To process a JSON file, the script first generates an "index" that records the starting byte offset and length of each line in the source JSON file. That index file is stored beside the source JSON file to be reused in future runs. The index enables quick random access to the (variable-length) lines in the JSON file.
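As a rough illustration of the indexing idea described above, here is a minimal sketch in Python. It is not the code from preprocess_data_dist.py: the function names `build_line_index` and `read_record` are hypothetical, and the index is kept as an in-memory list rather than in the on-disk `.idx` format, which this thread does not describe.

```python
import json
import os
import tempfile


def build_line_index(path):
    """Return a list of (byte_offset, byte_length) pairs, one per line.

    Hypothetical helper for illustration only; lengths include the
    trailing newline byte when the line has one.
    """
    index = []
    offset = 0
    # Binary mode keeps offsets in bytes, so multi-byte UTF-8 text is safe.
    with open(path, "rb") as f:
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    return index


def read_record(path, index, i):
    """Seek straight to line i and parse it, without scanning earlier lines."""
    offset, length = index[i]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))


if __name__ == "__main__":
    # Tiny stand-in for a real .jsonl file such as the OSCAR sample.
    rows = [{"text": "first"}, {"text": "second, longer line"}, {"text": "third"}]
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.jsonl")
        with open(path, "wb") as f:
            for row in rows:
                f.write((json.dumps(row) + "\n").encode("utf-8"))
        index = build_line_index(path)
        print(read_record(path, index, 2))
```

Because every entry records where a line starts and how long it is, any process can fetch an arbitrary record with a single seek and read, which is what makes random access to variable-length lines cheap.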

In the example SLURM script in this PR, for the source file:

$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl

the preprocess_data_dist.py script will create the following files as a result of indexing the source JSON file:

$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl.idx (persists after first run)
$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl.idxtmp (created and deleted during run)

adammoody · 2021-08-19T22:56:11Z

If you verify this works well and correctly at smaller node counts, you might try scaling it higher. It has scaled well for me up to 64 nodes so far.

adammoody · 2021-08-19T23:55:31Z

Oh, and for a quick test, add something like preprocess_data_dist.py --count 1000 to limit the number of samples processed. It's good to test things with one or two nodes and a small sample count before trying large node counts and the full dataset.

stas00 · 2021-10-15T03:36:03Z

Hi Adam,

I have just noticed your WIP PR here. Is this still relevant? If so, let's merge it; if not, should we move or close it?

adammoody · 2021-10-15T07:41:49Z

Thanks, @stas00. No need to merge this one.

adammoody added 2 commits August 19, 2021 15:40

add oscar slurm script for preprocess_data_dist

05d7e33

fix a couple typos

5472a3b

adammoody mentioned this pull request Aug 20, 2021

extend preprocess_data_dist to handle jsonl files bigscience-workshop/Megatron-DeepSpeed#60

Open

5 tasks

adammoody changed the title ~~add oscar slurm script for preprocess_data_dist~~ WIP: add oscar slurm script for preprocess_data_dist Aug 20, 2021

simplify scontrol line to get first host

7677577

adammoody closed this Oct 15, 2021

adammoody wants to merge 3 commits into bigscience-workshop:master from adammoody:preprocessdist
