Skip to content

fix: doc_idx offset when merging indexed dataset files#66

Merged
thomasw21 merged 1 commit intobigscience-workshop:mainfrom
adammoody:fixdocidx
Aug 16, 2021
Merged

fix: doc_idx offset when merging indexed dataset files#66
thomasw21 merged 1 commit intobigscience-workshop:mainfrom
adammoody:fixdocidx

Conversation

@adammoody
Copy link
Contributor

This fixes an issue when computing the document index offset value when merging cached IndexedDataset files. The problem was that offset is overwritten in the data_offset loop before it is used to adjust the document index values.

@adammoody
Copy link
Contributor Author

@thomasw21 , would you please take a look at this one, too, when you get a chance?

Copy link
Member

@thomasw21 thomasw21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot for noticing!

@thomasw21 thomasw21 merged commit c680954 into bigscience-workshop:main Aug 16, 2021
@adammoody adammoody deleted the fixdocidx branch August 16, 2021 19:49
@adammoody
Copy link
Contributor Author

Thanks, @thomasw21

jaredcasper pushed a commit to NVIDIA/Megatron-LM that referenced this pull request May 20, 2022
tools/merge_datasets.py
- tool to merge multiple dataset files into a single dataset
- testing conducted and included in the megatron-testing repo https://gitlab-master.nvidia.com/ADLR/megatron-testing

tools/preprocess_data.py
- magic numbers changed to required command line arguments

megatron/data/indexed_dataset.py
- when merging, fix to properly update document index
- testing conducted and included in the megatron-testing repo (see above)
- fix follows this history bigscience-workshop/Megatron-DeepSpeed#66
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Oct 27, 2022
* add changes for enabling AML run

* update dockerfile and submit script

* fix spelling

Co-authored-by: Miseon Park <mipark@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants