fix: doc_idx offset when merging indexed dataset files #66
Merged
thomasw21 merged 1 commit into bigscience-workshop:main on Aug 16, 2021
Conversation
Contributor
Author
@thomasw21, would you please take a look at this one, too, when you get a chance?
thomasw21
approved these changes
Aug 16, 2021
Member
thomasw21
left a comment
LGTM! Thanks a lot for noticing!
Contributor
Author
Thanks, @thomasw21
jaredcasper
pushed a commit
to NVIDIA/Megatron-LM
that referenced
this pull request
May 20, 2022
tools/merge_datasets.py
- tool to merge multiple dataset files into a single dataset
- testing conducted and included in the megatron-testing repo https://gitlab-master.nvidia.com/ADLR/megatron-testing
tools/preprocess_data.py
- magic numbers changed to required command line arguments
megatron/data/indexed_dataset.py
- when merging, fix to properly update document index
- testing conducted and included in the megatron-testing repo (see above)
- fix follows this history bigscience-workshop/Megatron-DeepSpeed#66
adammoody
pushed a commit
to adammoody/Megatron-DeepSpeed
that referenced
this pull request
Oct 27, 2022
* add changes for enabling AML run
* update dockerfile and submit script
* fix spelling
Co-authored-by: Miseon Park <mipark@microsoft.com>
This fixes an issue when computing the document index offset value when merging cached
`IndexedDataset` files. The problem was that `offset` is overwritten in the `data_offset` loop before it is used to adjust the document index values.
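
The failure mode described above can be reproduced with a small, self-contained sketch. This is not the code from `megatron/data/indexed_dataset.py`: the dict-of-lists layout and the helper names `make_dataset`, `merge_buggy`, and `merge_fixed` are assumptions made purely for illustration. It only shows how reusing the name `offset` as a loop variable can clobber the value later needed to shift the document index.

```python
def make_dataset(sizes, doc_idx):
    """Build a toy index (hypothetical layout): item sizes, cumulative
    data offsets, and document boundaries expressed as item indices."""
    data_offsets = [0]
    for size in sizes:
        data_offsets.append(data_offsets[-1] + size)
    return {
        "sizes": list(sizes),
        "data_offsets": data_offsets,
        "doc_idx": list(doc_idx),
    }


def merge_buggy(dst, src):
    """Append src to dst, reusing `offset` as the loop variable (the bug)."""
    offset = len(dst["sizes"])  # intended shift for src's document index
    begin = dst["data_offsets"][-1]
    for offset in src["data_offsets"][1:]:  # overwrites the shift computed above
        dst["data_offsets"].append(begin + offset)
    dst["sizes"].extend(src["sizes"])
    # `offset` now holds src's last data offset rather than the item count
    dst["doc_idx"].extend(offset + i for i in src["doc_idx"][1:])
    return dst


def merge_fixed(dst, src):
    """Append src to dst, keeping the document shift in its own variable."""
    doc_offset = len(dst["sizes"])  # number of items already in dst
    begin = dst["data_offsets"][-1]
    for data_offset in src["data_offsets"][1:]:
        dst["data_offsets"].append(begin + data_offset)
    dst["sizes"].extend(src["sizes"])
    # Skip src's leading 0 boundary and shift the rest by the item count
    dst["doc_idx"].extend(doc_offset + i for i in src["doc_idx"][1:])
    return dst


if __name__ == "__main__":
    buggy = merge_buggy(make_dataset([3, 4], [0, 2]),
                        make_dataset([5, 6, 7], [0, 1, 3]))
    fixed = merge_fixed(make_dataset([3, 4], [0, 2]),
                        make_dataset([5, 6, 7], [0, 1, 3]))
    print("buggy doc_idx:", buggy["doc_idx"])
    print("fixed doc_idx:", fixed["doc_idx"])
```

In this toy model the two variants produce identical data offsets and sizes; only the document index differs, which may be why the problem is easy to overlook.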