@Am1n3e Am1n3e commented Nov 23, 2021

Tested with the BERT tiny config:

> loading indexed mapping from /mnt/experiments/tmp/amine.elhattami/2021-11-23_11-31-53/wikipedia_20200501.en_dsver_1.13.3-json_BertWordPieceLowerCase_bert-large-uncased-vocab-txt_text_sentence_train_indexmap_503mns_509msl_0.10ssp_1234s.npy       
    loaded indexed file in 0.026 seconds                                                                                                                                                                                                                
    total number of samples: 9712743                                                                                                                                                                                                                    
 > WARNING: could not find index map file /mnt/experiments/tmp/amine.elhattami/2021-11-23_11-31-53/wikipedia_20200501.en_dsver_1.13.3-json_BertWordPieceLowerCase_bert-large-uncased-vocab-txt_text_sentence_valid_indexmap_644mns_509msl_0.10ssp_1234s.npy, building the indices on rank 0 ...
 > building sapmles index mapping for valid ...                                                                                                                                                                                                         
    using uint32 for data mapping...                                                                                                                                                                                                                    
    using:                                                                                                                                                                                                                                              
     number of documents:            6331                                                                                                                                                                                                               
     sentences range:                [125651126, 125733008)                                                                                                                                                                                             
     total number of sentences:      81882                                                                                                                                                                                                              
     number of epochs:               2147483646                                                                                                                                                                                                         
     maximum number of samples:      644                                                                                                                                                                                                                
     maximum sequence length:        509                                                                                                                                                                                                                
     short sequence probability:     0.1                                                                                                                                                                                                                
     short sequence ration (1/prob): 10                                                                                                                                                                                                                 
     seed:                           1234                                                                                                                                                                                                               
    reached 0/644 samples after 0 epochs ...23-11-2021 16-32-45                                                                                                                                                                                         
    reached 644 samples after 1 epochs ...                                                                                                                                                                                                              
   number of empty documents: 0                                                                                                                                                                                                                         
   number of documents with one sentence: 743                                                                                                                                                                                                           
   number of documents with long sentences: 3                                                                                                                                                                                                           
   will create mapping for 7456 samples                                                                                                                                                                                                                 
    reached 0/644 samples after 0 epochs ...23-11-2021 16-32-45                                                                                                                                                                                         
 > done building sapmles index maping                                                                                                                                                                                                                   
 > saved the index mapping in /mnt/experiments/tmp/amine.elhattami/2021-11-23_11-31-53/wikipedia_20200501.en_dsver_1.13.3-json_BertWordPieceLowerCase_bert-large-uncased-vocab-txt_text_sentence_valid_indexmap_644mns_509msl_0.10ssp_1234s.npy         
 > elasped time to build and save samples mapping (seconds): 0.006325                                                                                              

@Am1n3e Am1n3e force-pushed the save-idx-file-to-out-folder branch from c2df53b to 95c0ce4 Compare November 23, 2021 16:39
@Am1n3e Am1n3e force-pushed the save-idx-file-to-out-folder branch from 95c0ce4 to 56ab4e0 Compare November 23, 2021 16:41
@Am1n3e Am1n3e requested a review from jlamypoirier November 23, 2021 16:43

@jlamypoirier jlamypoirier left a comment

Some suggestions, otherwise LGTM

indexmap_filename += '_1sentok'
indexmap_filename += '.npy'

indexmap_file_path = str(Path(get_args().save).joinpath(indexmap_filename).absolute())

resolve() is usually recommended over absolute(). It should also be fine to leave the result as a Path, and we could keep the variable name indexmap_filename to reduce the diff.

(Same with other files)
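
For reference, a minimal sketch of the difference behind the resolve() vs. absolute() suggestion. The helper name build_indexmap_path, the directory "out", and the file name "example_indexmap.npy" are hypothetical placeholders, not names from this PR; in the actual change the directory would come from get_args().save.

```python
from pathlib import Path


def build_indexmap_path(save_dir, indexmap_filename):
    """Hypothetical helper: join the output directory and the index map
    file name, returning a normalized absolute Path (not a str)."""
    # resolve() makes the path absolute and also collapses ".." segments
    # and follows symlinks; absolute() only prefixes the working directory.
    return Path(save_dir).joinpath(indexmap_filename).resolve()


# A relative path with a redundant ".." segment (placeholder names).
relative = Path("out") / ".." / "out" / "example_indexmap.npy"

print(relative.absolute())  # keeps the ".." segment in the result
print(relative.resolve())   # ".." is collapsed

indexmap_path = build_indexmap_path("out/../out", "example_indexmap.npy")
print(indexmap_path)
```

Keeping the result as a Path should work with np.save and np.load, which accept path-like objects in recent NumPy versions, so the str(...) conversion is likely unnecessary.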

@Am1n3e Am1n3e force-pushed the save-idx-file-to-out-folder branch from 19d4dbc to 4a3b163 Compare November 23, 2021 19:34
@Am1n3e Am1n3e force-pushed the save-idx-file-to-out-folder branch from 4a3b163 to ffce970 Compare November 23, 2021 19:35
@Am1n3e Am1n3e merged commit cd282e3 into llm-custom Nov 23, 2021
@jlamypoirier jlamypoirier deleted the save-idx-file-to-out-folder branch November 23, 2021 20:58