Compact Loss #79
Status: Open
Labels: core (Improves core model while keeping core idea intact), research (Creative project that might fail but could give high returns)
Our model uses a lot of parameters for the output layer. Specifically, `2 * vocab_size * devices * features`, where `features=256` and `devices=256` for the planned 20B model, implying that it would use 4.2B + 4.2B parameters with the GPT-2 tokenizer purely for the embedding matrices. For example, ALBERT used factorized embeddings, reducing the number of parameters from `256*256*vocab = 8.59B` to `256*256*sqrt(vocab)*2 = 33.5M`.
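
A minimal sketch of what an ALBERT-style factorization could look like for this configuration: tokens are first embedded into a small dimension `E` and then projected up to the model dimension `H = devices * features`. The names, shapes, and initializers below are illustrative assumptions, not the actual model code.

```python
import jax
import jax.numpy as jnp

vocab = 65536           # padded GPT-2-style vocabulary (~2**16), assumed here
embed_dim = 256         # E ~ sqrt(vocab)
model_dim = 256 * 256   # H = devices * features for the planned 20B model

key = jax.random.PRNGKey(0)
k_embed, k_proj = jax.random.split(key)
# vocab*E + E*H = 16.7M + 16.7M = 33.5M parameters,
# versus vocab*H = 4.29B for an unfactorized embedding table.
token_embed = jax.random.normal(k_embed, (vocab, embed_dim)) * 0.02
up_proj = jax.random.normal(k_proj, (embed_dim, model_dim)) * 0.02

def factorized_embed(token_ids: jnp.ndarray) -> jnp.ndarray:
    """Look up small E-dim embeddings, then project them to the model dimension."""
    return token_embed[token_ids] @ up_proj

tokens = jnp.array([17, 42, 50256])
print(factorized_embed(tokens).shape)  # (3, 65536)
```

The output layer could reuse the same idea in reverse (project `H -> E`, then `E -> vocab` for the logits), so both embedding matrices would shrink by the same factor.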