Compact Loss #79
Status: Open
Labels: core (Improves core model while keeping core idea intact), research (Creative project that might fail but could give high returns)
Our model uses a lot of parameters for the output layer. Specifically, `2 * vocab_size * devices * features`, where `features=256` and `devices=256` for the planned 20B model, implying that it would use 4.2B + 4.2B parameters with the GPT-2 tokenizer purely for the embedding matrices. For example, ALBERT used factorized embeddings, reducing the number of parameters from `256*256*vocab = 8.59B` to `256*256*sqrt(vocab)*2 = 33.5M`.
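
A minimal sketch of what an ALBERT-style factorization could look like for this configuration: tokens are first embedded into a small dimension `E` and then projected up to the model dimension `H = devices * features`. The names, shapes, and initializers below are illustrative assumptions, not the actual model code.

```python
import jax
import jax.numpy as jnp

vocab = 65536           # padded GPT-2-style vocabulary (~2**16), assumed here
embed_dim = 256         # E ~ sqrt(vocab)
model_dim = 256 * 256   # H = devices * features for the planned 20B model

key = jax.random.PRNGKey(0)
k_embed, k_proj = jax.random.split(key)
# vocab*E + E*H = 16.7M + 16.7M = 33.5M parameters,
# versus vocab*H = 4.29B for an unfactorized embedding table.
token_embed = jax.random.normal(k_embed, (vocab, embed_dim)) * 0.02
up_proj = jax.random.normal(k_proj, (embed_dim, model_dim)) * 0.02

def factorized_embed(token_ids: jnp.ndarray) -> jnp.ndarray:
    """Look up small E-dim embeddings, then project them to the model dimension."""
    return token_embed[token_ids] @ up_proj

tokens = jnp.array([17, 42, 50256])
print(factorized_embed(tokens).shape)  # (3, 65536)
```

The output layer could reuse the same idea in reverse (project `H -> E`, then `E -> vocab` for the logits), so both embedding matrices would shrink by the same factor.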