Reformer #3351
Merged: +3,608 −23
Changes from all commits (162 commits, all by patrickvonplaten):

- ee0ce08 first copy & past commit from Bert and morgans LSH code
- 3259115 add easy way to compare to trax original code
- 25d162e translate most of function
- dc07c08 make trax lsh self attention deterministic with numpy seed + copy pas…
- 09d4230 add same config
- 9386450 add same config
- 3fb1182 fix merge conflicts
- ebb9d3f make layer init work
- c910e03 implemented hash_vectors function for lsh attention
- b956933 continue reformer translation
- a449c2e hf LSHSelfAttentionLayer gives same output as trax layer
- 35491d8 refactor code
- c9a0919 refactor code
- 4ef4739 refactor code
- f580074 refactor
- 4176564 refactor + add reformer config
- 8e02fe7 delete bogus file
- 2af8377 split reformer attention layer into two layers
- 6fe9478 fix merge conflicts
- 9e6e1af save intermediate step
- 1855074 save intermediate step
- 1a4e61a make test work
- da6bfe4 add complete reformer block layer
- 2825d24 finish reformer layer
- 45e6635 implement causal and self mask
- b5ed5d4 clean reformer test and refactor code
- ddb2f09 update init
- cbb5ab9 fix device for GPU
- f17fd5b fix chunk length init for tests
- eca8cce include morgans optimization
- db2ebb1 improve memory a bit
- 04aa067 improve comment
- 4aec75e factorize num_buckets
- d030e39 better testing parameters
- d318089 make whole model work
- 4f0b114 make lm model work
- 6c8bad6 add t5 copy paste tokenizer
- b71ef16 add chunking feed forward
- 99427c6 clean config
- 4ffa925 add improved assert statements
- b116e3c make tokenizer work
- 79a0bab improve test
- aceb586 correct typo
- a4814bd extend config
- 5eeeb25 add complexer test
- 0ee5db4 add new axial position embeddings
- 938aa8b add local block attention layer
- 4d7c23b clean tests
- 50276de refactor
- 37a2b00 better testing
- 07c0c72 save intermediate progress
- 060a691 clean test file
- ace301f make shorter input length work for model
- 80d18db allow variable input length
- 86f4ac4 refactor
- e571849 make forward pass for pretrained model work
- d5e1363 add generation possibility
- 562d530 finish dropout and init
- c98eafe make style
- 9c9fab9 refactor
- a188a39 add first version of RevNet Layers
- 8047573 make forward pass work and add convert file
- 31a596b make uploaded model forward pass work
- bae0700 make uploaded model forward pass work
- 831dcec refactor code
- 57ee09c add namedtuples and cache buckets
- 2d23fad correct head masks
- 0c35bbf refactor
- 232463e made reformer more flexible
- 2648a94 make style
- 902408b remove set max length
- 8ed63ab add attention masks
- 513bb43 fix up tests
- db60c23 fix conflict
- 9f359af fix lsh attention mask
- 48097a0 make random seed optional for the moment
- 650e00c improve memory in reformer
- ccba9ac add tests
- f83721e make style
- 125c86d make sure masks work correctly
- 2beda9c detach gradients
- 12e35e1 save intermediate
- 8b058e2 correct backprob through gather
- 69258b8 make style
- 44c3a7c change back num hashes
- 48fff07 rename to labels
- 55842be fix rotation shape
- 71426c0 fix detach
- dfbcf8f update
- 0ea564c fix trainer
- af3456c fix backward dropout
- 002f19c make reformer more flexible
- 7de3f4f fix
- 6111bd5 fix
- 0c75149 add tests for fixed seed in reformer layer
- 7a03bc7 fix trainer typo
- 37943f3 fix typo in activations
- 0f751f5 add fp16 tests
- 8df5dcd add fp16 training
- 51426b5 support fp16
- b37fd3b correct gradient bug in reformer
- e3e05ef add fast gelu
- c3e32b4 re-add dropout for embedding dropout
- 52ee5ed better naming
- ece19ee better naming
- e661832 renaming
- f1a6355 finalize test branch
- ea1126e finalize tests
- d4bc3c6 add more tests
- 94086ac finish tests
- 01f4074 fix
- 9dafbc2 fix type trainer
- de08a57 fix fp16 tests
- aa570dc fix tests
- a681d19 fix tests
- 320c045 fix tests
- 482c6cd fix issue with dropout
- d7905dd fix dropout seeds
- 764e06e correct random seed on gpu
- a3e0f59 finalize random seed for dropout
- c48f88a finalize random seed for dropout
- ce87cb6 remove duplicate line
- d418dd0 correct half precision bug
- 3248e67 make style
- 6fe0648 refactor
- c3031b8 refactor
- 6c2be30 docstring
- 3d266fb remove sinusoidal position encodings for reformer
- 1be343f move chunking to modeling_utils
- a10eb2e make style
- f31b570 clean config
- b2a660f make style
- dfc1f64 fix tests
- 2e95c17 fix auto tests
- b95f6ae pretrained models
- a6f69cb fix docstring
- 59868f3 update conversion file
- a81c3e0 Update pretrained_models.rst
- c0ddf94 fix rst
- 62a8eb0 fix rst
- 47e5fc8 update copyright
- b6576c8 fix test path
- a111720 fix test path
- ff5e783 fix small issue in test
- f7f949b include reformer in generation tests
- 91472b8 add docs for axial position encoding
- 6ed2fa8 finish docs
- 963bb5e Update convert_reformer_trax_checkpoint_to_pytorch.py
- 425b185 remove isort
- 3336d8f include sams comments
- 54eb629 remove wrong comment in utils
- e4e1e59 correct typos
- 5f5c89b fix typo
- 7fdf16b Update reformer.rst
- 7ccec6a applied morgans optimization
- 3978afa make style
- 01b1006 make gpu compatible
- e983a69 remove bogus file
- 9ce32f0 big test refactor
- 67f02c0 add example for chunking
- 4e7252a fix typo
- ca4dab3 add to README
New file (@@ -0,0 +1,114 @@):
Reformer
----------------------------------------------------

**DISCLAIMER:** This model is still a work in progress. If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.

Overview
~~~~~~~~~~~~~~~~~~~~

The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ by Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya.
Here is the abstract:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*

The authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`_.
Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Axial Positional Encodings were first implemented in Google's `trax library <https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`_ and developed by the authors of this model's paper. In models that process very long input sequences, the conventional position id encodings store an embedding vector of size :math:`d` (the ``config.hidden_size``) for every position :math:`1, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix:

.. math::
    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:
.. math::
    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]

and

.. math::
    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]

with:

.. math::
    d = d^1 + d^2 \text{ and } n_s = n_s^1 \times n_s^2 .

Therefore the following holds:
.. math::
    X_{i,j} = \begin{cases}
        X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
        X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
    \end{cases}

Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the concatenation of two factorized embedding vectors, :math:`x^1_{k}` and :math:`x^2_{l}`, where the ``config.max_embedding_size`` dimension :math:`j` is factorized into :math:`k` and :math:`l`.
This design ensures that each position embedding vector :math:`x_j` is unique.
Using the above example again, axial position encoding with :math:`d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}` drastically reduces the number of parameters to :math:`2^{18} + 2^{19} \approx 780000`.

In practice, the parameter ``config.axial_pos_embds_dim`` is set to a ``list`` :math:`(d^1, d^2)`, whose sum has to be equal to ``config.hidden_size``, and ``config.axial_pos_shape`` is set to a ``list`` :math:`(n_s^1, n_s^2)`, whose product has to be equal to ``config.max_embedding_size``, which during training has to be equal to the sequence length of the ``input_ids``.
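As a toy illustration, the factorization above can be sketched in a few lines of pure Python (a hypothetical sketch, not the trax or transformers implementation; all names and sizes here are made up):

```python
import random

d1, d2 = 4, 4          # toy embedding split; d = d1 + d2
ns1, ns2 = 3, 5        # factorized sequence axes; n_s = ns1 * ns2

rng = random.Random(0)
# Store only d1*ns1 + d2*ns2 numbers instead of (d1 + d2) * ns1 * ns2
X1 = [[rng.random() for _ in range(ns1)] for _ in range(d1)]
X2 = [[rng.random() for _ in range(ns2)] for _ in range(d2)]

def axial_embedding(j):
    """x_j = concatenation of column k of X1 and column l of X2."""
    k = j % ns1        # k = j mod ns1
    l = j // ns1       # l = floor(j / ns1)
    return tuple(row[k] for row in X1) + tuple(row[l] for row in X2)

embeddings = [axial_embedding(j) for j in range(ns1 * ns2)]
```

Every one of the :math:`n_s = 15` positions gets a unique vector of size :math:`d = 8`, while only :math:`d^1 n_s^1 + d^2 n_s^2 = 32` numbers are stored.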
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~

In Locality sensitive hashing (LSH) self attention, the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied.
LSH self attention uses the locality sensitive hashing mechanism proposed in `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`_ to assign each of the tied key query embedding vectors to one of ``config.num_buckets`` possible buckets. The premise is that the more "similar" (in terms of *cosine similarity*) key query embedding vectors are to each other, the more likely they are to be assigned to the same bucket.
The accuracy of the LSH mechanism can be improved by increasing ``config.num_hashes`` (or the ``num_hashes`` argument of the forward function directly), so that the output of LSH self attention better approximates the output of "normal" full self attention.
The buckets are then sorted and chunked into query key embedding vector chunks, each of length ``config.lsh_chunk_length``. Within each chunk, the query embedding vectors attend to their key vectors (which are tied to themselves) and to the key embedding vectors of the ``config.lsh_num_chunks_before`` previous neighboring chunks and the ``config.lsh_num_chunks_after`` following neighboring chunks.
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`_ or this great `blog post <https://www.pragmatic.ml/reformer-deep-dive/>`_.
Note that ``config.num_buckets`` can also be factorized into a ``list`` :math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way, instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})`, they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.

It is recommended to leave ``config.num_buckets=None``, so that a good value for ``num_buckets`` is calculated on the fly depending on the sequence length.

Using LSH self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck in a transformer model, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, with :math:`n_s` being the sequence length.
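The bucketing step of the cited angular-LSH scheme (project onto random rotations, then take the argmax over the projections and their negations) can be sketched as follows; this is a hypothetical toy, not the model's actual ``hash_vectors`` implementation:

```python
import random

def lsh_bucket(vec, rotations):
    """Angular LSH: bucket index = argmax over [R.x ; -R.x]."""
    proj = [sum(r * x for r, x in zip(rot, vec)) for rot in rotations]
    scores = proj + [-p for p in proj]   # 2 * n_rot candidate buckets
    return max(range(len(scores)), key=lambda i: scores[i])

rng = random.Random(0)
dim, n_rot = 8, 4                        # 2 * n_rot = 8 buckets
rotations = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_rot)]

v = [rng.gauss(0, 1) for _ in range(dim)]
w = [x * 2.0 for x in v]                 # same direction, different norm
```

Because the hash depends only on angles, ``v`` and its rescaled copy ``w`` land in the same bucket, which is why tying the key and query projections is compatible with the cosine-similarity premise above.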
Local Self Attention
~~~~~~~~~~~~~~~~~~~~

Local self attention is essentially a "normal" self attention layer with key, query and value projections, but chunked so that within each chunk of length ``config.local_chunk_length`` the query embedding vectors only attend to the key embedding vectors in their chunk and to the key embedding vectors of the ``config.local_num_chunks_before`` previous neighboring chunks and the ``config.local_num_chunks_after`` following neighboring chunks.

Using Local self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck in a transformer model, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times c)`, with :math:`n_s` being the sequence length and :math:`c` the chunk length.
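Which key positions a given query may attend to under this chunking can be sketched with a small hypothetical helper (the parameter names mirror the config values above, but this is not library code):

```python
def local_attention_span(pos, chunk_len, n_before, n_after, seq_len):
    """Key positions a query at `pos` may attend to under chunked local attention."""
    chunk = pos // chunk_len                          # which chunk the query sits in
    lo = max(0, (chunk - n_before) * chunk_len)       # start of earliest visible chunk
    hi = min(seq_len, (chunk + n_after + 1) * chunk_len)  # end of latest visible chunk
    return range(lo, hi)

# Query at position 10 in chunks of 4, looking one chunk back and none forward:
# own chunk covers [8, 12), previous chunk covers [4, 8)
span = local_attention_span(pos=10, chunk_len=4, n_before=1, n_after=0, seq_len=16)
```

(For causal attention the future positions inside the span would additionally be masked.)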
Training
~~~~~~~~~~~~~~~~~~~~

During training, we must ensure that the sequence length is set to a value that is divisible by the least common multiple of ``config.lsh_chunk_length`` and ``config.local_chunk_length``, and that the parameters of the Axial Positional Encodings are correctly set as described above. Reformer is very memory efficient, so the model can easily be trained on sequences as long as 64000 tokens.
For training, the ``ReformerModelWithLMHead`` should be used as follows:

::

    input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
    loss = model(input_ids, labels=input_ids)[0]
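The divisibility constraint above can be verified with a small helper (a hypothetical utility, not part of the library):

```python
import math

def valid_train_length(seq_len, lsh_chunk_length, local_chunk_length):
    """True iff seq_len is a multiple of lcm(lsh_chunk_length, local_chunk_length)."""
    lcm = lsh_chunk_length * local_chunk_length // math.gcd(lsh_chunk_length, local_chunk_length)
    return seq_len % lcm == 0

# e.g. chunk lengths 64 and 96 have lcm 192, so 192 works but 100 does not
```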
ReformerConfig
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerConfig
    :members:


ReformerTokenizer
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerTokenizer
    :members:


ReformerModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModel
    :members:


ReformerModelWithLMHead
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModelWithLMHead
    :members:
@@ -34,12 +34,18 @@ def gelu_new(x):

    else:
        gelu = F.gelu


    def gelu_fast(x):
        return 0.5 * x * (1 + torch.tanh(x * 0.7978845608 * (1 + 0.044715 * x * x)))


    ACT2FN = {
        "relu": F.relu,
        "swish": swish,
        "gelu": gelu,
        "tanh": torch.tanh,
        "gelu_new": gelu_new,
        "gelu_fast": gelu_fast,
    }

Contributor comment (on ``gelu_fast``): Is this different than the builtins? Maybe add docstring?
Contributor comment: Also if it isn't faster, does it need a new name?
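For context on the reviewer's question: ``gelu_fast`` is the standard tanh approximation of GELU (the constant ``0.7978845608`` is approximately sqrt(2/pi)). A quick sketch comparing it against the exact erf-based GELU, using plain ``math`` instead of torch so it stands alone:

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_fast(x):
    """tanh approximation from the diff; 0.7978845608 ~ sqrt(2/pi)."""
    return 0.5 * x * (1 + math.tanh(x * 0.7978845608 * (1 + 0.044715 * x * x)))

# The approximation tracks the exact GELU closely over a typical range
max_err = max(abs(gelu_fast(x / 10.0) - gelu_exact(x / 10.0)) for x in range(-50, 51))
```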