
Tokenizers v3.0.0 #3185

Merged
mfuntowicz merged 67 commits into master from tokenizers-v3.0.0 on Apr 6, 2020
Conversation

@mfuntowicz
Member

No description provided.

Member

@LysandreJik LysandreJik left a comment

Overall, this is quite a different test suite than what we have in test_modeling_common and test_tokenization_common, but in a good way, imo.

The use of subTest will greatly help debugging, and splitting between test_xxx methods and assert_xxx methods makes the code cleaner and easier to read.
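To make the pattern concrete, here is a minimal sketch of the test_xxx / assert_xxx split combined with subTest (class and method names are hypothetical, assuming a TOKENIZERS_CLASSES attribute like the one quoted below; this is not the actual test file):

import unittest

class FastTokenizerMatchingTest(unittest.TestCase):
    # Hypothetical illustration: the test_xxx method only iterates over
    # configurations and wraps each one in a subTest, while the actual
    # assertions live in a reusable assert_xxx helper.
    def test_padding(self):
        for name, rust_cls, python_cls, vocab_key in self.TOKENIZERS_CLASSES:
            for pretrained_name in python_cls.pretrained_vocab_files_map[vocab_key].keys():
                with self.subTest("{} ({})".format(name, pretrained_name)):
                    tokenizer_r = rust_cls.from_pretrained(pretrained_name)
                    tokenizer_p = python_cls.from_pretrained(pretrained_name)
                    self.assert_padding(tokenizer_r, tokenizer_p)

    def assert_padding(self, tokenizer_r, tokenizer_p):
        # Assertions comparing the fast (Rust) and slow (Python) tokenizers go here.
        ...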

Comment thread tests/test_tokenization_fast.py Outdated
Comment on lines +32 to +36
TOKENIZERS_CLASSES = frozenset([
    Tokenizer("Bert", BertTokenizerFast, BertTokenizer, "vocab_file"),
    Tokenizer("DistilBert", DistilBertTokenizerFast, DistilBertTokenizer, "vocab_file"),
    Tokenizer("Roberta", RobertaTokenizerFast, RobertaTokenizer, "vocab_file"),
])
Member

frozenset + named tuple: nice!
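For context, the Tokenizer entries above are presumably built from a named tuple roughly like this (a reconstruction based on how the tuple is unpacked in the loop below; the actual field names in the test file may differ):

from collections import namedtuple

# Hypothetical definition matching the (name, rust_cls, python_cls, vocab_key) unpacking.
Tokenizer = namedtuple("Tokenizer", ["name", "rust_cls", "python_cls", "vocab_key"])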

Comment thread tests/test_tokenization_fast.py Outdated
Comment on lines +43 to +45
for (name, rust_cls, python_cls, vocab_key) in self.TOKENIZERS_CLASSES:
    for pretrained_name in python_cls.pretrained_vocab_files_map[vocab_key].keys():
        with self.subTest("{} ({})".format(name, pretrained_name)):
Member

This seems to me like the optimal organisation. Testing every checkpoint of every tokenizer, organized in subTests, is really thorough.

Does it take a while? Should we mark this as slow or is it fast enough?

Member Author

It takes around 1 min to do everything.

Comment on lines +266 to +265
padded_tokens_r = list(takewhile(lambda i: i == tokenizer_r.pad_token_id, reversed(input_r)))
padded_tokens_p = list(takewhile(lambda i: i == tokenizer_p.pad_token_id, reversed(input_p)))
Member

We usually try to use lambdas as little as possible, as they're usually a bit hard to read. cc @thomwolf
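For illustration, a lambda-free equivalent could use a small named helper instead (a sketch only, reusing the variable names from the snippet above):

def count_trailing_pads(input_ids, pad_token_id):
    # Count padding tokens at the end of the sequence.
    count = 0
    for token_id in reversed(input_ids):
        if token_id != pad_token_id:
            break
        count += 1
    return count

padded_len_r = count_trailing_pads(input_r, tokenizer_r.pad_token_id)
padded_len_p = count_trailing_pads(input_p, tokenizer_p.pad_token_id)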

Member

hmm here I think it's fine, no?

Member

In tests I'm more OK with them.

Member Author

👍

Member

@thomwolf thomwolf left a comment

This is looking great, just a few comments on things to check

Comment thread src/transformers/tokenization_utils.py Outdated
Comment thread src/transformers/tokenization_utils.py Outdated
Comment thread src/transformers/tokenization_utils.py Outdated
Comment on lines +1964 to +1965
return_token_type_ids: bool = True,
return_attention_mask: bool = True,
Member

You should rebase on master if you can, because I think this is now Optional[bool] = None since @LysandreJik worked on adapting the output to the models.

Member Author

I'm not sure I fully understand why bool should be Optional? A default value would be more understandable imho and would remove the need for None checking, wdyt?

Member

You should check with @LysandreJik :)

Member

The spirit behind having those values as None by default is the following: if the value is None, then all those returns are set to the tokenizer-specific defaults. These differ from tokenizer to tokenizer, e.g. DistilBERT should by default return input_ids and attention_mask, but not token_type_ids, as the model cannot handle them. This in turn allows the user to do the following:

inputs = tokenizer.encode_plus(values, return_tensors="pt")
model(**inputs)

And this now works with every model.

These values can still be explicitly set to True or False by the user. See #3116 for more information/implementation details.
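For example, a user can still force a field off (or on) regardless of the model. A minimal sketch, assuming tokenizer and values as in the example above:

# None (the default) defers to the tokenizer-specific behaviour described above;
# an explicit boolean overrides it.
inputs = tokenizer.encode_plus(values, return_token_type_ids=False, return_tensors="pt")
assert "token_type_ids" not in inputs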

mfuntowicz and others added 26 commits March 26, 2020 15:00
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…Fast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…ded to the output #3091

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
Contributor

@n1t0 n1t0 left a comment

A few details here and there, but otherwise looks good to me!

) -> List[Encoding]:
    if sequences is None:
        raise ValueError(
            "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
Contributor

I think these messages could be made more specific for each method. This one should probably just say list/tuple of strings.
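i.e. something along these lines (just a sketch of the suggested wording):

if sequences is None:
    raise ValueError("Input is not valid. Should be a list/tuple of strings.")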

def encode(self, sequence: str, pair: Optional[str] = None, add_special_tokens: bool = False) -> Encoding:
    if sequence is None:
        raise ValueError(
            "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
Contributor

This one should probably just say string

Comment thread src/transformers/tokenization_utils.py Outdated
encoding_dict["special_tokens_mask"].append(e.special_tokens_mask)
if return_offsets_mapping:
encoding_dict["offset_mapping"].append([e.original_str.offsets(o) for o in e.offsets])
encoding_dict["offset_mapping"] = [o for o in e.offsets]
Contributor

Should it be like above?

Suggested change:
- encoding_dict["offset_mapping"] = [o for o in e.offsets]
+ encoding_dict["offset_mapping"].append(e.offsets)

Comment thread src/transformers/tokenization_utils.py Outdated

if batch_text_or_text_pairs is None:
    raise ValueError(
        "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
Contributor

Same here, I think we can remove the string option

self.assert_embeded_special_tokens(tokenizer_r, tokenizer_p)
self.assert_padding(tokenizer_r, tokenizer_p)
# TODO: enable for v3.0.0
# self.assert_empty_output_no_special_tokens(tokenizer_r, tokenizer_p)
Contributor

Should this be enabled?


# Check for dynamic encoding sequence handling in batch_encode_plus
self.assert_batch_encode_dynamic_overflowing(tokenizer_r)
# Rust correctly handles the space before the mask while python doesnt
Contributor

I think Python is right here. There shouldn't be any space before the <mask> token. This means that Roberta on the fast path should probably have an AddedToken('<mask>', lstrip=True)
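For reference, a minimal sketch of what that could look like with the tokenizers library's AddedToken (illustrative only, not the exact change made in this PR):

from tokenizers import AddedToken

# lstrip=True lets the added token absorb the space preceding "<mask>",
# so " <mask>" and "<mask>" tokenize identically on the fast path.
mask_token = AddedToken("<mask>", lstrip=True, rstrip=False)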

Comment thread tests/test_tokenization_gpt2.py Outdated
# Testing tokenization
tokens = tokenizer.tokenize(sequence, add_prefix_space=True)
rust_tokens = rust_tokenizer.tokenize(sequence)
rust_tokens = rust_tokenizer.tokenize(sequence, add_prefix_space=True)
Contributor

add_prefix_space=True isn't required here I think

mfuntowicz and others added 4 commits March 30, 2020 17:25
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
@mfuntowicz mfuntowicz marked this pull request as ready for review March 31, 2020 13:11
mfuntowicz and others added 13 commits March 31, 2020 16:30
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
… for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
…utes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
@codecov-io

codecov-io commented Apr 1, 2020

Codecov Report

Merging #3185 into master will decrease coverage by 0.24%.
The diff coverage is 80.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3185      +/-   ##
==========================================
- Coverage   77.79%   77.55%   -0.25%     
==========================================
  Files         100      100              
  Lines       17025    17105      +80     
==========================================
+ Hits        13245    13265      +20     
- Misses       3780     3840      +60
Impacted Files Coverage Δ
src/transformers/tokenization_bert.py 95.33% <ø> (-1.7%) ⬇️
src/transformers/tokenization_roberta.py 94.36% <100%> (-5.64%) ⬇️
src/transformers/pipelines.py 74.51% <100%> (-0.28%) ⬇️
src/transformers/tokenization_transfo_xl.py 40.67% <100%> (-0.43%) ⬇️
src/transformers/tokenization_utils.py 86.18% <78.61%> (-5.81%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7420a6a...860cf66. Read the comment docs.

@LysandreJik LysandreJik self-requested a review April 1, 2020 16:15
Member

@LysandreJik LysandreJik left a comment

This isn't easy to review as the diff mixes subtractions from PreTrainedTokenizer and additions from BatchEncoding; other than that, cool! Thanks @mfuntowicz :)

Comment on lines +48 to +52
# Define type aliases
TextInput = str
TextPairInput = Tuple[str, str]
PreTokenizedInput = List[str]
PreTokenizedInputPair = Tuple[List[str], List[str]]
Member

I like this!
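A quick illustration of how these aliases make signatures self-documenting (hypothetical function and union name, restating the alias definitions quoted above):

from typing import List, Tuple, Union

TextInput = str
TextPairInput = Tuple[str, str]
PreTokenizedInput = List[str]
PreTokenizedInputPair = Tuple[List[str], List[str]]

# The union spells out exactly which input shapes a batch method accepts.
BatchInput = Union[
    List[TextInput],
    List[TextPairInput],
    List[PreTokenizedInput],
    List[PreTokenizedInputPair],
]

def batch_encode_plus(batch_text_or_text_pairs: BatchInput) -> dict:
    ...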

Comment thread src/transformers/tokenization_utils.py Outdated
Comment on lines +172 to +175
Find the Offsets of the token containing the character at the specified position
:param sentence: Index of the sentence relative to the batch provided to the tokenizer.
:param char: Char index to get the relative token offsets
:return: (token start, token end)
Member

(Applicable to most other docstrings) We use Google-style docstrings in the library; could we try to use them here as well?
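For instance, the docstring above rewritten in Google style might look roughly like this (the method name is hypothetical, shown only to illustrate the format):

def char_to_token_offsets(self, sentence, char):  # hypothetical name, for illustration
    """Find the offsets of the token containing the character at the specified position.

    Args:
        sentence: Index of the sentence relative to the batch provided to the tokenizer.
        char: Char index to get the relative token offsets for.

    Returns:
        Tuple of (token start, token end).
    """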

@LysandreJik
Member

Really like the new typings!

@mfuntowicz mfuntowicz merged commit 96ab75b into master Apr 6, 2020
@mfuntowicz mfuntowicz deleted the tokenizers-v3.0.0 branch April 6, 2020 22:29

# Filter out features not available on specific models
inputs = self.inputs_for_model(inputs)
# inputs = self.inputs_for_model(inputs)
Member

Let's remember to remove this for good soon @mfuntowicz
