LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens #17092
Merged
sgugger merged 4 commits into huggingface:main on May 9, 2022
…samples and images in LayoutLMv2Processor
The documentation is not available anymore as the PR was closed or merged.
sgugger
approved these changes
May 5, 2022
Collaborator
sgugger
left a comment
LGTM but let's wait for @NielsRogge to have a look too!
NielsRogge
approved these changes
May 5, 2022
Collaborator
NielsRogge
left a comment
LGTM, thanks for improving!
sgugger
reviewed
May 6, 2022
Collaborator
sgugger
left a comment
Are you sure you have the exact same version of black as is pinned in our setup? The CI check for style is passing on master, so none of the reformatting unrelated to the changes in your PR is necessary.
The easiest might be to revert your last commit once you have made sure of the version of black, as black doesn't undo the lines it adds.
…g unrelated formatting changes
Collaborator
Thanks again for your contribution!
Thanks for handling this @ghlai9665
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
… in case of overflowing tokens (huggingface#17092)
* add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor
* make style
* add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes
* change line length by passing --preview into black
What does this PR do?
Fixes #13554
Problem re-summarized: when `return_offsets_mapping` is set to True, `LayoutLMv2Processor` would break up sequences that are too long into multiple `input_ids` sequences, causing a mismatch between `input_ids` (longer in length in the case of overflowing tokens) and `images`.
This fix would ensure the 1-to-1 mapping between the `images` and `input_ids`.
Reproducible Example: (The assertion at the end would fail without the fix, pass with the fix)
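The following is not the PR's reproduction script. It is a dependency-free sketch of the mapping the fix is meant to establish, assuming the tokenizer output exposes an `overflow_to_sample_mapping` list that records which original sample each `input_ids` chunk came from. The function name mirrors the `get_overflowing_images` helper named in the commit message, but the body, the string stand-ins for images, and the sample token ids are illustrative assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def get_overflowing_images(images: Sequence[T], overflow_to_sample_mapping: Sequence[int]) -> List[T]:
    """Return one image per chunk by repeating each image for every chunk of its sample.

    Illustrative sketch only; not the code merged in the PR.
    """
    for sample_idx in overflow_to_sample_mapping:
        # Every chunk must point at an image that actually exists.
        if not 0 <= sample_idx < len(images):
            raise ValueError(f"Sample index {sample_idx} has no matching image")
    return [images[sample_idx] for sample_idx in overflow_to_sample_mapping]


# Hypothetical data: two documents, the first of which overflowed into three chunks.
images = ["image_0", "image_1"]
overflow_to_sample_mapping = [0, 0, 0, 1]
input_ids = [[101, 7, 102], [101, 8, 102], [101, 9, 102], [101, 5, 102]]

images_per_chunk = get_overflowing_images(images, overflow_to_sample_mapping)

# Without the remapping there would be 2 images for 4 chunks.
assert len(images_per_chunk) == len(input_ids)
```

Indexing by the sample mapping, rather than slicing or padding the image list, keeps the pairing correct even when several samples in the same batch overflow by different amounts.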
Required Input from Reviewers
Right now, the `LayoutLMv2Processor` would return a list for `encoded_inputs["image"]`, regardless of the value of `return_tensors`. If we want it to return a torch tensor in the case `return_tensors=="pt"`, we have to `torch.stack` the list (and do a similar thing to support "np" and "tf").
Should I implement this in `get_overflowing_images`? Or should I just leave the return type as list and just print a warning?
Who can review?
@NielsRogge @sgugger @LysandreJik
P.S.
The `test_processor_case_1` in `test_processor_layoutlmv2.py` fails before this PR. I'd be happy to look at it as well but it's unrelated to this PR.
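On the question raised under "Required Input from Reviewers", one way the stacking option could look is sketched below. `stack_images` is a hypothetical helper name, not part of the PR or of the library; the sketch assumes every image in the list has the same shape and uses only the standard `torch.stack`, `numpy.stack`, and `tf.stack` calls, importing each framework lazily so that only the requested one needs to be installed.

```python
def stack_images(images, return_tensors=None):
    """Hypothetical helper: combine a list of per-chunk images into a single batch.

    Assumes all images share one shape. With return_tensors=None the list is
    returned unchanged, which matches the list behaviour described in the PR.
    """
    images = list(images)
    if return_tensors is None:
        return images
    if return_tensors == "pt":
        import torch  # imported lazily; only needed for this branch

        return torch.stack(images)
    if return_tensors == "np":
        import numpy as np

        return np.stack(images)
    if return_tensors == "tf":
        import tensorflow as tf

        return tf.stack(images)
    raise ValueError(f"Unsupported return_tensors value: {return_tensors!r}")
```

The alternative mentioned in the question, leaving the return type as a list and emitting a warning, avoids these framework branches at the cost of a return type that does not follow `return_tensors`.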