LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens by garyhlai · Pull Request #17092 · huggingface/transformers

garyhlai · 2022-05-05T05:21:34Z

What does this PR do?

Problem re-summarized: when return_overflowing_tokens is set to True, LayoutLMv2Processor breaks sequences that are too long into multiple input_ids sequences, so input_ids ends up with more entries than images and the two no longer line up.
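To make the mismatch concrete, here is a toy sketch in plain Python. It is not the tokenizer's code: chunk_with_overflow is a hypothetical helper that ignores special tokens and padding, and it assumes that consecutive windows of the same sample share `stride` tokens.

```python
def chunk_with_overflow(samples, max_length, stride):
    """Split each token list into windows of at most max_length tokens.

    Consecutive windows of one sample overlap by `stride` tokens. Returns the
    windows plus, for each window, the index of the sample it came from.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks, overflow_to_sample_mapping = [], []
    for sample_idx, tokens in enumerate(samples):
        start = 0
        while True:
            chunks.append(tokens[start:start + max_length])
            overflow_to_sample_mapping.append(sample_idx)
            if start + max_length >= len(tokens):
                break  # this window reached the end of the sample
            start += step
    return chunks, overflow_to_sample_mapping


samples = [list(range(10)), list(range(3))]  # two "documents" of token ids
images = ["page_0.png", "page_1.png"]        # one image per document
chunks, mapping = chunk_with_overflow(samples, max_length=4, stride=1)
```

The first document is split into three windows and the second fits into one, so there are four windows but still only two images. The per-window sample index in `mapping` is the information needed to pair each window with its image again.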

This fix would ensure the 1-to-1 mapping between the images and input_ids.
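A minimal sketch of the idea, assuming the encoding exposes an overflow_to_sample_mapping list with one source-sample index per input_ids sequence. The PR names the real method get_overflowing_images and mentions raising a ValueError; the function name and body below are illustrative assumptions, not the merged code.

```python
def repeat_images_for_overflow(images, overflow_to_sample_mapping):
    """Return one image per encoded sequence.

    `overflow_to_sample_mapping[i]` is the index of the original sample that
    encoded sequence i was cut from, so the image of a long sample is
    repeated once for every sequence derived from it.
    """
    images_with_overflow = []
    for sample_idx in overflow_to_sample_mapping:
        if not 0 <= sample_idx < len(images):
            raise ValueError(
                f"Sample index {sample_idx} has no matching image "
                f"(only {len(images)} images were given)."
            )
        images_with_overflow.append(images[sample_idx])
    return images_with_overflow


# Sample 0 overflowed into three sequences, sample 1 fit into one.
aligned = repeat_images_for_overflow(["img_a", "img_b"], [0, 0, 0, 1])
```

After this step the list of images has the same length as the list of encoded sequences, which is the property the assertion in the reproducible example checks.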

Reproducible Example: (The assertion at the end would fail without the fix, pass with the fix)

from PIL import Image
from datasets import load_dataset
from transformers import LayoutLMv2Processor

datasets = load_dataset("nielsr/funsd")
labels = datasets["train"].features["ner_tags"].feature.names
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

# Words and boxes are passed in from the dataset, so the "no_ocr" revision is used.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")


def preprocess_data(examples):
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    words = examples["words"]
    boxes = examples["bboxes"]
    word_labels = examples["ner_tags"]
    # Long samples are split into several sequences that overlap by `stride` tokens.
    encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                               padding="max_length", truncation=True,
                               return_overflowing_tokens=True,
                               stride=50,
                               return_offsets_mapping=True,
                               return_tensors="pt")
    return encoded_inputs


train_data = preprocess_data(datasets["train"])

# This assertion fails without the fix in this PR.
assert len(train_data["image"]) == len(train_data["input_ids"])

Required Input from Reviewers

Right now, LayoutLMv2Processor returns a list for encoded_inputs["image"], regardless of the value of return_tensors. If we want it to return a torch tensor when return_tensors == "pt", we have to torch.stack the list (and do something similar to support "np" and "tf").

Should I implement this in get_overflowing_images? Or should I leave the return type as a list and just print a warning?
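For illustration, the stacking option could look roughly like the sketch below. stack_images is a hypothetical helper, not part of the processor, and it assumes that every element of the list already has the type matching return_tensors (for example torch tensors when return_tensors == "pt"). The framework imports are done lazily so that only the requested one needs to be installed.

```python
def stack_images(images, return_tensors=None):
    """Turn a list of per-sequence images into one batched array or tensor.

    With return_tensors=None the list is returned unchanged, which matches
    the "leave the return type as a list" option.
    """
    if return_tensors is None:
        return images
    if return_tensors == "np":
        import numpy as np
        return np.stack(images)
    if return_tensors == "pt":
        import torch
        # torch.stack expects a sequence of tensors of identical shape.
        return torch.stack(images)
    if return_tensors == "tf":
        import tensorflow as tf
        return tf.stack(images)
    raise ValueError(f"return_tensors={return_tensors!r} is not handled in this sketch")


image_list = [[0.0, 1.0], [2.0, 3.0]]
unchanged = stack_images(image_list)  # no framework requested: still a list
```

The trade-off is the one raised above: stacking gives callers a single batched tensor, while returning the list keeps the change smaller.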

Who can review?

@NielsRogge @sgugger @LysandreJik

P.S.

The test_processor_case_1 in test_processor_layoutlmv2.py fails before this PR. I'd be happy to look at it as well but it's unrelated to this PR.


HuggingFaceDocBuilderDev · 2022-05-05T05:38:39Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

LGTM but let's wait for @NielsRogge to have a look too!

NielsRogge

LGTM, thanks for improving!

sgugger

Are you sure you have the exact same version of black as is pinned in our setup? The CI check for style is passing on master, so none of the reformatting unrelated to the changes in your PR is necessary.
The easiest might be to revert your last commit once you have made sure of the version of black, as black doesn't undo the lines it adds.


sgugger · 2022-05-09T11:39:16Z

Thanks again for your contribution!

timothyjlaurent · 2022-05-14T01:47:13Z

Thanks for handling this @ghlai9665

… in case of overflowing tokens (huggingface#17092)
* add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor
* make style
* add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes
* change line length by passing --preview into black

add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor

1111e76

garyhlai mentioned this pull request May 5, 2022

LayoutLMv2 processing doesn't handle tokenizer overflow #13554

Closed

make style

212e032

sgugger approved these changes May 5, 2022


Comment thread src/transformers/models/layoutlmv2/processing_layoutlmv2.py Outdated

NielsRogge approved these changes May 5, 2022


sgugger reviewed May 6, 2022


Comment thread src/transformers/models/layoutlmv2/processing_layoutlmv2.py Outdated

add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes

6574644

garyhlai force-pushed the main branch from 1351624 to 6574644 Compare May 6, 2022 22:05

change line length by passing --preview into black

330c8b8

sgugger merged commit e9fd583 into huggingface:main May 9, 2022

ducviet00 mentioned this pull request May 11, 2022

Add LayoutLMv3 #17060

Merged

5 tasks

This was referenced Aug 23, 2022

KeyError 'overflow_to_sample_mapping' when using LayoutXLM with regular Tokenizer + return_overflowing_tokens #18726

Closed

LayoutXLMProcessor: Enforce using "return_overflowing_tokens" with "return_offsets_mapping" #18774

Merged

sgugger merged 4 commits into huggingface:main from garyhlai:main


5 participants
