
How to enable tokenizer padding option in feature extraction pipeline? #9671


Description

@bowang-rw-02

I am trying to use your pipeline() to extract features for sentence tokens.
Because my sentences are not all the same length, and I am going to feed the token features to RNN-based models, I want to pad the sentences to a fixed length so that the extracted features all have the same size.
Before learning about the convenient pipeline() method, I was using a more general approach to get the features, which works fine but is inconvenient, like this:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.'

# Pad/truncate the sentence to a fixed length of 40 tokens
encoded_input = tokenizer(text, padding='max_length', truncation=True, max_length=40)
indexed_tokens = encoded_input['input_ids']
segments_ids = encoded_input['token_type_ids']
attention_mask = encoded_input['attention_mask']

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
mask_tensor = torch.tensor([attention_mask])

model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

with torch.no_grad():
    # Pass the tensors by keyword so the segment ids are not mistaken for the attention mask
    outputs = model(tokens_tensor, attention_mask=mask_tensor, token_type_ids=segments_tensors)
    # With output_hidden_states=True, the per-layer hidden states are the third output
    hidden_states = outputs[2]

Then I also need to merge (or select) the features from the returned hidden_states myself (for example, as sketched below), and finally get a [40, 768] padded feature matrix for the sentence's tokens, as I want. However, as you can see, it is very inconvenient.
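For illustration, one way I do that selection step, here just taking the last hidden layer (other merging schemes, such as averaging the last few layers, are of course also possible):

# hidden_states is a tuple with one tensor per layer (plus the embeddings),
# each of shape [1, 40, 768]; take the last layer and drop the batch dimension
token_features = hidden_states[-1].squeeze(0)
print(token_features.shape)  # torch.Size([40, 768])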
Compared to that, the pipeline method works very well and easily, needing only the following few lines of code.

from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text)

Then I can directly get the token features for the original-length sentence, which is [22, 768].
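(Strictly speaking, what comes back from the pipeline is a nested Python list; converting it to a numpy array is a quick way to double-check the shape, where the leading 1 is just the batch dimension:)

import numpy as np

arr = np.array(features)
print(arr.shape)     # (1, 22, 768) for this sentence
print(arr[0].shape)  # (22, 768), i.e. one 768-dim vector per token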

However, how can I enable the padding option of the tokenizer in the pipeline?
From #9432 and #9576 I learned that truncation options can now be passed to the pipeline object (here called nlp), so I imitated that and wrote this code:

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text, padding='max_length', truncation=True, max_length=40)

The program did not throw an error, but it just returned a [512, 768] matrix...?
So is there any way to correctly enable the padding options? Thank you!
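In case it helps to clarify what I am after, this is roughly the call I would expect to work. I am not sure such an argument exists in my version; the tokenize_kwargs name below is only my assumption about how the padding/truncation options could be forwarded to the tokenizer:

from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
# Hypothetical: forward the tokenizer options through the pipeline call
features = nlp(text, tokenize_kwargs={'padding': 'max_length', 'truncation': True, 'max_length': 40})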
