I am trying to use the pipeline() to extract token-level features for sentences.
Because my sentences have different lengths, and I am going to feed the token features to RNN-based models, I want to pad each sentence to a fixed length so that all features have the same size.
Before I knew about the convenient pipeline() method, I used a more general approach to get the features. It works fine but is inconvenient:
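import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.'
# Pad (or truncate) the sentence to exactly 40 tokens.
encoded_input = tokenizer(text, padding='max_length', truncation=True, max_length=40)
indexed_tokens = encoded_input['input_ids']
segments_ids = encoded_input['token_type_ids']
attention_mask = encoded_input['attention_mask']
# Wrap each list in another list to add a batch dimension of size 1.
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
mask_tensor = torch.tensor([attention_mask])
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()
with torch.no_grad():
    # Use keyword arguments: the model's second positional argument is attention_mask, not token_type_ids.
    outputs = model(input_ids=tokens_tensor, attention_mask=mask_tensor, token_type_ids=segments_tensors)
hidden_states = outputs[2]  # tuple of per-layer hidden states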
Then I also need to merge (or select) the features from the returned hidden_states myself, and I finally get a [40,768] padded feature matrix for this sentence's tokens, which is what I want. However, as you can see, it is very inconvenient.
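A minimal sketch of what such a merge (or select) step could look like. The helper name merge_hidden_states and the two strategies (take the last layer, or sum the last four layers) are illustrative assumptions, not something the post specifies, and random tensors stand in for the real model output so the snippet runs without downloading a model.

```python
import torch

def merge_hidden_states(hidden_states, strategy="last"):
    """Reduce a tuple of per-layer tensors [1, seq_len, hidden] to [seq_len, hidden].

    Hypothetical helper: the strategies here are common choices, not the
    method used in the original post.
    """
    if strategy == "last":
        # Select only the final layer's output.
        merged = hidden_states[-1]
    elif strategy == "sum_last_four":
        # Stack the last four layers and sum them element-wise.
        merged = torch.stack(hidden_states[-4:], dim=0).sum(dim=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Drop the batch dimension (assumes a batch size of 1).
    return merged.squeeze(0)

# Stand-in for real output: 13 tensors (embeddings + 12 layers for bert-base),
# each shaped [batch=1, seq_len=40, hidden=768].
dummy_hidden_states = tuple(torch.randn(1, 40, 768) for _ in range(13))
token_features = merge_hidden_states(dummy_hidden_states, "sum_last_four")
print(token_features.shape)  # torch.Size([40, 768])
```

With real output, hidden_states from the snippet above would be passed in place of dummy_hidden_states.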
Compared to that, the pipeline method is much easier and needs only the following few lines:
from transformers import AutoModel, AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text)
Then I directly get the token features for the sentence at its original length, which is [22,768].
However, how can I enable the tokenizer's padding option in the pipeline?
After seeing #9432 and #9576, I learned that truncation options can now be passed to the pipeline object (called nlp here), so I imitated that and wrote this code:
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text, padding='max_length', truncation=True, max_length=40)
The program did not throw an error, but it returned a [512,768] vector...?
So is there any way to correctly enable the padding options? Thank you!
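For comparison, one fallback that does not depend on how the pipeline handles tokenizer arguments is to pad the returned features afterwards. The sketch below assumes the per-sentence features form a [seq_len, hidden] nested list (for example features[0] if the pipeline output keeps a batch dimension). The helper pad_token_features is hypothetical, and a random list stands in for the real pipeline output. Note that zero rows are not the same as the vectors the model would produce for [PAD] tokens, so this is not equivalent to padding inside the tokenizer.

```python
import torch

def pad_token_features(features, target_len=40):
    """Zero-pad or truncate a [seq_len, hidden] feature matrix to [target_len, hidden].

    Hypothetical helper for illustration; not part of the transformers API.
    """
    features = torch.as_tensor(features, dtype=torch.float32)
    seq_len, hidden = features.shape
    if seq_len >= target_len:
        # Too long (or exactly right): keep only the first target_len tokens.
        return features[:target_len]
    # Too short: append rows of zeros until the target length is reached.
    padding = torch.zeros(target_len - seq_len, hidden)
    return torch.cat([features, padding], dim=0)

# Stand-in for the pipeline output of one sentence: 22 tokens x 768 dimensions.
dummy_features = torch.randn(22, 768).tolist()
padded = pad_token_features(dummy_features, target_len=40)
print(padded.shape)  # torch.Size([40, 768])
```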