Avoid importing tensorflow when importing evaluate (#135)

lvwerra merged 5 commits into huggingface:main from
Conversation
The documentation is not available anymore as the PR was closed or merged.
ola13 left a comment
Looks good, thanks @NouamaneTazi! How did you test the change?
I think this won't work for user-defined models that inherit from PreTrainedModel.
I would simply import PreTrainedModel/TFPreTrainedModel directly here inside compute instead of importing it at the top of the file.
Calling compute is expensive anyway, so importing tensorflow is ok I think.
(alternatively you can check if tensorflow has already been imported by checking "tensorflow" in sys.modules AFAIK - if it's not been imported you don't need to import TFPreTrainedModel and check if the model inherits from TFPreTrainedModel)
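A minimal sketch of that alternative (the helper name `is_tf_model` is illustrative, not part of evaluate; only the `"tensorflow" in sys.modules` check is taken from the comment):

```python
import sys

def is_tf_model(model):
    # If tensorflow was never imported, `model` cannot be an
    # already-instantiated TF model, so skip the check entirely.
    if "tensorflow" not in sys.modules:
        return False
    # Safe to import now: tensorflow is already loaded, so this
    # import does not pull it in as a side effect.
    from transformers import TFPreTrainedModel
    return isinstance(model, TFPreTrainedModel)
```

The check costs a dictionary lookup, so it is cheap enough to run on every `compute` call.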
Using the fix in this PR, and by using the PR huggingface/transformers#17684 as well, you can test with:

```python
from evaluate import evaluator
from datasets import Dataset, load_dataset

e = evaluator("text-classification")
data = Dataset.from_dict(load_dataset("imdb")["test"][:2])

# testing that nothing breaks
# from transformers import TFBertModel, BertModel
# tfmodel = TFBertModel.from_pretrained("julien-c/bert-xsmall-dummy")
# model = BertModel.from_pretrained("julien-c/bert-xsmall-dummy")

results = e.compute(
    model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
    data=data,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="bootstrap",
    n_resamples=10,
    random_state=0,
)
```
I'm not sure if `"tensorflow" in sys.modules` is the way to do it, and I agree about the user-defined models @lhoestq
Testing script for a user-defined model:

```python
import torch
from datasets import Dataset, load_dataset
from transformers import BertConfig, BertTokenizer, PreTrainedModel

from evaluate import evaluator

e = evaluator("text-classification")
data = Dataset.from_dict(load_dataset("imdb")["test"][:2])

class CustomModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

    def forward(self, *args, **kwargs):
        return {"logits": torch.zeros(1, 1, 1)}

model_or_pipeline = CustomModel(BertConfig.from_pretrained("julien-c/bert-xsmall-dummy"))

results = e.compute(
    model_or_pipeline=model_or_pipeline,
    tokenizer=BertTokenizer.from_pretrained("julien-c/bert-xsmall-dummy"),
    data=data,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="bootstrap",
    n_resamples=10,
    random_state=0,
)
```
```python
if isinstance(model_or_pipeline, str) or (
    hasattr(model_or_pipeline, "__class__")
    and any(
        cls_name in [parent_cls.__name__ for parent_cls in model_or_pipeline.__class__.__mro__]
        for cls_name in ["PreTrainedModel", "TFPreTrainedModel"]
    )
```
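As a self-contained illustration of this name-based MRO check (with stand-in classes, since the point is precisely not to import the real base classes):

```python
class PreTrainedModelStub:      # stand-in for transformers.PreTrainedModel
    pass

class TinyModel(PreTrainedModelStub):   # real models are always subclasses
    pass

def looks_like_transformers_model(obj):
    # Compare class *names* along the MRO, so the actual base classes
    # never have to be imported in this module.
    parent_names = [parent.__name__ for parent in type(obj).__mro__]
    return any(
        cls_name in parent_names
        for cls_name in ["PreTrainedModelStub", "TFPreTrainedModelStub"]
    )
```

The trade-off is that a name match is weaker than an `isinstance` check: any unrelated class that happens to be called `PreTrainedModel` would also match.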
I'm not sure if `"tensorflow" in sys.modules` is the way to do it. Because if the user uses pipeline, that means they don't know in advance what the model type is, right?
If model_or_pipeline is a pipeline, you don't need to check if it inherits from TFPreTrainedModel, so you don't need to import TFPreTrainedModel ;)
I'll let @sgugger give his opinion, but I think this is reasonable:

```python
import transformers

if "tensorflow" in sys.modules:
    # Check if the model is a TF model only if TF has already been imported.
    # Indeed loading `TFPreTrainedModel` may import tensorflow unnecessarily.
    transformers_pretrained_model_classes = (transformers.PreTrainedModel, transformers.TFPreTrainedModel)
else:
    transformers_pretrained_model_classes = (transformers.PreTrainedModel)
if (
    isinstance(model_or_pipeline, transformers_pretrained_model_classes)
    or isinstance(model_or_pipeline, str)
```
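One subtlety in the suggested snippet: `(transformers.PreTrainedModel)` without a trailing comma is not a tuple, just a parenthesized name. The code still works because `isinstance` accepts either a single class or a tuple of classes, but the 1-tuple spelling is what the variable name promises:

```python
single = (int)       # parentheses only group here: this is the class `int`
one_tuple = (int,)   # the trailing comma makes a real 1-tuple

assert single is int
assert isinstance(one_tuple, tuple)
# isinstance accepts either a class or a tuple of classes:
assert isinstance(3, single) and isinstance(3, one_tuple)
```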
After testing this, it seems that after this line, `from datasets import Dataset, load_dataset`, `"tensorflow" in sys.modules` becomes True. And so it doesn't help with the problem.
Is it just my python?
Yea it's an issue with datasets, let me fix it
Actually it comes from huggingface_hub - let me open a PR
It's already fixed on the main branch of huggingface-hub. I also added tests here for the future: huggingface/huggingface_hub#904 and huggingface/datasets#4482
This suggestion might avoid the load, yes. The current test as it is is useless, as no model is ever just a PreTrainedModel or a TFPreTrainedModel. They are always subclasses of those.
> The current test as it is is useless, as no model is ever just a PreTrainedModel or a TFPreTrainedModel. They are always subclasses of those.

Indeed, using isinstance is required here, and it's fine to import PreTrainedModel (and also TFPreTrainedModel if tensorflow has been imported by the user) - see my suggestion above
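The point about subclasses is easy to demonstrate with stub classes standing in for the transformers ones: `isinstance` walks the inheritance chain, so checking against the base class covers every concrete model:

```python
class PreTrainedModelBase:               # stand-in for transformers.PreTrainedModel
    pass

class ConcreteBert(PreTrainedModelBase): # models are always subclasses
    pass

model = ConcreteBert()
# No model is ever *exactly* the base class...
assert type(model) is not PreTrainedModelBase
# ...but isinstance still matches, because it checks the whole hierarchy:
assert isinstance(model, PreTrainedModelBase)
```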
I said: "This suggestion might avoid the load yes." ;-)
The rest of the comment is on the existing code.
This is actually not enough to not import TF, since importing pipeline (from transformers.pipelines) does import TF. Is this expected?
I wouldn't spend too much time in evaluate trying to not load TF at this point. Users can still provide USE_TF=0 to not import TF in transformers.
It's maybe a better solution to lazy import the evaluator module, so that transformers is not imported when evaluate is imported
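One generic way to do that kind of lazy import (a sketch of the idea, not evaluate's actual mechanism) is a small proxy that defers the import until first attribute access:

```python
import importlib

class LazyModule:
    """Proxy that imports the real module on first attribute access."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found normally, i.e. anything
        # other than _module_name/_module set in __init__.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attr)

# e.g. heavy = LazyModule("transformers"): nothing is imported until an
# attribute like heavy.pipeline is actually touched.
```

For a package `__init__.py`, the same effect can be achieved with a module-level `__getattr__` (PEP 562).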
Yes, pipeline imports TensorFlow if available, to load the default model associated with each pipeline. I would recommend not importing pipeline until it's necessary.
```diff
 if (
-    isinstance(model_or_pipeline, PreTrainedModel)
-    or isinstance(model_or_pipeline, TFPreTrainedModel)
-    or isinstance(model_or_pipeline, str)
+    isinstance(model_or_pipeline, str)
+    or isinstance(model_or_pipeline, transformers.PreTrainedModel)
+    or isinstance(model_or_pipeline, transformers.TFPreTrainedModel)
 ):
```
After some more debugging, here's what I found:

```python
import tensorflow                            # doesn't allocate all GPU memory
import transformers                          # doesn't allocate all GPU memory
from transformers import TFPreTrainedModel   # allocates all GPU memory
```

So as long as we avoid using/importing transformers.TFPreTrainedModel, there shouldn't be a problem. Which is why I believe this is the simplest fix.
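This kind of debugging can be scripted: run an import in a fresh interpreter and report whether it dragged in another module. A small sketch (the helper `pulls_in` is hypothetical; `json` is used as a harmless stand-in, since tensorflow may not be installed):

```python
import subprocess
import sys

def pulls_in(importing, target):
    """Return True if `import importing` causes `target` to end up in
    sys.modules, measured in a fresh Python subprocess."""
    code = f"import {importing}, sys; print({target!r} in sys.modules)"
    out = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    return out.stdout.strip() == "True"

# e.g. pulls_in("evaluate", "tensorflow") is what this PR tries to make False.
```

Running the check in a subprocess matters: the current interpreter may already have imported the target module, which would mask the side effect.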
Again, you can test that this is working by using:

```python
from evaluate import evaluator
from datasets import Dataset, load_dataset

e = evaluator("text-classification")
data = Dataset.from_dict(load_dataset("imdb")["test"][:2])

# testing that nothing breaks
# from transformers import TFBertModel, BertModel
# tfmodel = TFBertModel.from_pretrained("julien-c/bert-xsmall-dummy")
# model = BertModel.from_pretrained("julien-c/bert-xsmall-dummy")

results = e.compute(
    model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
    data=data,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="bootstrap",
    n_resamples=10,
    random_state=0,
)
```

And please note that calling …
lhoestq left a comment
Sounds good to me, indeed it won't call transformers.TFPreTrainedModel if the input is a string or a pytorch model :)
If @sgugger agrees, I think we can merge this.
Double-checked and indeed `import pipeline` does not allocate GPU memory, while importing `TFPreTrainedModel` does.
huggingface/transformers#18044 should fix the fact that importing pipeline imports TensorFlow.
- avoids importing xxPreTrainedModel when checking instance
This PR is no longer necessary as the problem was solved on the transformers side.
Still a good idea to have it so we don't need to pin a too-new version of transformers.
Avoid loading TFPreTrainedModel when it's not used, because tensorflow allocates all GPU memory. Fixes the issue:

Note: also made a fix on pipeline to work as expected huggingface/transformers#17684