Extend config with task specific configs #3433
LysandreJik left a comment
I think this makes sense, given where we've been going up to now.
I would like to understand what our philosophy is with the growing size of the configuration files; for example, the bert-base-cased configuration on S3 looks like this:
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
(which is readable imo) and once it's saved it now looks like this:
{
  "_num_labels": 2,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 28996
}
(which is less readable). Are we planning to keep them growing as the tokenizer and model configurations are merged? I feel like adding all those attributes to the configuration saves an "experiment" more than a "model". Is this something we're aiming for?
LysandreJik left a comment
Other than that question, LGTM!
Might it be possible to only save parameters that differ from the default config of the corresponding model? That would keep it readable.
self.label2id = dict((key, int(value)) for key, value in self.label2id.items())

# Tokenizer arguments TODO: eventually tokenizer and models should share the same config
self.prefix = kwargs.pop("prefix", "")
Should it be None instead of ""?
I'd prefer "" here because then we can save some if statements and just write:
[config.prefix + text for text in texts]
in pipelines.py for example.
But then how are you going to avoid serializing it when calling save_pretrained?
We could do a getter for self.prefix or "" if it's easier syntax-wise
Agree, will change that!
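The getter idea suggested above could look something like the following sketch. This is illustrative only (the class and attribute names are assumptions, not the actual library code): store None internally so saving can skip an unset prefix, but expose "" through a property so call sites never need an if-statement.

```python
class ConfigSketch:
    """Hypothetical config sketch, not the real PretrainedConfig."""

    def __init__(self, **kwargs):
        # None means "not set"; it will be omitted on serialization.
        self._prefix = kwargs.pop("prefix", None)

    @property
    def prefix(self):
        # Callers always get a string, so concatenation just works.
        return self._prefix if self._prefix is not None else ""

    def to_dict(self):
        # Only serialize the prefix when it was explicitly set.
        d = {}
        if self._prefix is not None:
            d["prefix"] = self._prefix
        return d


config = ConfigSketch()
texts = ["hello", "world"]
prefixed = [config.prefix + text for text in texts]  # no if-statement needed
unset_serialized = config.to_dict()                  # {} -> prefix not saved
set_serialized = ConfigSketch(prefix="translate: ").to_dict()
```

This keeps both sides of the trade-off discussed above: the saved config stays minimal, and pipelines.py-style code can still concatenate unconditionally.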
My idea was to only serialize attributes that differ from the default config attributes.
So if BartConfig.from_pretrained('bart-base-uncased') is created, we compare it to BartConfig() before serializing and only save params that are different from BartConfig(). It might be easier to just save all non-None parameters, but then we would additionally save all parameters that are not None by default.
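The comparison-based serialization described here can be sketched as follows. The names (SimpleConfig, to_diff_dict) are illustrative assumptions, not the library's actual API:

```python
class SimpleConfig:
    """Stand-in for a model config with a couple of defaults."""

    def __init__(self, vocab_size=30522, hidden_size=768, prefix=None):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.prefix = prefix


def to_diff_dict(config, default_config):
    """Return only the attributes of `config` that differ from the defaults."""
    defaults = vars(default_config)
    return {
        key: value
        for key, value in vars(config).items()
        if key not in defaults or defaults[key] != value
    }


# A fine-tuned config with one overridden value serializes to just that value.
diff = to_diff_dict(SimpleConfig(vocab_size=28996), SimpleConfig())
```

Compared with "save all non-None parameters", this also drops parameters whose defaults happen to be non-None, which is exactly the case raised above.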
LGTM, and I agree with what @LysandreJik and you just said above; for instance, I've always disliked the serialized defaults.
After this is merged I can open a new PR that serializes only the non-default values.
self.decoder_start_token_id = kwargs.pop("decoder_start_token_id", None)

# task specific arguments
self.task_specific_params = kwargs.pop("task_specific_params", None)
Should it be kwargs.pop("task_specific_params", {}) so we can directly test its keys?
Guess it's the same pro/con discussion as the one above about prefix = kwargs.pop("prefix", "") vs. kwargs.pop("prefix", None). If the default is not None, it would at the moment be serialized and saved in all configs that don't have task_specific_params, or am I missing something?
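The trade-off can be made concrete with a small sketch (illustrative names, not the actual library code): a None default keeps the saved config clean, and the read side only needs a one-line guard to test keys directly.

```python
# With a None default, configs without task-specific params serialize nothing
# extra; reading just needs `(params or {})` instead of a {} default in pop().
task_specific_params = None  # i.e. kwargs.pop("task_specific_params", None)


def get_task_params(task_specific_params, task):
    """Look up the params for a task, tolerating an unset (None) container."""
    return (task_specific_params or {}).get(task, {})


params = get_task_params({"summarization": {"max_length": 142}}, "summarization")
missing = get_task_params(None, "summarization")
```

So the {} default mainly buys slightly shorter call sites, at the cost of serializing an empty dict into every config.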
I agree with what @LysandreJik and @julien-c say about serializing only non-default values, by the way.
As discussed and proposed by @thomwolf in PR #3413, this PR is another step towards a combined tokenizer/model config. It extends the normal config with the following parameters:
In terms of hierarchy for task-specific generation, it would go as follows:

1. Are the parameters passed directly to the generate method? Yes: use these. No: go to 2.
2. Are they defined in the task_specific_params dict? Yes: use these. No: go to 3.
3. Are they defined in the config dict? Yes: use these. No: go to 4.
4. Fall back to the PretrainedConfig defaults.

These were our arguments in favor of this:
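The four-level fallback above can be sketched like this; the function and variable names are illustrative assumptions, not the library's actual resolution code:

```python
def resolve(name, call_kwargs, task_params, config_dict, defaults):
    """Resolve a generation parameter through the four-level hierarchy:
    explicit generate() kwargs > task_specific_params > config > defaults."""
    for source in (call_kwargs, task_params, config_dict):
        if source and name in source:
            return source[name]
    return defaults[name]  # 4. PretrainedConfig defaults


defaults = {"max_length": 20, "num_beams": 1}       # PretrainedConfig-level
config_dict = {"num_beams": 4}                       # model config overrides
task_params = {"max_length": 142}                    # e.g. summarization params

length = resolve("max_length", {}, task_params, config_dict, defaults)
beams = resolve("num_beams", {"num_beams": 6}, task_params, config_dict, defaults)
```

Here length comes from the task-specific params (level 2) and beams from the explicit call argument (level 1), since each level shadows everything below it.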
TODO
If you guys are fine with this structure:
- task_specific_params for Bart and T5 configs on S3