I have my own model, which utilizes two T5 encoders, and I train it via DeepSpeed. It has its own save_pretrained() and from_pretrained() methods, which implement custom load/save logic:
https://github.com/exelents/try_t5_siamese/blob/4140194978ac113c45e7370f40b3d9b932d0b35b/siamese_model.py#L80
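To make the setup concrete, here is a minimal sketch of what such a two-encoder model with its own save_pretrained()/from_pretrained() looks like. This is not the code from the linked repo: the class name, the encoder attribute names, and the use of plain nn.Linear layers as stand-ins for the two T5EncoderModel instances are all illustrative assumptions; the real encoders would each delegate to their own Transformers save/load methods.

```python
import os
import torch
import torch.nn as nn

class SiameseModel(nn.Module):
    """Toy stand-in for a siamese two-encoder model.

    In the real model the two encoders are transformers.T5EncoderModel
    instances; plain Linear layers are used here so the sketch is
    self-contained.
    """

    def __init__(self):
        super().__init__()
        self.encoder_q = nn.Linear(512, 512)  # stand-in for T5 encoder #1
        self.encoder_a = nn.Linear(512, 512)  # stand-in for T5 encoder #2

    def save_pretrained(self, save_directory):
        # Custom save logic: write each encoder's weights into its own
        # subfolder of the checkpoint directory.
        for name, module in (("encoder_q", self.encoder_q),
                             ("encoder_a", self.encoder_a)):
            subdir = os.path.join(save_directory, name)
            os.makedirs(subdir, exist_ok=True)
            torch.save(module.state_dict(),
                       os.path.join(subdir, "pytorch_model.bin"))

    @classmethod
    def from_pretrained(cls, load_directory):
        # Custom load logic: restore each encoder from its subfolder.
        model = cls()
        for name, module in (("encoder_q", model.encoder_q),
                             ("encoder_a", model.encoder_a)):
            state = torch.load(
                os.path.join(load_directory, name, "pytorch_model.bin"))
            module.load_state_dict(state)
        return model
```

The point of this layout is that neither saving nor loading goes through the Trainer's or DeepSpeed's default single-file checkpoint path, which is exactly where the conflict described below arises.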
When I run training and the trainer starts to save a checkpoint, something strange happens: the weights file for every saved encoder ends up only a few kilobytes in size, so the weights are not actually being saved.
At the start of training, the trainer tries to load the checkpoint using model.load_checkpoint(), but it seems this function has its own loading logic: it cannot execute my custom load logic and throws an error:
ValueError: [deepspeed] failed to resume from checkpoint ./templates/siamese-t5-small-v1_1-template
I can comment out the code that loads the checkpoint, but then I get the checkpoint-saving problem described above...
What should I do to save my own custom model properly? It worked a month ago, but today I updated my Transformers repo and everything broke.