Floating-point operations logging in trainer #6768
LysandreJik merged 33 commits into huggingface:master from TevenLeScao:master
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #6768      +/-   ##
==========================================
+ Coverage   78.47%   79.65%    +1.17%
==========================================
  Files         157      157
  Lines       28569    28625       +56
==========================================
+ Hits        22420    22800      +380
+ Misses       6149     5825      -324
Continue to review full report at Codecov.
sgugger left a comment:
Thanks for the PR! Got a few comments on my side.
src/transformers/trainer.py (outdated):

    tr_loss += self.training_step(model, inputs)
    ...
    try:

Can we make a cleaner test with isinstance(model, nn.DataParallel)?
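The suggestion above can be sketched as an explicit type check instead of a try/except around attribute access (`unwrap_model` is a hypothetical helper name, not necessarily what the PR used):

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # Explicit isinstance test instead of catching AttributeError:
    # nn.DataParallel keeps the wrapped model in its .module attribute.
    return model.module if isinstance(model, nn.DataParallel) else model
```

Constructing `nn.DataParallel` works on CPU-only machines too, so the check is safe regardless of the device setup.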
src/transformers/trainer.py (outdated):

    raise ValueError("Trainer.model appears to not be a PreTrainedModel")
    ...
    xm.rendezvous("saving_checkpoint")
    # Storing the number of floating-point operations that went into the model

Those 7 lines are duplicated, maybe put them in a private method to refactor a bit?
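The suggested refactor could look like the following sketch, where the duplicated "store total FLOs on the model config" lines become one private helper called from both save paths (the helper name `_store_flos` and the stand-in model are illustrative, not taken from the PR):

```python
from types import SimpleNamespace

class Trainer:
    def __init__(self, model):
        self.model = model
        self.total_flos = 0

    def _store_flos(self):
        # Storing the number of floating-point operations that went into the model,
        # guarded so that config-less models don't crash.
        if hasattr(self.model, "config"):
            self.model.config.total_flos = self.total_flos

# usage with a stand-in model object
model = SimpleNamespace(config=SimpleNamespace(total_flos=0))
trainer = Trainer(model)
trainer.total_flos = 12_345
trainer._store_flos()
assert model.config.total_flos == 12_345
```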
src/transformers/trainer.py (outdated):

    concat = concat[:num_total_examples]
    return concat

    def distributed_broadcast_scalars(

This doesn't seem to use self (distributed_concat neither), so maybe those two methods should be functions?

Agree, and I will move them out. Do you think we should keep a redirection to keep distributed_concat backwards-compatible?
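A backwards-compatible redirection of the kind asked about here is usually a thin deprecated method delegating to the new module-level function. A sketch, with a list-based body standing in for the real tensor-gathering logic:

```python
import warnings

def distributed_concat(tensors, num_total_examples=None):
    # Module-level version. Stand-in body: the real implementation gathers
    # torch tensors across processes, concatenates, then truncates.
    concat = [item for t in tensors for item in t]
    if num_total_examples is not None:
        concat = concat[:num_total_examples]
    return concat

class Trainer:
    def distributed_concat(self, tensors, num_total_examples=None):
        # Deprecated shim kept only for backwards compatibility.
        warnings.warn(
            "Trainer.distributed_concat is deprecated; "
            "use the module-level distributed_concat instead.",
            FutureWarning,
        )
        return distributed_concat(tensors, num_total_examples)
```

The `FutureWarning` gives callers a release cycle to migrate before the method is removed.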
src/transformers/modeling_utils.py (outdated):

    another model, either implement such a method in the model or override this method.
    ...
    Args:
        model (:obj:`nn.Module`):

Yep, changed it; this allows us to save a few lines in the main method too.

We can remove the docstring as well.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
LysandreJik left a comment:
LGTM, left a few comments.
src/transformers/trainer.py:

    # in case the model has no config
    combined_dict = {**self.args.to_sanitized_dict()}

Is there an example of a model without a configuration?

Ah yes, it's something @sgugger mentioned as well: when writing for Trainer we assume that the model is a PretrainedModel, but the model in the tests doesn't inherit from PretrainedModel, which is why I put this in. @julien-c also liked the idea of Trainer being domain-agnostic (i.e. not only NLP), so I figured we might as well add this line since it isn't expensive. In the end, I think it's something we might want to think about, since there are a lot of references to model.config (for example when training on TPU, which the tests don't cover).
src/transformers/trainer.py:

    self.total_flos = getattr(model.config, "total_flos", 0)

Wouldn't this fail if the model didn't have a config?

Yes. The dummy test model doesn't hit this line since it doesn't have a method to calculate FLOs, so I didn't catch it. See above; I think we have to decide whether we want to assume the model has a config or not.
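One way to make that line safe for config-less models, as discussed here, is to chain `getattr` calls so both a missing config and a missing attribute fall back to 0 (a sketch, not necessarily what the PR settled on):

```python
from types import SimpleNamespace

def get_stored_flos(model) -> int:
    # Read model.config.total_flos, returning 0 when either the config
    # attribute or total_flos is absent, so plain nn.Module models
    # that are not PretrainedModel subclasses don't crash the Trainer.
    return getattr(getattr(model, "config", None), "total_flos", 0)

# usage with stand-in model objects
with_config = SimpleNamespace(config=SimpleNamespace(total_flos=99))
without_config = SimpleNamespace()
assert get_stored_flos(with_config) == 99
assert get_stored_flos(without_config) == 0
```

The inner `getattr` returns `None` for config-less models, and the outer `getattr` then falls through to the default.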
I think even with domain-agnostic models we'd like to keep the configuration, no? I'm not sure the trainer would behave correctly without a configuration, so if we want to remove the dependency on configurations, we might as well do it all at once, right? Would the goal be to have the trainer accept all …
As agreed upon internally, we will move to Trainer accepting models that instantiate a base abstract class or conform to some protocol. I think the config will be among the required fields, but I have to work a bit more on this to be sure. In any case, this is work for a subsequent PR :-)
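The abstract-class/protocol idea mentioned here might look like the following sketch; the interface name and methods are illustrative only, not taken from the PR:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class TrainableModel(Protocol):
    # Hypothetical interface Trainer could require instead of insisting
    # on PretrainedModel: any object with these methods conforms.
    def floating_point_ops(self, inputs) -> int: ...
    def num_parameters(self, exclude_embeddings: bool = False) -> int: ...

class DummyModel:
    # Conforms structurally; no inheritance from the protocol needed.
    def floating_point_ops(self, inputs) -> int:
        return 0

    def num_parameters(self, exclude_embeddings: bool = False) -> int:
        return 0

assert isinstance(DummyModel(), TrainableModel)
```

A structural protocol would keep Trainer domain-agnostic while still letting it check, at runtime, that the model supplies what FLOs logging needs.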
Commits:
* neFLOs calculation, logging, and reloading (huggingface#1)
* testing distributed consecutive batches
* fixed AttributeError from DataParallel
* removed verbosity
* rotate with use_mtime=True
* removed print
* fixed interaction with gradient accumulation
* indent formatting
* distributed neflo counting
* fixed typo
* fixed typo
* mean distributed losses
* exporting log history
* moved a few functions
* floating_point_ops clarification for transformers with parameter-reuse
* code quality
* double import
* made flo estimation more task-agnostic
* only logging flos if computed
* code quality
* unused import
* Update src/transformers/trainer.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* Update src/transformers/modeling_utils.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* Sylvain review
* Update src/transformers/modeling_utils.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* black (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
First of two PRs to implement #4847:
This directly logs floating-point operations in wandb and comet, and creates a log_history.json file with training metrics. To do so, it adds methods to PretrainedModel to count parameters with and without embeddings, and the number of floating-point operations. It also includes a few Trainer fixes, most importantly averaging the eval loss across processes rather than logging the one from process 0, and a fix for a bug with checkpoint folder creation.
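As a rough illustration of the kind of per-batch estimate such a floating-point-operations method can return (this is a common rule of thumb, not the exact transformers formula): for a transformer without parameter reuse, a forward pass costs about 2 FLOs per token per parameter, and adding the backward pass brings it to about 6.

```python
def estimate_flos(num_tokens: int, num_params: int, backward: bool = True) -> int:
    # Rule of thumb: forward ~= 2 * params FLOs per token; the backward
    # pass roughly doubles that, giving ~6 * params per token in training.
    per_token = (6 if backward else 2) * num_params
    return per_token * num_tokens

# e.g. one batch of 512 tokens through a 110M-parameter model
assert estimate_flos(512, 110_000_000) == 6 * 512 * 110_000_000
```

Accumulating this per training step is what lets the Trainer log a running total of FLOs alongside the other metrics.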