Feature request
When setting the flag `group_by_length=True` on `TrainingArguments`, there is no user feedback about the operations running in the background, namely collecting the lengths of all samples and running the grouping algorithm. This can be a frustrating problem when dealing with large datasets (millions of samples) on slow I/O devices, since it appears that the Trainer is hanging and never starts!
More precisely, in my current setup, I found that the following line takes almost 2h to finish (due to my slow I/O: reading from an NFS on an old machine):
transformers/src/transformers/trainer_pt_utils.py, line 585 in c817c17:

```python
lengths = [len(feature[model_input_name]) for feature in dataset]
```
NOTE 1): Wouldn't using `.select_columns(model_input_name)` and then iterating be faster, assuming the dataset has more features, like "attention_mask" for instance?
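A minimal sketch of what I mean, assuming a recent 🤗 `datasets.Dataset` (the function name is mine; timings obviously depend on the setup):

```python
from datasets import Dataset

def compute_lengths(dataset: Dataset, model_input_name: str = "input_ids"):
    # Dropping the unused columns (e.g. "attention_mask", "labels") should
    # mean less data is read and decoded per sample while iterating.
    slim = dataset.select_columns([model_input_name])
    return [len(feature[model_input_name]) for feature in slim]
```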
I believe more feedback could be given to the user, such as an estimate of the time it will take to finish. (The computed lengths could also be stored under `.cache`.)
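As a rough illustration of the kind of feedback I have in mind (a sketch only, not a proposed implementation; `tqdm` is already a transformers dependency, but the exact integration point is up for discussion):

```python
from datasets import Dataset
from tqdm.auto import tqdm

def compute_lengths_with_feedback(dataset: Dataset, model_input_name: str = "input_ids"):
    # Wrapping the iteration in tqdm yields a progress bar with a running
    # rate and an ETA, so it is obvious the sampler is working, not hanging.
    return [
        len(feature[model_input_name])
        for feature in tqdm(dataset, desc="Computing lengths for group_by_length")
    ]
```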
NOTE 2): After realising this issue, I also noticed the `length_column_name` flag. Maybe a warning could be raised to let users know that on larger datasets they should precompute the lengths. By doing so, the time went from 2h down to 15-20 min.
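For reference, this is roughly how I precomputed the lengths (a sketch, assuming a tokenized dataset with an "input_ids" column; "length" is the default value of `length_column_name`, and `num_proc` is just an example):

```python
from datasets import Dataset
from transformers import TrainingArguments

def add_length_column(dataset: Dataset) -> Dataset:
    # Compute every sample's length once, in parallel, and persist it as a
    # regular column; datasets caches the result on disk for later runs.
    return dataset.map(
        lambda batch: {"length": [len(ids) for ids in batch["input_ids"]]},
        batched=True,
        num_proc=4,  # tune to the machine
    )

# With the column present, the Trainer reads it directly instead of
# iterating over the whole dataset in a Python loop.
training_args = TrainingArguments(
    output_dir="out",
    group_by_length=True,
    length_column_name="length",
)
```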
Motivation
I was training a model on an LM task. My dataset has 22M samples with an average length of roughly 512. When I ran the model with `group_by_length=True`, I thought that something was wrong because the training was not starting (I was actually writing a bug report about my problem, because I thought it was an issue with the Trainer). After further inspection, I noticed that the main culprit was the computation of the lengths, which is really slow on my current setup.
Your contribution
If you feel this is an issue worth addressing, I am willing to open a PR under your guidance.