
Add time progress bar to track the group_by_length computation for bigger datasets on Trainer #28069

@T-Almeida


Feature request

When setting the flag group_by_length=True on the TrainingArguments, the user gets no feedback about the operations running in the background, namely collecting the list of lengths for all samples and running the grouping algorithm. This can be frustrating when dealing with large datasets (millions of samples) on slow IO devices, since it looks as if the Trainer is hanging and never starts.

More precisely, in my current setup the following line takes almost 2 h to finish, because of my slow IO (reading from an NFS share on an old machine):

lengths = [len(feature[model_input_name]) for feature in dataset]

NOTE 1): Would using .select_columns(model_input_name) and then iterating not be faster, assuming the dataset has more features, such as "attention_mask"?
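A minimal sketch of the idea in NOTE 1, assuming the dataset is a `datasets.Dataset` (which provides `select_columns`). The helper name `lengths_from_single_column` is hypothetical, and whether this is actually faster depends on the storage backend; it has not been measured here.

```python
def lengths_from_single_column(dataset, model_input_name="input_ids"):
    """Compute per-sample lengths while reading only one column.

    `dataset` is assumed to expose `select_columns`, as `datasets.Dataset` does.
    """
    # Keep only the column whose length we need, so that other features
    # (e.g. "attention_mask") are not loaded while iterating.
    only_inputs = dataset.select_columns([model_input_name])
    return [len(feature[model_input_name]) for feature in only_inputs]
```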

I believe more feedback could be given to the user, such as the estimated time to finish. The computed lengths could also be stored under .cache.
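As an illustration of the kind of feedback meant here, the length computation could be wrapped in a progress bar. This is only a sketch, not the Trainer's actual code: the function name `compute_lengths` is hypothetical, and it assumes `tqdm` is installed (with a silent fallback if it is not) and that the dataset supports `len()`.

```python
try:
    from tqdm.auto import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm (no progress bar shown).
    def tqdm(iterable, **kwargs):
        return iterable


def compute_lengths(dataset, model_input_name="input_ids"):
    """Return the length of every sample, reporting progress and an ETA."""
    return [
        len(feature[model_input_name])
        for feature in tqdm(dataset, total=len(dataset), desc="Computing sample lengths")
    ]
```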

NOTE 2): After realising this issue, I also noticed the length_column_name flag. Maybe a warning could be raised to let users know that on larger datasets they should precompute the lengths. By doing so, the time went from 2 h to 15-20 min.
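For reference, precomputing the lengths as described in NOTE 2 could look roughly like the sketch below. It assumes a `datasets.Dataset` whose samples have an "input_ids" field; the helper names `add_length` and `with_length_column` and the output directory "out" are placeholders chosen for illustration.

```python
def add_length(example, model_input_name="input_ids"):
    """Return a new "length" field holding the length of the model input."""
    return {"length": len(example[model_input_name])}


def with_length_column(dataset):
    """Add a precomputed "length" column; `dataset` is assumed to be a datasets.Dataset."""
    return dataset.map(add_length)


# Usage sketch (requires `datasets` and `transformers`; not executed here):
# from transformers import TrainingArguments
# train_dataset = with_length_column(train_dataset)
# args = TrainingArguments(
#     output_dir="out",
#     group_by_length=True,
#     length_column_name="length",  # column the Trainer reads the lengths from
# )
```

Since `Dataset.map` caches its result, the length column only needs to be computed once for a given dataset.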

Motivation

I was training a model on an LM task. My dataset has 22M samples with an average length of about 512. When I ran the model with group_by_length=True, I thought something was wrong because the training was not starting (I was actually writing a bug report about it, because I thought it was an issue with the Trainer). After further inspection, I noticed that the main culprit was the computation of the lengths, which is really slow on my current setup.

Your contribution

If you feel this issue is worth addressing, I am willing to open a PR under your guidance.
