Feature request
When setting the flag `group_by_length=True` on `TrainingArguments`, there is no user feedback about the operations running in the background, namely collecting the lengths of all samples and running the grouping algorithm. This can be a frustrating problem when dealing with large datasets (millions of samples) on slow I/O devices, since it appears that the Trainer is hanging and never starts!
More precisely, in my current setup, I found that the following line takes almost 2h to finish (due to my slow I/O: reading from an NFS on an old machine):
transformers/src/transformers/trainer_pt_utils.py, line 585 in c817c17:

```python
lengths = [len(feature[model_input_name]) for feature in dataset]
```
NOTE 1): Wouldn't using `.select_columns(model_input_name)` and then iterating be faster, assuming the dataset has more features, like "attention_mask" for instance?
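A minimal sketch of what I mean, assuming a recent 🤗 `datasets.Dataset` (the function name is mine; timings obviously depend on the setup):

```python
from datasets import Dataset

def compute_lengths(dataset: Dataset, model_input_name: str = "input_ids"):
    # Dropping the unused columns (e.g. "attention_mask", "labels") should
    # mean less data is read and decoded per sample while iterating.
    slim = dataset.select_columns([model_input_name])
    return [len(feature[model_input_name]) for feature in slim]
```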
I believe more feedback could be given to the user, such as an estimate of the time it will take to finish. (The computed lengths could also be stored under `.cache`.)
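As a rough illustration of the kind of feedback I have in mind (a sketch only, not a proposed implementation; `tqdm` is already a transformers dependency, but the exact integration point is up for discussion):

```python
from datasets import Dataset
from tqdm.auto import tqdm

def compute_lengths_with_feedback(dataset: Dataset, model_input_name: str = "input_ids"):
    # Wrapping the iteration in tqdm yields a progress bar with a running
    # rate and an ETA, so it is obvious the sampler is working, not hanging.
    return [
        len(feature[model_input_name])
        for feature in tqdm(dataset, desc="Computing lengths for group_by_length")
    ]
```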
NOTE 2): After realising this issue, I also noticed the `length_column_name` flag. Maybe a warning could be raised to let users know that on larger datasets they should precompute the lengths. By doing so, the time went from 2h down to 15-20 min.
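For reference, this is roughly how I precomputed the lengths (a sketch, assuming a tokenized dataset with an "input_ids" column; "length" is the default value of `length_column_name`, and `num_proc` is just an example):

```python
from datasets import Dataset
from transformers import TrainingArguments

def add_length_column(dataset: Dataset) -> Dataset:
    # Compute every sample's length once, in parallel, and persist it as a
    # regular column; datasets caches the result on disk for later runs.
    return dataset.map(
        lambda batch: {"length": [len(ids) for ids in batch["input_ids"]]},
        batched=True,
        num_proc=4,  # tune to the machine
    )

# With the column present, the Trainer reads it directly instead of
# iterating over the whole dataset in a Python loop.
training_args = TrainingArguments(
    output_dir="out",
    group_by_length=True,
    length_column_name="length",
)
```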
Motivation
I was training a model on an LM task. My dataset has 22M samples with an average length of roughly 512. When I ran the model with `group_by_length=True`, I thought that something was wrong because the training was not starting (I was actually writing a bug report about my problem, because I thought it was an issue with the Trainer). After further inspection, I noticed that the main culprit was the computation of the lengths, which is really slow on my current setup.
Your contribution
If you feel this is an issue worth addressing, I am willing to open a PR under your guidance.