
Introduce Padding-Free Plugin to FMS-Acceleration#57

Merged
fabianlim merged 9 commits into foundation-model-stack:main from achew010:refactor/ilab-plugin
Aug 1, 2024

Conversation

@achew010
Contributor

@achew010 achew010 commented Jul 29, 2024

Description

This PR introduces a new padding-free plugin for FMS-Acceleration, which allows users to speed up their finetuning by performing attention computation without padding. It can be activated through the sft_trainer CLI by passing the plugin argument padding_free, e.g. --padding_free huggingface

Currently uses a fork of fms-hf-tuning to

  • access the plugin through an sft_trainer argument
  • load a pre-tokenized dataset

Note

  • Transformers natively supports padding-free from v4.44.0. If the installed Transformers version is lower, the plugin falls back to an internal implementation instead.
  • Currently only supports datasets that are already pre-tokenized.
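To illustrate the idea (this is a minimal sketch, not the plugin's actual code): a padding-free collator concatenates the batch's sequences into a single row and emits restart-at-zero position ids plus cumulative sequence lengths, so a variable-length attention kernel can keep attention from crossing sequence boundaries without ever computing over pad tokens. The function name and dict keys below are illustrative assumptions.

```python
def padding_free_collate(batch):
    """Flatten a batch of variable-length token-id lists into one row.

    Instead of padding every sequence to the longest one, we concatenate
    them and record per-sequence positions and boundaries, which is what
    lets attention skip pad tokens entirely.
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in batch:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))          # positions restart per sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # cumulative boundaries
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seqlens": cu_seqlens,  # consumed by a varlen attention kernel
    }

out = padding_free_collate([[5, 6, 7], [8, 9]])
# input_ids [5, 6, 7, 8, 9], position_ids [0, 1, 2, 0, 1], cu_seqlens [0, 3, 5]
```

A padded collator would instead produce a 2x3 batch with one pad token; here no pad positions exist at all.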

Test

The following comparison is between a padded example and a padding free example.

  • We observe a 27% reduction in train runtime with the padding-free plugin while processing the same underlying data

  • The improvement is dataset-dependent: we see different speedups across datasets (see reference PR), possibly due to each dataset's sequence-length distribution (longer sequences yield higher throughput and larger improvements).

Note:
The throughput metrics reported by SFTTrainer include padding tokens when padding=True (see here), so we compare using train_runtime instead.

Alpaca

| Implementation | Dataset | Model | Max Steps | Num Device | Batch Size Per Device | Train Runtime (secs) | % Increase |
|---|---|---|---|---|---|---|---|
| Padded | Alpaca | Mistral7B | 100 | 1 | 4 | 79.4 | - |
| Padding-Free | Alpaca | Mistral7B | 100 | 1 | 4 | 57.8 | 27 |
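As a quick arithmetic check, the 27% figure is the relative reduction in train_runtime between the two runs reported above:

```python
# Relative runtime reduction from the train_runtime values in the table.
padded_runtime = 79.4        # seconds, padded run
padding_free_runtime = 57.8  # seconds, padding-free run

improvement_pct = (padded_runtime - padding_free_runtime) / padded_runtime * 100
print(f"{improvement_pct:.1f}%")  # 27.2%
```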
Reproduce

Padded Experiment

export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --packing False \
    --max_seq_len 4096 \
    --learning_rate 2e-5 \
    --torch_dtype float16 \
    --use_flash_attn True \
    --response_template '### Response:' \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path $DATASET \
    --per_device_train_batch_size 4 \
    --output_dir benchmark_outputs/ilab \
    --skip_memory_metrics False
Result
{'loss': 1.0213, 'grad_norm': 49.03125, 'learning_rate': 2e-05, 'epoch': 0.04}                                                       
{'loss': 1.0554, 'grad_norm': 49.90625, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.08}                                      
{'loss': 0.9129, 'grad_norm': 41.65625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.12}                                       
{'loss': 1.1889, 'grad_norm': 71.875, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.16}                                        
{'loss': 1.5754, 'grad_norm': 59.78125, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.2}                                       
{'loss': 1.0262, 'grad_norm': 42.25, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.24}                                          
{'loss': 1.0137, 'grad_norm': 35.03125, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.28}                                       
{'loss': 1.066, 'grad_norm': 65.6875, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.32}                                         
{'loss': 1.3277, 'grad_norm': 37.4375, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.36}                                        
{'loss': 1.1137, 'grad_norm': 48.28125, 'learning_rate': 0.0, 'epoch': 0.4}                                                          
{'train_runtime': 79.4079, 'train_samples_per_second': 5.037, 'train_steps_per_second': 1.259, 'train_tokens_per_second': 2100.547, 'train_loss': 1.130120143890381, 'init_mem_cpu_alloc_delta': -14388334592, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483914752, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 665673728, 'train_mem_gpu_alloc_delta': 28984274432, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28999681024, 'before_init_mem_cpu': 15135694848, 'before_init_mem_gpu': 0, 'epoch': 0.4}

Padding-Free Experiment

Reproduce
export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --padding_free huggingface \
    --packing False \
    --max_seq_len 4096 \
    --learning_rate 2e-5 \
    --torch_dtype float16 \
    --use_flash_attn True \
    --response_template '### Response:' \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path $DATASET \
    --per_device_train_batch_size 4 \
    --output_dir benchmark_outputs/ilab \
    --skip_memory_metrics False
Result
{'loss': 1.7849, 'grad_norm': 165.0, 'learning_rate': 2e-05, 'epoch': 0.0}
{'loss': 1.433, 'grad_norm': 158.25, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.0}
{'loss': 1.2872, 'grad_norm': 60.90625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.0}
{'loss': 1.2817, 'grad_norm': 93.625, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.0}
{'loss': 1.1573, 'grad_norm': 41.65625, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.0}
{'loss': 1.0525, 'grad_norm': 42.03125, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.0}
{'loss': 1.9564, 'grad_norm': 125.1875, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.01}
{'loss': 1.0277, 'grad_norm': 44.40625, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.01}
{'loss': 0.9661, 'grad_norm': 31.546875, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.01}
{'loss': 0.9497, 'grad_norm': 27.140625, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 57.805, 'train_samples_per_second': 6.92, 'train_steps_per_second': 1.73, 'train_tokens_per_second': 2383.876, 'train_loss': 1.2896488857269288, 'init_mem_cpu_alloc_delta': -14387732480, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483365888, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 652550144, 'train_mem_gpu_alloc_delta': 28984245248, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28990169600, 'before_init_mem_cpu': 15090880512, 'before_init_mem_gpu': 0, 'epoch': 0.01}

@achew010 achew010 changed the title from Refactor/ilab plugin to Introduce Padding-Free Plugin to FMS-Acceleration Jul 29, 2024
@achew010 achew010 marked this pull request as ready for review July 29, 2024 09:46
@achew010 achew010 requested a review from fabianlim as a code owner July 29, 2024 09:46
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 3f08e09 to decc009 Compare July 29, 2024 10:21
@fabianlim
Contributor

fabianlim commented Jul 29, 2024

Make sure to go through this checklist: https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/framework#adding-new-plugins

For benches, maybe we can think about how to make a separate set from the current set, since this is completely separate from the other plugins, so that we do not have to rerun all the benches every time. This will require some changes to the benchmarking. Maybe one simple solution is to just have a different scenarios-ilab.yaml for

@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 71321a1 to 3238801 Compare August 1, 2024 06:12
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 915ba17 to bff3128 Compare August 1, 2024 08:21
achew010 and others added 9 commits August 1, 2024 09:35
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 66f9cc2 to c9e355a Compare August 1, 2024 09:36
@fabianlim fabianlim merged commit a6f6ef0 into foundation-model-stack:main Aug 1, 2024
fabianlim added a commit that referenced this pull request Aug 2, 2024
* edits to readme

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* Apply suggestions from code review

Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* more readme changes

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

---------

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>