
Introduce Padding-Free Plugin to FMS-Acceleration#57

Merged
fabianlim merged 9 commits into foundation-model-stack:main from achew010:refactor/ilab-plugin
Aug 1, 2024

Conversation

@achew010
Contributor

@achew010 achew010 commented Jul 29, 2024

Description

This PR introduces a new padding-free plugin for FMS-Acceleration, which allows users to speed up their finetuning by performing attention computation without padding. It can be activated through the sft_trainer CLI by passing the plugin argument padding_free, e.g. --padding_free huggingface

Currently uses a fork of fms-hf-tuning to

  • access the plugin through an sft_trainer argument
  • load a pre-tokenized dataset

Note

  • Transformers natively supports padding-free from v4.44.0. If the installed Transformers version is lower, the plugin falls back to an internal implementation instead.
  • Currently only supports datasets that are already pre-tokenized.
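To illustrate the idea (this is a minimal sketch, not the plugin's actual code): a padding-free collator concatenates the batch's sequences into a single row and emits restart-at-zero position ids plus cumulative sequence lengths, so a variable-length attention kernel can keep attention from crossing sequence boundaries without ever computing over pad tokens. The function name and dict keys below are illustrative assumptions.

```python
def padding_free_collate(batch):
    """Flatten a batch of variable-length token-id lists into one row.

    Instead of padding every sequence to the longest one, we concatenate
    them and record per-sequence positions and boundaries, which is what
    lets attention skip pad tokens entirely.
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in batch:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))          # positions restart per sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # cumulative boundaries
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seqlens": cu_seqlens,  # consumed by a varlen attention kernel
    }

out = padding_free_collate([[5, 6, 7], [8, 9]])
# input_ids [5, 6, 7, 8, 9], position_ids [0, 1, 2, 0, 1], cu_seqlens [0, 3, 5]
```

A padded collator would instead produce a 2x3 batch with one pad token; here no pad positions exist at all.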

Test

The following comparison is between a padded example and a padding free example.

  • We observe a 27% reduction in train runtime with the padding-free plugin while processing the same underlying data

  • The improvement is dataset-dependent: we see different speedups across datasets (see reference PR), possibly due to each dataset's sequence-length distribution (longer sequences yield higher throughput and larger improvements).

Note:
The throughput metrics reported by SFTTrainer include padding tokens when padding=True (see here), so we compare using train_runtime instead.

Alpaca

| Implementation | Dataset | Model | Max Steps | Num Device | Batch Size Per Device | Train Runtime (secs) | % Increase |
|---|---|---|---|---|---|---|---|
| Padded | Alpaca | Mistral7B | 100 | 1 | 4 | 79.4 | - |
| Padding-Free | Alpaca | Mistral7B | 100 | 1 | 4 | 57.8 | 27 |
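As a quick arithmetic check, the 27% figure is the relative reduction in train_runtime between the two runs reported above:

```python
# Relative runtime reduction from the train_runtime values in the table.
padded_runtime = 79.4        # seconds, padded run
padding_free_runtime = 57.8  # seconds, padding-free run

improvement_pct = (padded_runtime - padding_free_runtime) / padded_runtime * 100
print(f"{improvement_pct:.1f}%")  # 27.2%
```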
Reproduce

Padded Experiment

export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --packing False \
    --max_seq_len 4096 \
    --learning_rate 2e-5 \
    --torch_dtype float16 \
    --use_flash_attn True \
    --response_template '### Response:' \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path $DATASET \
    --per_device_train_batch_size 4 \
    --output_dir benchmark_outputs/ilab \
    --skip_memory_metrics False
Result
{'loss': 1.0213, 'grad_norm': 49.03125, 'learning_rate': 2e-05, 'epoch': 0.04}                                                       
{'loss': 1.0554, 'grad_norm': 49.90625, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.08}                                      
{'loss': 0.9129, 'grad_norm': 41.65625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.12}                                       
{'loss': 1.1889, 'grad_norm': 71.875, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.16}                                        
{'loss': 1.5754, 'grad_norm': 59.78125, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.2}                                       
{'loss': 1.0262, 'grad_norm': 42.25, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.24}                                          
{'loss': 1.0137, 'grad_norm': 35.03125, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.28}                                       
{'loss': 1.066, 'grad_norm': 65.6875, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.32}                                         
{'loss': 1.3277, 'grad_norm': 37.4375, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.36}                                        
{'loss': 1.1137, 'grad_norm': 48.28125, 'learning_rate': 0.0, 'epoch': 0.4}                                                          
{'train_runtime': 79.4079, 'train_samples_per_second': 5.037, 'train_steps_per_second': 1.259, 'train_tokens_per_second': 2100.547, 'train_loss': 1.130120143890381, 'init_mem_cpu_alloc_delta': -14388334592, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483914752, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 665673728, 'train_mem_gpu_alloc_delta': 28984274432, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28999681024, 'before_init_mem_cpu': 15135694848, 'before_init_mem_gpu': 0, 'epoch': 0.4}

Padding-Free Experiment

Reproduce
export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --padding_free huggingface \
    --packing False \
    --max_seq_len 4096 \
    --learning_rate 2e-5 \
    --torch_dtype float16 \
    --use_flash_attn True \
    --response_template '### Response:' \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path $DATASET \
    --per_device_train_batch_size 4 \
    --output_dir benchmark_outputs/ilab \
    --skip_memory_metrics False
Result
{'loss': 1.7849, 'grad_norm': 165.0, 'learning_rate': 2e-05, 'epoch': 0.0}
{'loss': 1.433, 'grad_norm': 158.25, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.0}
{'loss': 1.2872, 'grad_norm': 60.90625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.0}
{'loss': 1.2817, 'grad_norm': 93.625, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.0}
{'loss': 1.1573, 'grad_norm': 41.65625, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.0}
{'loss': 1.0525, 'grad_norm': 42.03125, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.0}
{'loss': 1.9564, 'grad_norm': 125.1875, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.01}
{'loss': 1.0277, 'grad_norm': 44.40625, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.01}
{'loss': 0.9661, 'grad_norm': 31.546875, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.01}
{'loss': 0.9497, 'grad_norm': 27.140625, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 57.805, 'train_samples_per_second': 6.92, 'train_steps_per_second': 1.73, 'train_tokens_per_second': 2383.876, 'train_loss': 1.2896488857269288, 'init_mem_cpu_alloc_delta': -14387732480, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483365888, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 652550144, 'train_mem_gpu_alloc_delta': 28984245248, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28990169600, 'before_init_mem_cpu': 15090880512, 'before_init_mem_gpu': 0, 'epoch': 0.01}

@achew010 achew010 changed the title from Refactor/ilab plugin to Introduce Padding-Free Plugin to FMS-Acceleration Jul 29, 2024
@achew010 achew010 marked this pull request as ready for review July 29, 2024 09:46
@achew010 achew010 requested a review from fabianlim as a code owner July 29, 2024 09:46
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 3f08e09 to decc009 Compare July 29, 2024 10:21
@fabianlim
Contributor

fabianlim commented Jul 29, 2024

Make sure to go through this checklist: https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/framework#adding-new-plugins

For benches, maybe we can think about how to make a separate set from the current set, since this is completely separate from the other plugins, so that we do not have to rerun all the benches every time. This will require some changes to the benchmarking. Maybe one simple solution is to just have a different scenarios-ilab.yaml for

@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 71321a1 to 3238801 Compare August 1, 2024 06:12
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 915ba17 to bff3128 Compare August 1, 2024 08:21
achew010 and others added 9 commits August 1, 2024 09:35
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 66f9cc2 to c9e355a Compare August 1, 2024 09:36
@fabianlim fabianlim merged commit a6f6ef0 into foundation-model-stack:main Aug 1, 2024
fabianlim added a commit that referenced this pull request Aug 2, 2024
* edits to readme

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* Apply suggestions from code review

Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* more readme changes

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

---------

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>