Generation server using HF accelerate and DS inference #321
mayank31398 wants to merge 1165 commits into bigscience-workshop:bloom-inference from mayank31398:generation-server
Conversation
This reverts commit a40d816.
* Propose a faster preprocessing mechanism by reducing inter-process communication
* Add flush in order to force print
* Try to prevent deadlocks
* Woops
* Trying to figure out what causes the deadlock
* Limit queue size to 1_000_000
* Drastically reduce the maximum number of elements in the queue
* Threading does not use a worker
* Remove shard files and factorise shard naming
* Document the high-number-of-workers preprocessing script
* Improve naming
* Update comments and readmes
* Woops
* Remove the notion of vanilla and point to the script instead
* Rephrase readme to use around 60 cores instead of 40

Co-authored-by: Thomas <24695242+thomasw21@users.noreply.github.com>
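For readers unfamiliar with the pattern these commits describe, here is a minimal sketch (not the actual preprocessing code in this branch; the file path, worker count, and function names are illustrative): a reader feeds documents into a size-limited `multiprocessing.Queue` so tokenizing workers cannot fall arbitrarily far behind, and progress prints are flushed so they show up promptly in batch-job logs.

```python
# Bounded-queue preprocessing sketch (illustrative, not the PR's code).
import multiprocessing as mp

QUEUE_MAX = 1_000_000  # cap the queue so memory use stays bounded

def worker(queue):
    processed = 0
    while True:
        item = queue.get()
        if item is None:          # sentinel: no more work
            break
        # ... tokenize `item` here ...
        processed += 1
        if processed % 100_000 == 0:
            print(f"processed {processed} docs", flush=True)  # force flush

if __name__ == "__main__":
    num_workers = 4                       # illustrative; the readme suggests ~60 cores
    q = mp.Queue(maxsize=QUEUE_MAX)
    workers = [mp.Process(target=worker, args=(q,)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    with open("data.jsonl", encoding="utf-8") as f:   # path is illustrative
        for line in f:
            q.put(line)                   # blocks when the queue is full
    for _ in workers:
        q.put(None)                       # one sentinel per worker
    for w in workers:
        w.join()
```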
* Training groupings
* Validation grouping
* Steps vs samples
* Iteration time (speed -> samples or iterations per second)
* TensorBoard group time (from `log_timers_to_tensorboard`)
* Comment on the writing condition
* Update megatron/global_vars.py
* Update megatron/training.py
* Update megatron/training.py
* Update megatron/training.py
* Update megatron/training.py
* Update megatron/training.py
* Link bug fix issue on the Megatron-LM side

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
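The grouping relies on TensorBoard's convention of grouping scalar plots by the tag prefix before the slash. A tiny illustrative sketch (tags, values, and log directory are made up, not taken from the branch):

```python
# TensorBoard groups scalars by the prefix before "/", so "train/..." and
# "valid/..." tags land in separate panels; throughput can be logged as
# samples per second alongside them.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/demo")
step, elapsed_sec, batch_size = 100, 0.85, 512
writer.add_scalar("train/lm-loss", 2.31, step)
writer.add_scalar("valid/lm-loss", 2.47, step)
writer.add_scalar("speed/samples-per-second", batch_size / elapsed_sec, step)
writer.close()
```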
* chore: update requirements.txt
* chore: rm deepspeed (the README already specifies this in greater detail)
* Update gpt2_tokenization.py: add LRU cache and speed up tokenization
* Update gpt2_tokenization.py: remove the _old method. Note that the Chinese token processing is optional and not currently used in training
* Update gpt2_tokenization.py
* Update preprocess_data.py: the path needs to be set before we can find the "megatron" package
* Update gpt2_tokenization.py: add comments about max_token_len_cache
* Update megatron/tokenizer/gpt2_tokenization.py
* Update gpt2_tokenization.py
* Update megatron/tokenizer/gpt2_tokenization.py
* Update gpt2_tokenization.py
* Update gpt2_tokenization.py

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
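The speed-up here comes from memoizing per-word BPE results. A minimal sketch of the idea (function names, cache size, and the `MAX_TOKEN_LEN_CACHE` threshold are illustrative stand-ins, not the actual gpt2_tokenization.py code): GPT-2 BPE applies an expensive merge loop per word, so caching the result for frequent words avoids recomputation, while a length cap keeps rare, very long tokens from evicting useful entries.

```python
from functools import lru_cache

MAX_TOKEN_LEN_CACHE = 128  # illustrative cap on word length worth caching

def _bpe(word: str) -> str:
    # stand-in for the real (expensive) byte-pair merge loop
    return " ".join(word)

@lru_cache(maxsize=500_000)
def _bpe_cached(word: str) -> str:
    return _bpe(word)

def bpe(word: str) -> str:
    # frequent short words hit the cache; very long outliers bypass it
    if len(word) <= MAX_TOKEN_LEN_CACHE:
        return _bpe_cached(word)
    return _bpe(word)
```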
Fix markdown formatting
…e of lr-decay-style
* feat: add GLU variant activations
* fix: rm extraneous parentheses
* feat: rm bias to support jit
* fix: replace negative dim with explicit dim
* fix: use `x.ndim` for generic dim handling
* docs: add note on version for posterity
* docs: specify jit in `x.ndim` comment
* test: add simple tests to check activations
* fix: use `torch.testing` for tensor checks
* test: use seed-controlled random batch inputs

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
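For context, a GLU-variant activation splits the last dimension of its input in half and lets one half gate the other. The sketch below shows the standard GEGLU formulation with an explicit last-dim index and a seed-controlled check, in the spirit of the commits above; it is not necessarily the exact module added in this branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # split the last dimension into value and gate halves
        # (explicit dim via x.ndim keeps the module TorchScript-friendly)
        value, gate = x.chunk(2, dim=x.ndim - 1)
        return value * F.gelu(gate)

# quick seed-controlled sanity check
torch.manual_seed(0)
batch = torch.randn(4, 8)  # last dim must be even
out = GEGLU()(batch)
torch.testing.assert_close(out, batch[:, :4] * F.gelu(batch[:, 4:]))
```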
* indexed_dataset: use numpy to compute byte offsets faster
* preprocess with huggingface datasets and mpi
* preprocess_dataset_mpi: add --shuffle and --seed options
* indexed_dataset: fix to handle file with 0 items
* preprocess_dataset_mpi: add --split and --count options
* update script comments to reflect shuffle behavior
* add torch.distributed version
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* add estimated progress logging
* avoid downloading dataset unless user really wants to
* Update tools/preprocess_dataset_mpi.py
* Update tools/preprocess_dataset_mpi.py
* refactor main into more functions
* reformat progress messages
* move mpi4py import test to get_args
* drop Open MPI variables from init_process_group
* add --local_rank to support torch.distributed.launch
* update from DeepSpeedExamples
* raise exceptions on errors
* drop --download option
* format byte rate as MB/s
* Update tools/preprocess_dataset_mpi.py
* move datasets import back to top
* import config from datasets

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
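The "use numpy to compute byte offsets faster" item refers to replacing a Python loop with a vectorized cumulative sum. A small illustrative sketch (the helper name and example sizes are made up, not the indexed_dataset code itself):

```python
import numpy as np

def byte_offsets(sizes, dtype_size: int) -> np.ndarray:
    """Start offset (in bytes) of each document in the packed .bin file."""
    sizes = np.asarray(sizes, dtype=np.int64)
    offsets = np.zeros(len(sizes), dtype=np.int64)
    # offset[i] = sum of byte lengths of all documents before i
    np.cumsum(sizes[:-1] * dtype_size, out=offsets[1:])
    return offsets

# e.g. three documents of 5, 3 and 7 int32 tokens -> offsets 0, 20, 32
print(byte_offsets([5, 3, 7], dtype_size=4))   # [ 0 20 32]
```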
* add test suite * add test suite
…#63)
* shuffle index list with numpy, scatter list, use file for large lists
* drop unused idx_end from index scatter
* drop scatter list file to simplify, can add back if needed
* rework scatterv, recompute num_samples when needed
* Update tools/preprocess_dataset_mpi.py
* fix spacing

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
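For readers unfamiliar with the Scatterv step mentioned above, the general pattern (sketched below with illustrative sizes and seed; this is not the script's exact code and assumes mpi4py is installed) is: rank 0 shuffles the full index list with a seeded numpy RNG, then scatters one contiguous slice to each rank so no rank ever holds the whole list after the scatter.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

num_samples = 10_000  # illustrative

# every rank computes the same counts/displacements deterministically
counts = np.full(nranks, num_samples // nranks, dtype=np.int64)
counts[: num_samples % nranks] += 1
displs = np.concatenate(([0], np.cumsum(counts)[:-1]))

if rank == 0:
    rng = np.random.default_rng(seed=42)
    indices = rng.permutation(num_samples).astype(np.int64)
    sendbuf = [indices, counts, displs, MPI.INT64_T]
else:
    sendbuf = None  # only the root provides the send buffer

local = np.empty(counts[rank], dtype=np.int64)
comm.Scatterv(sendbuf, local, root=0)
print(f"rank {rank} got {local.size} indices", flush=True)
```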
@pai4451 I am open to suggestions if you have any.
@mayank31398 Thanks, I'm also working on serving BLOOM with DeepSpeed. I think this solution might work, but in terms of serving we have to consider the maintenance cost. The difficult part, I think, is keeping all processes stable (alive and synchronized).
@pai4451 can you give this latest code a try?
What the current code is doing: it does work up to line 164 (when a request is sent), and I see the tokenized input on all 8 processes.
@mayank31398 I also get stuck on the model.generate() line. Maybe some processes failed to communicate with the others, or the processes are not synchronized? I have doubts about the way the server is launched via
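For context on the hang being discussed: with tensor parallelism, every rank must reach model.generate() together, or the ranks that did enter it block on collectives forever. A common pattern is for rank 0 to own the HTTP endpoint and broadcast each request to the other ranks. The sketch below illustrates that pattern only; it is not the code in this PR, and it assumes torch.distributed is already initialized and `model`/`tokenizer` are loaded on every rank.

```python
import torch
import torch.distributed as dist

def serve_request(model, tokenizer, prompt=None, max_new_tokens=100):
    rank = dist.get_rank()
    # rank 0 has the real prompt; the other ranks receive it via broadcast
    payload = [prompt] if rank == 0 else [None]
    dist.broadcast_object_list(payload, src=0)

    device = f"cuda:{torch.cuda.current_device()}"
    inputs = tokenizer(payload[0], return_tensors="pt").to(device)
    with torch.no_grad():
        # every rank calls generate(), so the collectives inside it line up
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # only rank 0 returns text to the HTTP client
    if rank == 0:
        return tokenizer.decode(output[0], skip_special_tokens=True)
    return None
```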
@pai4451 The DS inference server is working now. You can use the scripts.
@stas00, I would like to contribute this to the bloom-inference branch, if that's all right?
Do you think the code can run on 2 nodes?
I am not sure. I have only tested on 1 node with 8 x 80GB A100 GPUs. Even if you can run it on 2 nodes, the original Megatron-LM paper doesn't recommend spanning tensor parallelism across nodes.
I screwed up this PR.
Moving to #325 |
Well, ideally all of this should go directly to https://github.com/huggingface/transformers/tree/main/examples/research_projects/bloom-inference (that last path component doesn't exist yet), so the bloom-inference branch here should be moved there as well. Does your code depend on the script under the bloom-inference branch? If not, perhaps open a separate PR into transformers and tag me on it? At some point I will be doing the same for the bloom-inference branch.
Well, no @stas00, but it has a lot of duplicate code for now; that's why re-using the same methods across scripts would be better. Is it possible this is caused by the CUDA version (I am using 11.6)?
Also, the memory leak in HF accelerate is not seen by @sgugger, so I am not sure why it is happening in my environment.
I suppose we could start turning the scripts into small libraries that the scripts would pull in. Would it help if I merged the bloom-inference branch, you rebased onto it, and then started converting the scripts into libs and re-using the code?
This PR depends on
There are some redundant methods in some scripts that can be removed once #308 is merged into the main branch.
This PR adds scripts for creating a generation server using both HF accelerate and DeepSpeed inference.
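As a rough sketch of the two loading paths such scripts cover (the model name, dtype, and argument values below are illustrative, and this is not the exact code in the PR; it assumes transformers, accelerate, and deepspeed are installed): HF accelerate shards the checkpoint across the available GPUs via device_map="auto", while DeepSpeed-inference tensor-parallelizes the model across the ranks it is launched with.

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigscience/bloom"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def load_hf_accelerate():
    # accelerate places shards across GPUs (and CPU, if needed) automatically
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto", torch_dtype=torch.bfloat16
    )

def load_ds_inference():
    import deepspeed
    # launched with `deepspeed --num_gpus N ...`; one process per GPU
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
    return deepspeed.init_inference(
        model,
        mp_size=int(os.getenv("WORLD_SIZE", "1")),
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
```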