Dynamic parallel processing Size Adjustment for Low Mem Beam Search #28833
Hi @Saibo-creator 👋 We're doing a sprint to add … I'll keep you updated 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @gante 👋 Any news on the timeline? I can adapt this as needed. 🤗
No, not yet. It's taking longer than we anticipated :)
Good luck, vamos! |
What does this PR do?
TL;DR
This PR addresses feedback from the community, specifically a suggestion from @gante, to enhance memory management in beam search operations without adding complexity through additional flags. This development strikes a balance between performance and usability, ensuring the model dynamically adjusts to various hardware constraints.
Details
This Pull Request (PR) introduces the ability to dynamically adjust the batch size during low-memory beam search operations. Traditional beam search, with a beam width of `k` and a batch size of `n`, operates as though the batch size were `n*k`. The recently introduced low-memory beam search improves memory efficiency by dividing the `n*k` batch into `k` sub-batches of size `n`. However, this approach has shown limitations in two scenarios:

1. **Optimizing for the hardware's maximum parallel processing capacity (`s`)**: when `s` falls between `n` and `n*k`, the current method may not use the available resources efficiently. For example, with `n=10`, `k=10`, and `s=30`, the low-memory beam search executes ten sequential operations with a batch size of 10, whereas it could achieve better throughput with four operations of batch size 25.
2. **Handling out-of-memory (OOM) errors when `s < n`**: when `s` is smaller than `n`, the low-memory beam search can still run into OOM errors, even though splitting the batch further would allow the operation to proceed. While one might argue for using smaller batch sizes from the start, this PR instead optimizes the processing dynamically.

Implementation Highlights:
Dynamic Batch Size Adjustment: By adopting a try/except loop, the system starts with the standard beam search parameters and halves the batch size upon encountering OOM errors, with a minimum threshold of 1. This mechanism keeps memory usage within bounds while preserving as much parallelism as the hardware allows.
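As a rough illustration (not the PR's actual code), the try/except halving loop might look like the sketch below, where `step` stands in for one model forward pass and `MemoryError` stands in for `torch.cuda.OutOfMemoryError`; all names here are illustrative:

```python
def run_with_backoff(step, inputs, start_size, min_size=1):
    """Process `inputs` in sub-batches of `start_size`; on an OOM error,
    halve the sub-batch size and retry, down to `min_size`."""
    size = start_size
    while True:
        try:
            out = []
            for i in range(0, len(inputs), size):
                out.extend(step(inputs[i : i + size]))
            return out, size  # also report the size that worked
        except MemoryError:  # in the real code: torch.cuda.OutOfMemoryError
            if size <= min_size:
                raise  # cannot split any further
            size = max(min_size, size // 2)

# Toy demo: pretend anything above 4 rows per pass runs out of memory.
def step(chunk):
    if len(chunk) > 4:
        raise MemoryError
    return [x * 2 for x in chunk]

# Falls back from 10 to 5 to 2 before the whole pass succeeds.
out, size = run_with_backoff(step, list(range(10)), start_size=10)
```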
Global Batch Size Caching: The implementation caches the most recent successful batch size in a global variable, `optimal_low_mem_beam_search_bs`, allowing subsequent steps to reuse the most efficient known size without rediscovering it. As text inputs lengthen and memory usage increases during generation, `optimal_low_mem_beam_search_bs` is periodically updated to reflect the current optimum.

API Impact:
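A minimal sketch of what such module-level caching could look like; only the variable name `optimal_low_mem_beam_search_bs` is taken from the PR description, the helper names are hypothetical:

```python
# Cache of the last sub-batch size that fit in memory (None = not yet known).
optimal_low_mem_beam_search_bs = None

def starting_batch_size(requested: int) -> int:
    """Start from the cached size when one is known, capped at the request."""
    if optimal_low_mem_beam_search_bs is None:
        return requested
    return min(requested, optimal_low_mem_beam_search_bs)

def record_successful_batch_size(size: int) -> None:
    """Update the cache after a pass completes without an OOM error."""
    global optimal_low_mem_beam_search_bs
    optimal_low_mem_beam_search_bs = size
```

Subsequent generation steps would then call `starting_batch_size` instead of retrying from the full `n*k` batch, and re-record the size after each successful pass so the cache tracks memory growth as sequences lengthen.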
This update will be transparent to end users, involving no changes to the existing API. Users can expect improved efficiency without any alteration to the results produced by previous implementations.
Testing:
Existing tests confirm that the results from the low memory beam search align with those from the traditional beam search method. Specific tests for dynamic parallel processing sizes are not yet implemented. If you think it's worth adding some, I have a draft below.
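A draft along those lines could simply check that splitting into sub-batches of any size leaves the outputs unchanged, here with a deterministic toy forward pass standing in for the model; none of these names come from the transformers test suite:

```python
def toy_forward(batch):
    # Deterministic per-row stand-in for a model forward pass.
    return [x + 1 for x in batch]

def test_split_matches_full_batch():
    batch = list(range(100))
    full = toy_forward(batch)
    for size in (1, 7, 25, 100):  # dynamically chosen sub-batch sizes
        split = []
        for i in range(0, len(batch), size):
            split.extend(toy_forward(batch[i : i + size]))
        assert split == full, f"mismatch at sub-batch size {size}"

test_split_matches_full_batch()
```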
Doc:
Do you think we should mention this in the doc? Currently we have the following in the doc:

[screenshot of the current doc entry]
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@gante