Dynamic parallel processing Size Adjustment for Low Mem Beam Search #28833

Closed
Saibo-creator wants to merge 7 commits into huggingface:main from Saibo-creator:low_mem_beam_search_auto_split

Conversation

Saibo-creator (Contributor) commented Feb 2, 2024

What does this PR do?

TL;DR

This PR addresses feedback from the community, specifically a suggestion from @gante, to enhance memory management in beam search operations without adding complexity through additional flags. This development strikes a balance between performance and usability, ensuring the model dynamically adjusts to various hardware constraints.

Details

This Pull Request (PR) introduces the ability to dynamically adjust the batch size during low-memory beam search. Traditional beam search with a beam width of k and a batch size of n operates as though the batch size were n*k. The recently introduced low memory beam search improves memory efficiency by dividing the n*k batch into k sequential sub-batches of size n. However, this approach has shown limitations in two scenarios:

  1. Optimizing for Hardware's Maximum Parallel Processing Capacity (s): When the hardware's maximum parallel processing capacity s falls between n and n*k, the current method may not use the available resources efficiently. For example, with n=10, k=10, and s=30, the low memory beam search executes ten sequential operations with a batch size of 10, whereas it could achieve better throughput with four operations of batch size 25 (see the arithmetic sketch after this list).

  2. Handling Out-Of-Memory (OOM) Errors When s < n: When s is smaller than n, the low memory beam search can still hit OOM errors, even though splitting the batch further would allow the operation to proceed. One might argue for simply using a smaller batch size from the start, but this PR instead optimizes the split dynamically.
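To make scenario 1 concrete, here is a small arithmetic sketch (illustration only, not code from this PR; the helper name sub_batch_plan is invented):

    import math

    def sub_batch_plan(n: int, k: int, s: int) -> tuple[int, int]:
        # Choose the fewest sequential passes whose sub-batch fits within s,
        # then rebalance so every pass processes roughly the same amount.
        total = n * k
        passes = math.ceil(total / min(total, s))
        batch = math.ceil(total / passes)
        return batch, passes

    print(sub_batch_plan(10, 10, 30))  # -> (25, 4): four passes of batch size 25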

Implementation Highlights:

  • Dynamic Batch Size Adjustment: Using a try/except loop, the system starts with the standard beam search parameters and halves the batch size upon encountering an OOM error, down to a minimum threshold of 1. This mechanism balances memory usage against throughput (see the sketch after this list).

  • Global Batch Size Caching: The most recent successful batch size is cached in a global variable, optimal_low_mem_beam_search_bs, so later steps reuse it instead of rediscovering it. Because memory usage grows as the generated sequence lengthens, optimal_low_mem_beam_search_bs is periodically updated to reflect the most current optimal conditions.
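Below is a minimal sketch of the try/except halving loop described above, assuming PyTorch. The helper run_sub_batches and the overall control flow are illustrative; only the variable name optimal_low_mem_beam_search_bs comes from this PR.

    import torch

    optimal_low_mem_beam_search_bs = None  # cached across decoding steps

    def run_sub_batches(model, input_ids, sub_batch_size):
        # Split the flattened (batch * beams) inputs into chunks and stitch
        # the next-token logits back together.
        chunks = torch.split(input_ids, sub_batch_size, dim=0)
        logits = [model(chunk).logits[:, -1, :] for chunk in chunks]
        return torch.cat(logits, dim=0)

    def forward_with_dynamic_batch(model, input_ids):
        global optimal_low_mem_beam_search_bs
        # Start from the cached size if available, else the full n*k batch.
        batch_size = optimal_low_mem_beam_search_bs or input_ids.shape[0]
        while batch_size >= 1:
            try:
                out = run_sub_batches(model, input_ids, batch_size)
                optimal_low_mem_beam_search_bs = batch_size  # remember success
                return out
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                batch_size //= 2  # halve and retry, down to a minimum of 1
        raise RuntimeError("OOM even with a sub-batch size of 1")

One caveat of the cache: a transient OOM lowers the stored size for all subsequent steps, which is why it is periodically re-evaluated as generation proceeds.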

API Impact:

This update will be transparent to end users, involving no changes to the existing API. Users can expect improved efficiency without any alteration to the results produced by previous implementations.

Testing:

Existing tests confirm that the results from the low memory beam search align with those from the traditional beam search method. Specific tests for dynamic parallel processing sizes are not yet implemented. If you think it's worth adding some, I have a draft below.

Doc:

Do you think we should mention this in the doc? Currently we have

            sequential (`bool`, defaults to `False`):
                By default, beam search has `batch_size * num_beams` as effective batch size (see `beam_search()` for
                more details). This flag will avoid parallelizing the beam search and will instead run beam search
                sequentially.

in the doc
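For context, users opt into the sequential path via the low_memory flag on generate; the snippet below is a sketch of that existing usage (assuming the public low_memory generation argument), not part of this PR's diff:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    # low_memory=True runs beam search sequentially over sub-batches; with
    # this PR the sub-batch size would additionally adapt on OOM.
    output = model.generate(**inputs, num_beams=4, low_memory=True, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))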

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gante

@Saibo-creator force-pushed the low_mem_beam_search_auto_split branch from 8f46347 to e516a8f on February 5, 2024 at 12:28
gante (Contributor) commented Feb 14, 2024

Hi @Saibo-creator 👋

We're doing a sprint to add torch.compile support on generate (tracker), so I'm halting the addition of changes that substantially modify a decoding method until that is complete. In particular, beam search will have to be rewritten, so this PR will likely need to come in a different shape.

I'll keep you updated 🤗

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions (bot) closed this on Mar 20, 2024
Saibo-creator (Contributor, Author) commented

Hey @gante 👋

Any news on the timeline? I can adapt this as needed. 🤗

gante (Contributor) commented Mar 20, 2024

No, not yet. It's taking longer than we anticipated :)

Saibo-creator (Contributor, Author) commented

> No, not yet. It's taking longer than we anticipated :)

Good luck, vamos!
