Dynamic parallel processing Size Adjustment for Low Mem Beam Search #28833

Closed
Saibo-creator wants to merge 7 commits into huggingface:main from Saibo-creator:low_mem_beam_search_auto_split

Conversation

Saibo-creator (Contributor) commented Feb 2, 2024

What does this PR do?

TL;DR

This PR addresses feedback from the community, specifically a suggestion from @gante, to enhance memory management in beam search operations without adding complexity through additional flags. This development strikes a balance between performance and usability, ensuring the model dynamically adjusts to various hardware constraints.

Details

This Pull Request (PR) introduces the ability to dynamically adjust the batch size during low-memory beam search. Traditional beam search with a beam width of k and a batch size of n operates as though the batch size were n*k. The recently introduced low memory beam search improves memory efficiency by dividing the n*k batch into k sequential sub-batches of size n. However, this approach has shown limitations in two scenarios:

  1. Optimizing for Hardware's Maximum Parallel Processing Capacity (s): When the hardware's maximum parallel processing capacity s falls between n and n*k, the current method may not use the available resources efficiently. For example, with n=10, k=10, and s=30, the low memory beam search executes ten sequential operations with a batch size of 10, whereas it could achieve better throughput with four operations of batch size 25 (see the arithmetic sketch after this list).

  2. Handling Out-Of-Memory (OOM) Errors When s < n: When s is smaller than n, the low memory beam search can still hit OOM errors, even though splitting the batch further would allow the operation to proceed. One might argue for simply using a smaller batch size from the start, but this PR instead optimizes the split dynamically.
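To make scenario 1 concrete, here is a small arithmetic sketch (illustration only, not code from this PR; the helper name sub_batch_plan is invented):

    import math

    def sub_batch_plan(n: int, k: int, s: int) -> tuple[int, int]:
        # Choose the fewest sequential passes whose sub-batch fits within s,
        # then rebalance so every pass processes roughly the same amount.
        total = n * k
        passes = math.ceil(total / min(total, s))
        batch = math.ceil(total / passes)
        return batch, passes

    print(sub_batch_plan(10, 10, 30))  # -> (25, 4): four passes of batch size 25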

Implementation Highlights:

  • Dynamic Batch Size Adjustment: Using a try/except loop, the system starts with the standard beam search parameters and halves the batch size upon encountering an OOM error, down to a minimum threshold of 1. This mechanism balances memory usage against throughput (see the sketch after this list).

  • Global Batch Size Caching: The most recent successful batch size is cached in a global variable, optimal_low_mem_beam_search_bs, so later steps reuse it instead of rediscovering it. Because memory usage grows as the generated sequence lengthens, optimal_low_mem_beam_search_bs is periodically updated to reflect the most current optimal conditions.
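Below is a minimal sketch of the try/except halving loop described above, assuming PyTorch. The helper run_sub_batches and the overall control flow are illustrative; only the variable name optimal_low_mem_beam_search_bs comes from this PR.

    import torch

    optimal_low_mem_beam_search_bs = None  # cached across decoding steps

    def run_sub_batches(model, input_ids, sub_batch_size):
        # Split the flattened (batch * beams) inputs into chunks and stitch
        # the next-token logits back together.
        chunks = torch.split(input_ids, sub_batch_size, dim=0)
        logits = [model(chunk).logits[:, -1, :] for chunk in chunks]
        return torch.cat(logits, dim=0)

    def forward_with_dynamic_batch(model, input_ids):
        global optimal_low_mem_beam_search_bs
        # Start from the cached size if available, else the full n*k batch.
        batch_size = optimal_low_mem_beam_search_bs or input_ids.shape[0]
        while batch_size >= 1:
            try:
                out = run_sub_batches(model, input_ids, batch_size)
                optimal_low_mem_beam_search_bs = batch_size  # remember success
                return out
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                batch_size //= 2  # halve and retry, down to a minimum of 1
        raise RuntimeError("OOM even with a sub-batch size of 1")

One caveat of the cache: a transient OOM lowers the stored size for all subsequent steps, which is why it is periodically re-evaluated as generation proceeds.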

API Impact:

This update will be transparent to end users, involving no changes to the existing API. Users can expect improved efficiency without any alteration to the results produced by previous implementations.

Testing:

Existing tests confirm that the results from the low memory beam search align with those from the traditional beam search method. Specific tests for dynamic parallel processing sizes are not yet implemented. If you think it's worth adding some, I have a draft below.

Doc:

Do you think we should mention this in the doc? Currently we have

            sequential (`bool`, defaults to `False`):
                By default, beam search has `batch_size * num_beams` as effective batch size (see `beam_search()` for
                more details). This flag will avoid parallelizing the beam search and will instead run beam search
                sequentially.

in the doc
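For context, users opt into the sequential path via the low_memory flag on generate; the snippet below is a sketch of that existing usage (assuming the public low_memory generation argument), not part of this PR's diff:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    # low_memory=True runs beam search sequentially over sub-batches; with
    # this PR the sub-batch size would additionally adapt on OOM.
    output = model.generate(**inputs, num_beams=4, low_memory=True, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))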

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gante

@Saibo-creator force-pushed the low_mem_beam_search_auto_split branch from 8f46347 to e516a8f on February 5, 2024 at 12:28
gante (Contributor) commented Feb 14, 2024

Hi @Saibo-creator 👋

We're doing a sprint to add torch.compile support on generate (tracker), so I'm halting the addition of changes that substantially modify a decoding method until that is complete. In particular, beam search will have to be rewritten, so this PR will likely need to come in a different shape.

I'll keep you updated 🤗

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions (bot) closed this on Mar 20, 2024
Saibo-creator (Contributor, Author) commented

Hey @gante 👋

Any news on the timeline? I can adapt this as needed. 🤗

gante (Contributor) commented Mar 20, 2024

No, not yet. It's taking longer than we anticipated :)

Saibo-creator (Contributor, Author) commented

> No, not yet. It's taking longer than we anticipated :)

Good luck, vamos!
