fix: continuous batching in transformers serve #40479
Conversation
```python
num_blocks=1,
block_size=1024,
do_sample=False,
max_batch_tokens=10,
```
Happy to help 🤗 I just want to point out that while `num_blocks` and `max_batch_tokens` can be inferred from available GPU memory, if `block_size` is not given it simply defaults to 32, which is quite far from the previous 1024 here. Might not be important though!
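A minimal sketch of pinning `block_size` explicitly rather than relying on the 32-token default mentioned above, assuming `GenerationConfig` accepts these paged-attention fields as passthrough kwargs (the field names follow the diff; whether they belong on `GenerationConfig` in your version is an assumption):

```python
from transformers import GenerationConfig

# Sketch only: mirrors the values in the diff above.
generation_config = GenerationConfig(
    num_blocks=1,          # can be inferred from available GPU memory
    block_size=1024,       # defaults to 32 when omitted, far from 1024
    do_sample=False,
    max_batch_tokens=10,   # can also be inferred from available GPU memory
)
```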
Force-pushed: fb0c732 → 5f8c994
Force-pushed: a322182 → b93afe0
Force-pushed: b93afe0 → bc392de
LysandreJik left a comment
LGTM, thanks @McPatate! The only thing I'm a bit wary about is the change from `attn_implementation` toggling CB to the explicit `--continuous_batching` flag, especially as the latter still requires the former to be set.
Would it be possible to have the `--continuous_batching` flag also correctly toggle a paged attention method if one is not set?
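A minimal sketch of what that could look like; the argument names and the `"sdpa_paged"` backend string are assumptions for illustration, not the PR's actual code:

```python
def resolve_attn_implementation(
    continuous_batching: bool, attn_implementation: str | None
) -> str | None:
    """Sketch: derive a paged attention backend when --continuous_batching
    is set but no attn_implementation was given explicitly."""
    if continuous_batching and attn_implementation is None:
        # CB relies on a paged attention kernel; "sdpa_paged" is an assumed
        # backend name, swap in whatever your transformers version exposes.
        return "sdpa_paged"
    return attn_implementation
```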
I understand, I'm not super sure which direction I want to go with this.

Sounds good!
Fixing continuous batching in `transformers serve`.

- added a `--continuous_batching` cmd line flag to enable it, open to change this!
- can't repro my previous error, so I removed the added code and added a test to check that defaults are set correctly
- `max_new_tokens` can sometimes be `None`; set a default so it doesn't break CB, which expects it to be set
- added a `lifespan` to the `FastAPI` instance to stop `TimedModel`, and made `delete_model` "public" so we can cancel the `threading.Timer` that was causing the server to hang on SIGINT (see the first sketch after this list)
- added `request_id_iter` to iterate only over tokens linked to a given request_id
- updated `get_result` to requeue tokens if `request_id is not None && req.request_id != request_id` (before, we were losing tokens while iterating directly over all `output_queue` tokens; see the second sketch below)
- removed any trace of the tokenizer within the CB impl; it didn't make sense to have it there, as we already expect encoded tokens. Leaving it up to the caller to decode (updated the serving code accordingly)
- moved the `DecodeStream` object to live in the `RequestState` rather than being a single instance linked to the manager
- removed `next_token` from `RequestState` as it wasn't used; in streaming I use `generated_tokens[-1]` to get the latest token
- changed the `prepare_next_batch` signature; it now returns a bool to short-circuit the inner generation loop when it didn't prepare anything
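A minimal sketch of the `lifespan` shutdown hook described above, assuming a `TimedModel` wrapper whose `delete_model` cancels the pending `threading.Timer`; beyond those two names from the description, everything here is illustrative rather than the PR's actual code:

```python
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI


class TimedModel:
    """Illustrative stand-in: schedules a model unload after inactivity."""

    def __init__(self, timeout_seconds: float = 300.0):
        self._timer = threading.Timer(timeout_seconds, self.delete_model)
        self._timer.start()

    def delete_model(self):
        # "Public" so the lifespan hook below can call it on shutdown.
        self._timer.cancel()  # no-op if the timer already fired
        # ... actual model teardown would happen here ...


timed_model = TimedModel()


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # requests are served while suspended here
    timed_model.delete_model()  # shutdown: cancel the timer, free the model


app = FastAPI(lifespan=lifespan)
```

The key point is that a live non-daemon `Timer` thread blocks interpreter exit, so cancelling it on shutdown is what lets SIGINT terminate the server cleanly.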
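And a rough sketch of the `get_result` requeue plus `request_id_iter` filtering; the queue shape, the `GenerationOutput` fields, and the completion handling are all assumptions based on the description above:

```python
import queue
from dataclasses import dataclass


@dataclass
class GenerationOutput:
    request_id: str
    token: int


output_queue: "queue.Queue[GenerationOutput]" = queue.Queue()


def get_result(request_id: str | None = None, timeout: float = 0.1):
    """Pop one result; if it belongs to another request, requeue it instead
    of dropping it (the bug was losing tokens while iterating directly over
    all output_queue entries)."""
    req = output_queue.get(timeout=timeout)  # raises queue.Empty on timeout
    if request_id is not None and req.request_id != request_id:
        output_queue.put(req)  # not ours: put it back for its own consumer
        return None
    return req


def request_id_iter(request_id: str):
    """Yield only the tokens linked to a given request_id."""
    while True:  # a real impl would also stop once the request is finished
        try:
            result = get_result(request_id=request_id)
        except queue.Empty:
            break
        if result is not None:
            yield result
```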
transformers serve.--continuous_batchingcmd line flag to enable, open to change this!can't repro my previous error, removed the added code and added a test to check if defaults are set correctlymax_new_tokenscan sometimes beNone, set a default so it doesn't break CB which expects it to be setlifespanto theFastAPIinstancestopTimedModelto makedelete_model"public" so we cancel thethreading.Timerthat was causing the server to hang on SIGINTrequest_id_iterto iterate only on tokens linked to a given request_idget_resultto requeue tokens ifrequest_id is not None && req.request_id != request_id(before we were losing tokens while iterating directly on all output_queue tokens)moved theremoved any trace of tokenizer within CB impl, didn't make sense to have here as we already are expecting encoded tokens. Leaving it up to the caller to decode (updated the serving code adequately)DecodeStreamobject to live in theRequestStaterather than being single instance linked to the managernext_tokenfromRequestStateas it wasn't used, in streaming I've usedgenerated_tokens[-1]to get latest tokenprepare_next_batchsignature, now returns a bool to short circuit inner generation loop when it didn't prepare anything