
Paged attention new #1522

Closed
Bob-Chen222 wants to merge 52 commits into specscheduler from paged_attention_new

Conversation

Contributor

@Bob-Chen222 Bob-Chen222 commented Oct 10, 2024

Description of changes:

I have added paged attention for the spec scheduler. I will clean up the print statements and add more documentation tomorrow. Let me know if anything needs to be changed!

One thing to note is that both the specscheduler branch and this branch suffer from an "invalid argument" error in CUDA, but I think a small fix would solve this problem on both branches.
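For context, the core idea of paged attention is to replace each request's contiguous KV-cache allocation with fixed-size physical pages looked up through a per-request block table. A minimal Python sketch of that idea (all names here are illustrative, not FlexFlow's actual classes):

```python
class PagedKVCache:
    """Toy paged KV cache: logical token positions map onto physical pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> list of physical page ids
        self.seq_lens = {}       # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve one KV slot; allocate a fresh page on each page boundary."""
        n = self.seq_lens.get(req_id, 0)
        if n % self.page_size == 0:  # current page full (or first token)
            self.block_tables.setdefault(req_id, []).append(self.free_pages.pop())
        self.seq_lens[req_id] = n + 1

    def physical_slot(self, req_id, pos):
        """Translate a logical token position into (page id, offset in page)."""
        page = self.block_tables[req_id][pos // self.page_size]
        return page, pos % self.page_size

    def free_request(self, req_id):
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.block_tables.pop(req_id, []))
        self.seq_lens.pop(req_id, None)
```

Attention kernels then gather K/V entries through the block table, so a request's cache can grow page by page instead of reserving its maximum length up front.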



@Bob-Chen222 Bob-Chen222 marked this pull request as draft October 10, 2024 20:09
@Bob-Chen222 Bob-Chen222 marked this pull request as ready for review October 12, 2024 06:59
@chenzhuofu chenzhuofu self-requested a review October 31, 2024 17:07
@Bob-Chen222
Contributor Author

Some more updates:

  1. Added max-kv-cache-size as a flag. At initialization, the page manager is constructed with the number of hidden layers as input, so it knows how much KV cache must be allocated per transformer layer. Then, when we initialize the metadata of the inference operators, we call the page manager to get the required KV cache size and allocate the slots accordingly.
  2. Added paged attention support for incr_decoding.
  3. Cleaned up the comments and reorganized the formatting.
  4. After all these changes, performance is nearly the same as before.
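The flow in item 1 can be sketched roughly as follows. This is a hypothetical Python sketch assuming the KV-cache budget is split evenly across layers; the names (PageManager, init_attention_metadata, bytes_per_slot) are illustrative, not FlexFlow's actual API:

```python
class PageManager:
    """Constructed with the model's layer count and the max-kv-cache-size budget."""

    def __init__(self, num_hidden_layers, max_kv_cache_size, bytes_per_slot):
        # Assumption: divide the total KV-cache budget evenly per transformer layer.
        self.num_hidden_layers = num_hidden_layers
        per_layer_bytes = max_kv_cache_size // num_hidden_layers
        self.slots_per_layer = per_layer_bytes // bytes_per_slot

    def kv_cache_slots_for_layer(self, layer_idx):
        assert 0 <= layer_idx < self.num_hidden_layers
        return self.slots_per_layer


def init_attention_metadata(page_manager, layer_idx):
    # At operator-metadata initialization, query the page manager for the
    # per-layer KV-cache size and record how many slots to allocate.
    num_slots = page_manager.kv_cache_slots_for_layer(layer_idx)
    return {"layer": layer_idx, "kv_slots": num_slots}
```

The point of routing sizing through the page manager is that the attention operators no longer hard-code their cache size; it is derived once from the flag and the layer count.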

@lockshaw
Collaborator

lockshaw commented Jan 9, 2025

Moved to flexflow/flexflow-serve#82

@lockshaw lockshaw closed this Jan 9, 2025
