Efficient Attention #366
Conversation
The documentation is not available anymore as the PR was closed or merged.
@patrickvonplaten Perhaps it is worth looking at this implementation, which also adapts to a lack of VRAM by reducing inference speed; when there is enough VRAM, it runs just as fast.
Hey @Inkorak, I think this is more or less exactly the implementation I used - it's just the checkpointing that I think we haven't implemented yet.
@patrickvonplaten As far as I understand, this implementation looks at how much VRAM is free and chunks things accordingly. It's only chunking as much as it needs to retain performance.
I see! That's smart! I think for this codebase it's a bit too much "black magic" and requires too much complex code. We value readability and intuitive code design a lot, and I think here we have a bit too much logic that doesn't necessarily correspond to the attention mechanism itself. In my experience, just setting the "chunk size" to half the number of attention heads works pretty well, and we only lose 10% speed.
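For readers following along, here is a minimal sketch of the chunking idea being discussed (not the code from this PR): attention is computed one slice of the batch × heads dimension at a time, so the full attention-weight tensor is never materialized for every head at once. Function and variable names are illustrative only.

```python
import torch


def sliced_attention(query, key, value, slice_size):
    # query, key, value: (batch * heads, seq_len, head_dim)
    # slice_size: how many batch*heads rows to process per step; smaller
    # values lower peak memory at the cost of a bit of speed.
    scale = query.shape[-1] ** -0.5
    output = torch.empty_like(query)
    for start in range(0, query.shape[0], slice_size):
        end = start + slice_size
        # attention weights for this slice only: (slice, seq_len, seq_len)
        weights = torch.bmm(query[start:end], key[start:end].transpose(1, 2)) * scale
        weights = weights.softmax(dim=-1)
        output[start:end] = torch.bmm(weights, value[start:end])
    return output


# e.g. 1 image, 8 heads, 64x64 latents, head_dim 40; slice_size 4 == half the heads
q = k = v = torch.randn(8, 4096, 40)
out = sliced_attention(q, k, v, slice_size=4)
```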
```python
def set_attention_chunk(self, chunk_size: Optional[Union[str, int]] = "auto"):
    # set chunk_size = `None` to disable `set_attention_chunk`
    if chunk_size == "auto":
```
Not sure about having it as a setter function, but adding another argument to __call__ isn't ideal either...
My reasoning is that we'll have multiple levels of optimization (fp16, accelerate loader, chunked softmax), and it would be great to have one place to control them all.
But maybe @patil-suraj or @pcuenca have a better idea where to put it API-wise, I'm just a bit conflicted about having a function here.
Hmm - I don't really see the problem here to be honest. Note that this is more or less exactly the same as gradient_checkpointing_enable(), which is also a setter function (we trade speed for memory). fp16 is also a setter function. Also, I don't fully understand what you mean by "one place" for all?
Other options are:
- putting it in the config: no-go, as it's not at all related to the architecture of the model
- forward argument to forward(): don't like this - IMO parameters to the forward function should be limited to either tensors that are used or flags that change the type of output that is returned (e.g. return_dict, output_type). Optimization parameters should be setters IMO (like gradient checkpointing, half(), ...)
Agree with Patrick here. The other options aren't really ideal, so let's go with this.
Although, I'm not sure if it's best to keep this in the pipeline - what if we do pipe.unet.set_attention_chunk, as this is specific to the model?
I see the point, but we don't have a generally good default for all UNets - it's a good default for stable diffusion, so I think we need a pipeline function here.
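To illustrate the placement question (pipeline vs. model), here is a rough sketch of what a pipeline-level setter could look like if it only resolves the stable-diffusion-specific "auto" default and then hands the concrete value to the UNet. The model-level `_set_attention_chunk` hook is hypothetical, not an existing diffusers API.

```python
from typing import Optional, Union


class StableDiffusionPipelineSketch:
    # Illustrative only: the pipeline knows the good stable-diffusion default,
    # the model just receives a concrete chunk size.
    def set_attention_chunk(self, chunk_size: Optional[Union[str, int]] = "auto"):
        if chunk_size == "auto":
            # good default for stable diffusion, not necessarily for other UNets
            chunk_size = self.unet.config.attention_head_dim // 2
        self.unet._set_attention_chunk(chunk_size)  # hypothetical model-level hook
```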
Awesome stuff! IMO the best I can think of right now is:
Naming is indeed important here, and I'm also not super happy with "chunked attention". Maybe
Good points! Another option that came to mind: |
patil-suraj left a comment:
Thanks a lot for adding this! My only main comment is: should we expose set_attention_chunk here, given it's very much related to the unet model? What do we think about doing pipe.unet.set_attention_chunk?
```python
if chunk_size == "auto":
    # half the attention head size is usually a good trade-off between
    # speed and memory
    chunk_size = self.unet.config.attention_head_dim // 2
```
I think this should be handled in the model.
Don't agree - this is a good default for stable diffusion, not for general UNets
Want to go away from
100% of the credit goes to the amazing thread here: basujindal/stable-diffusion#117
By doing:
VRAM can be reduced from 4.6GB to 3.4GB at only 10% slower inference. Try it out:
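A rough usage sketch based on the API proposed in this PR (the method name was still under discussion at review time, so the final call may differ); the model id and output handling assume the standard StableDiffusionPipeline interface:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Enable chunked attention; "auto" picks the stable-diffusion default
# (half the attention head dim), trading ~10% speed for lower peak VRAM.
pipe.set_attention_chunk("auto")

image = pipe("a photo of an astronaut riding a horse").images[0]
```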