Use torch.bool instead of torch.int64 for non-persistent causal mask buffer #29241
Merged
fxmarty merged 1 commit into huggingface:main on Feb 26, 2024
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Contributor
Can confirm this shrunk my tiny-random-GemmaForCausalLM ONNX export from ~500 MB to ~70 MB (PR). Ideally, there would be no overhead, but I think this helps a ton for now!
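For reference, a rough sketch of how one might reproduce that size comparison. The checkpoint name, dummy input shape, and export settings below are assumptions, not the exact setup from the linked PR:

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; the comment above refers to a tiny random Gemma model.
model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-GemmaForCausalLM")
model.eval()
model.config.return_dict = False  # plain tuple outputs trace more cleanly
model.config.use_cache = False    # keep cache objects out of the traced graph

input_ids = torch.randint(0, model.config.vocab_size, (1, 8))
torch.onnx.export(model, (input_ids,), "model.onnx", input_names=["input_ids"])

# With the int64 mask the file is dominated by the baked-in buffer (~500 MB);
# with the bool mask it drops to roughly 70 MB.
print(f"{os.path.getsize('model.onnx') / 1e6:.0f} MB")
```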
amyeroberts approved these changes on Feb 26, 2024
Contributor
amyeroberts left a comment
LGTM - thanks for digging into this and fixing!
Happy to merge once slow model tests for gemma and llama are confirmed to be passing.
Contributor
Author
@amyeroberts Running on A100, I can confirm that no additional tests are failing with this change.
ArthurZucker pushed a commit that referenced this pull request on Feb 28, 2024
Use torch.bool instead of torch.int64 for non-persistent causal mask buffer (#29241)
ArthurZucker pushed a commit that referenced this pull request on Mar 1, 2024
Use torch.bool instead of torch.int64 for non-persistent causal mask buffer (#29241)
Adding
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
in @ArthurZucker's rewrite of llama & gemma adds a 500 MB overhead when serializing to ONNX, TorchScript IR, or a PyTorch ExportedProgram (https://pytorch.org/docs/stable/export.html) for max_position_embeddings=8192. Essentially, these IRs do not support non-persistent buffers, so the mask gets baked into the serialized graph as a constant. One quick fix is to use torch.bool instead of torch.int64, but bool still takes 8 bits per element in PyTorch (pytorch/pytorch#41571), so the overhead is still ~70 MB.
The lowered overhead is acceptable to me, but this won't scale to 10M context length.
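For concreteness, here is a minimal sketch of why the dtype drives the serialized size. The module name is hypothetical; only the mask construction and register_buffer call mirror the line quoted above:

```python
import torch
import torch.nn as nn

class CausalMaskHolder(nn.Module):
    # Hypothetical container module; the buffer setup mirrors the PR.
    def __init__(self, max_position_embeddings: int = 8192):
        super().__init__()
        causal_mask = torch.full(
            (max_position_embeddings, max_position_embeddings),
            fill_value=True, dtype=torch.bool,
        )
        # Exporters (ONNX, TorchScript, torch.export) inline non-persistent
        # buffers as graph constants, so the dtype sets the serialized size:
        #   torch.int64: 8192 * 8192 * 8 B = 512 MiB  (~500 MB)
        #   torch.bool:  8192 * 8192 * 1 B =  64 MiB  (~70 MB) -- bool is one
        #   byte per element, not one bit (pytorch/pytorch#41571).
        self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)

mask = CausalMaskHolder().causal_mask
print(mask.nelement() * mask.element_size() / 2**20, "MiB")  # 64.0 MiB
```

At a 10M-token context length even a bool mask would weigh (10^7)^2 bytes, roughly 100 TB, hence the scaling concern above.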