Add NucleusX Model #27259
Conversation
These weights are not released yet; we are planning to release them at this link soon! We have included the link to pass the configuration tests, which require a link to the checkpoint.
The current test failure at
cc @sippycoder and @LysandreJik!
Hey! Thanks for opening the PR, I'll let @Rocketknight1 do a first review as he is more familiar with this kind of model!
Hi all! RetNets seem like a really interesting architecture, so I'm quite excited to take a look - I'll try to review this in the next day or two. |
Hi all, I just looked through this! Overall, the core modelling code looks very solid, and I couldn't find much to complain about. We normally encourage the use of `# Copied from` in these PRs, but given that RetNets differ significantly from Transformers, most functions here will be unique to NucleusX.
I also think the test coverage is good, in particular the tests confirming that outputs are equivalent in parallel/recurrent/chunkwise mode.
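For readers unfamiliar with retention, the parallel/recurrent equivalence those tests check can be sketched for a single retention head. This is a toy NumPy sketch (the decay `gamma` and shapes are illustrative assumptions, not the PR's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
gamma = 0.9  # assumed per-head decay
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel form: O = (Q K^T * D) V, with causal decay mask D[n, m] = gamma**(n - m)
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
O_parallel = (Q @ K.T * D) @ V

# Recurrent form: carry a d x d state instead of attending over the whole prefix
S = np.zeros((d, d))
O_recurrent = np.empty_like(O_parallel)
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    O_recurrent[t] = Q[t] @ S

assert np.allclose(O_parallel, O_recurrent)
```

Since `Q[t] @ S` expands to the same decayed sum over the prefix as the masked parallel product, both paths produce identical outputs up to floating-point error, which is exactly what the PR's equivalence tests verify at full model scale.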
Before we can merge, though, we need checkpoints to be uploaded. Also, we ideally need an integration test. The purpose of these tests is to ensure that model output for a specific checkpoint remains numerically constant, which is very important to ensure that future updates don't create silent errors. Here is an example of an integration test that you can copy for NucleusX.
If your checkpoints are too large for our CI, we can make a tiny-random-nucleusx model to use for the integration test. An integration test confirming generation output remains constant when do_sample=False would also be helpful!
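The shape of such an integration test can be sketched as follows. Note this uses a hand-built stand-in model with fully determined weights, since the real NucleusX checkpoint is not yet released; a real test would load the checkpoint with `from_pretrained` and pin constants captured once from a trusted run:

```python
import torch

# Stand-in model with fully determined weights (a real integration test would
# load the released NucleusX checkpoint here instead)
model = torch.nn.Linear(3, 3, bias=False)
with torch.no_grad():
    model.weight.copy_(torch.eye(3) * 2.0)
model.eval()

inputs = torch.tensor([[1.0, 2.0, 3.0]])
with torch.no_grad():
    out = model(inputs)

# Pinned from a trusted earlier run; future refactors must reproduce it exactly
EXPECTED = torch.tensor([[2.0, 4.0, 6.0]])
torch.testing.assert_close(out, EXPECTED)
```

The point of the pattern is the pinned `EXPECTED` tensor: any silent numerical change in a later refactor fails the assertion immediately.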
Overall though, this looks like a really solid PR, and I suspect we shouldn't have much trouble including this in transformers. Thank you for your contribution!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```diff
- >>> configuration = NucleusXConfig()
+ >>> configuration = NucleusXConfig(decoder_layers=2)
```
Also, one more comment! The doctest runner is crashing on this file, and I suspect the reason is that it's running out of memory because you're initializing a 7B model in float32 and so using 28GB of memory, which is a lot for the doctest runner! Maybe change this line to initialize a much smaller model?
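The arithmetic behind that 28GB estimate is simply parameter count times bytes per parameter:

```python
# 7B parameters in float32 (4 bytes each) -> roughly 28 GB just for the weights
params = 7_000_000_000
bytes_per_param = 4  # float32; float16/bfloat16 would halve this
gb = params * bytes_per_param / 1e9
print(gb)  # 28.0
```

Initializing a config with only a couple of layers and a small hidden size keeps the doctest model well under the runner's memory budget.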
I was trying to find out why the doctest was failing for a long time, and this makes total sense!
@Rocketknight1 Thanks for reviewing this PR! I have gone through the comments and resolved them. There are also some other updates:
There are other minor changes, which can be found in the commit logs. As for the weight release, we are working hard to make that happen :) We'll ping here when the weights are ready for public release. Thanks again!
When use_cache=True, NucleusXMultiScaleRetention.parallel_forward is less efficient, because it must also compute past_key_values, which incurs an extra O(T^2) computation.
use_cache=True should therefore be set only when we want to do a recurrent forward following the parallel forward (e.g. during generation, we compute the prompt in parallel but generate new tokens in recurrent mode).
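The intended split can be illustrated with a toy single-head retention sketch in NumPy (decay and shapes are hypothetical): the parallel pass covers the prompt, the only cache generation needs is the running d×d state, and continuing recurrently from that state matches a full parallel pass over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 4, 0.9  # assumed head dim / decay

def parallel(Q, K, V):
    T = len(Q)
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (Q @ K.T * D) @ V

Q, K, V = (rng.standard_normal((8, d)) for _ in range(3))
prompt_len = 6

# The cache after the prompt: one running d x d state, built in O(T) time
S = np.zeros((d, d))
for t in range(prompt_len):
    S = gamma * S + np.outer(K[t], V[t])

# Generate the remaining steps recurrently from the cached state
steps = []
for t in range(prompt_len, 8):
    S = gamma * S + np.outer(K[t], V[t])
    steps.append(Q[t] @ S)

# Matches the last rows of a full parallel pass over all 8 positions
assert np.allclose(np.stack(steps), parallel(Q, K, V)[prompt_len:])
```

This is why use_cache=True only pays off right before a mode switch: the cache is just the compact state, and recomputing it inside every parallel call wastes the quadratic work.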
cc @gante to this bit - is our generation code ready to handle networks with multiple forward modes?
Also @syncdoth, while we're waiting for the checkpoints, there are some unrelated tests failing in this PR. If you pull the latest version from main and rebase your PR, that should fix them.
@Rocketknight1 @gante For your question in another comment about the generation code handling networks with multiple forward modes, this is my take: when we call prepare_inputs_for_generation and detect past_key_values, the prompt has already been computed, so the recurrent forward is better. Hence the forward_mode = "recurrent" line below.
Note that forward_mode is just a string (used like an enum) that the model takes as forward input to select the forward mode at each forward step!
@Rocketknight1 @syncdoth We can support multiple generation modes, but the implementation depends on a few factors! The audio models are the best examples. For instance:
- Whisper wraps `generate` and accepts additional flags. These flags trigger additional arguments for `generate` (e.g. a custom logits processor to generate timestamps) or postprocessing.
- Bark contains 3 internal models, and wraps `generate` to call `generate` on them in sequence.

Additionally, if `model.forward` accepts multiple modes, you can also prepare the flags in `model.prepare_inputs_for_generation`, as written above :)
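The wrap-`generate` pattern can be sketched as follows (all names here are hypothetical stand-ins, not the real Whisper or NucleusX implementation):

```python
class GenerateWrapperSketch:
    """Hypothetical sketch of the pattern above: wrap generate() so a
    model-specific flag turns into extra generation arguments."""

    def base_generate(self, prompt, **kwargs):
        # Stand-in for GenerationMixin.generate; just echoes what it received
        return {"prompt": prompt, **kwargs}

    def generate(self, prompt, forward_mode="parallel", **kwargs):
        # Translate the model-specific flag into a forward() kwarg that will
        # reach every decoding step
        kwargs["forward_mode"] = forward_mode
        return self.base_generate(prompt, **kwargs)
```

The wrapper keeps the public `generate` signature familiar while letting the model add its own knobs on top.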
This may be a beginner question, but should I rebase main and (force?) push, or merge main and push?
Probably the easiest way to do it is to pull the latest version of main, then rebase your branch onto main, and then force push.
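A minimal sketch of that workflow, demonstrated on throwaway local repos so it can run anywhere; in a real PR, `upstream` would be the main transformers repo, `origin` your fork, and the branch name here is hypothetical:

```shell
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-ins: an upstream repo whose main has moved on, and your fork of it
git init -q -b main "$work/upstream"
git -C "$work/upstream" commit -q --allow-empty -m "base commit on main"
git clone -q "$work/upstream" "$work/fork"
git -C "$work/upstream" commit -q --allow-empty -m "unrelated CI fix on main"

cd "$work/fork"
git checkout -q -b add-nucleusx
git commit -q --allow-empty -m "add NucleusX model"
git remote add upstream "$work/upstream"

# The actual workflow: fetch the latest main, rebase the PR branch onto it,
# then force-push the rewritten branch to your fork
git fetch -q upstream
git rebase -q upstream/main
git push -q --force-with-lease origin add-nucleusx
```

`--force-with-lease` is a safer alternative to a plain `--force`: it refuses to overwrite the remote branch if someone else pushed to it since your last fetch.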
(force-pushed c5f48f5 to 244e64c)
Hi @syncdoth, do you know what happened to Nucleus AI? The website is now down |
This is unrelated to this PR, but there's some maintenance going on with the website. Hang tight :)

btw @syncdoth if you're still getting test failures, try 'sync upstream' on the
(force-pushed 8663ad0 to 0d9ee02)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Don't stale, please! This looks quite close to being ready! (cc @syncdoth - let me know if you need any help with the last bit!)
We are on the verge of releasing the weights! There's been a bit of a delay in the schedule 🥲 The last bit is updating the weight links in the docs and writing the integration tests; we are working hard on it!
…tivation_fn=swish
Removes the accidentally added comma in tests/generation/test_utils.py
When loading the model weights in a dtype other than fp32, `.float()` statements may cause trouble. This commit makes tensor creation and float casting aware of the `dtype` of the weights.
Reason: since we are using GLU, sub-layernorm is not well-defined.
This follows the example of other models, such as LongT5, idefics, llama, etc.
This resolves the comments by @Rocketknight1
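The dtype-aware pattern described in the dtype commit above can be sketched as follows (a hypothetical helper, not the PR's actual code): derive new tensors from an existing parameter's device and dtype rather than hard-coding `.float()`.

```python
import torch

def decay_mask(length: int, gamma: float, like: torch.Tensor) -> torch.Tensor:
    # Build the causal decay mask on the reference tensor's device, then cast
    # to its dtype instead of forcing float32
    idx = torch.arange(length, device=like.device)
    exponents = (idx[:, None] - idx[None, :]).clamp(min=0)
    mask = torch.tril(gamma ** exponents)
    return mask.to(like.dtype)

weight = torch.zeros(2, 2, dtype=torch.float16)  # e.g. a weight loaded in fp16
mask = decay_mask(4, 0.9, weight)
assert mask.dtype == torch.float16
```

Keeping intermediate tensors in the weights' dtype avoids silent fp32 upcasts that both waste memory and break models loaded in fp16/bf16.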
0d9ee02 to
b624d05
Compare
Hi @Rocketknight1, I'm seeing a test failure related to document building, and to running the NucleusXForCausalLM.forward example. It seems that it might be due to
Does it require some tinkering to use? I dumped the source to the model folder, edited the config to treat it as

```
In [7]: print(tokenizer.decode(model.generate(**tokenizer("Hello my name is", return_tensors="pt").to("cuda"), max_new_tokens=20, do_sample=False, forward_mode="parallel").ravel()))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/fella/src/llama/text-generation-webui/models/NucleusAI_Nucleus-X/modeling_nucleus_x.py:370: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  cache = (current_kv, scale, torch.tensor(prev_seqlen + 1, dtype=torch.long))
<s> Hello my name is Tina and I am a 25 year old female. I am a very outgoing person
```

but recurrent, no:

```
In [8]: print(tokenizer.decode(model.generate(**tokenizer("Hello my name is", return_tensors="pt").to("cuda"), max_new_tokens=20, do_sample=False, forward_mode="recurrent").ravel()))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/fella/src/llama/text-generation-webui/models/NucleusAI_Nucleus-X/modeling_nucleus_x.py:370: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  cache = (current_kv, scale, torch.tensor(prev_seqlen + 1, dtype=torch.long))
<s> Hello my name is the most of the world.
The first thing I noticed was the size of the room. It
```

(Even if I say in config.json to use the recurrent forward mode, a 16KB prompt fails to pass through model.generate unless I use forward_mode='recurrent')
Hi @syncdoth, sorry for the Christmas delay! You're correct, though - the issue is almost certainly caused by the docstring trying to load a model too big for the test runner. Is there any smaller checkpoint we can use? You could also try
Haha, please don't stale this! We are still working hard to put out the model. We are working on a small model to pass the PR requirements, but it has been a lower priority, unfortunately :( We will finish this by mid-February!
No worries 🤗
What does this PR do?
This PR adds a new model named NucleusX. This model is contributed by Sehyun Choi and NucleusAI. The model is based on the Retentive Network architecture, and the code is largely adapted from this repo, which in turn borrows core implementations from torchscale. We are planning to release our paper and weights soon.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
We kindly request the review of this new model from @ArthurZucker and @younesbelkada!
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.