Add NucleusX Model #27259
Conversation
These weights are not released yet; we are planning to release them at this link soon! We have included the link to pass the configuration tests, which require a link to the checkpoint.
The current test failure at
cc @sippycoder and @LysandreJik!
Hey! Thanks for opening the PR, I'll let @Rocketknight1 do a first review as he is more familiar with this kind of model!
Hi all! RetNets seem like a really interesting architecture, so I'm quite excited to take a look - I'll try to review this in the next day or two. |
Hi all, I just looked through this! Overall, the core modelling code looks very solid, and I couldn't find much to complain about. We normally encourage the use of `# Copied from` in these PRs, but given that RetNets differ significantly from Transformers, most functions here will be unique to NucleusX.
I also think the test coverage is good, in particular the tests confirming that outputs are equivalent in parallel/recurrent/chunkwise mode.
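For readers unfamiliar with retention, the parallel/recurrent equivalence those tests check can be sketched for a single retention head. This is a toy NumPy sketch (the decay `gamma` and shapes are illustrative assumptions, not the PR's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
gamma = 0.9  # assumed per-head decay
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel form: O = (Q K^T * D) V, with causal decay mask D[n, m] = gamma**(n - m)
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
O_parallel = (Q @ K.T * D) @ V

# Recurrent form: carry a d x d state instead of attending over the whole prefix
S = np.zeros((d, d))
O_recurrent = np.empty_like(O_parallel)
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    O_recurrent[t] = Q[t] @ S

assert np.allclose(O_parallel, O_recurrent)
```

Since `Q[t] @ S` expands to the same decayed sum over the prefix as the masked parallel product, both paths produce identical outputs up to floating-point error, which is exactly what the PR's equivalence tests verify at full model scale.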
Before we can merge, though, we need checkpoints to be uploaded. Also, we ideally need an integration test. The purpose of these tests is to ensure that model output for a specific checkpoint remains numerically constant, which is very important to ensure that future updates don't create silent errors. Here is an example of an integration test that you can copy for NucleusX.
If your checkpoints are too large for our CI, we can make a tiny-random-nucleusx model to use for the integration test. An integration test confirming generation output remains constant when do_sample=False would also be helpful!
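The shape of such an integration test can be sketched as follows. Note this uses a hand-built stand-in model with fully determined weights, since the real NucleusX checkpoint is not yet released; a real test would load the checkpoint with `from_pretrained` and pin constants captured once from a trusted run:

```python
import torch

# Stand-in model with fully determined weights (a real integration test would
# load the released NucleusX checkpoint here instead)
model = torch.nn.Linear(3, 3, bias=False)
with torch.no_grad():
    model.weight.copy_(torch.eye(3) * 2.0)
model.eval()

inputs = torch.tensor([[1.0, 2.0, 3.0]])
with torch.no_grad():
    out = model(inputs)

# Pinned from a trusted earlier run; future refactors must reproduce it exactly
EXPECTED = torch.tensor([[2.0, 4.0, 6.0]])
torch.testing.assert_close(out, EXPECTED)
```

The point of the pattern is the pinned `EXPECTED` tensor: any silent numerical change in a later refactor fails the assertion immediately.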
Overall though, this looks like a really solid PR, and I suspect we shouldn't have much trouble including this in transformers. Thank you for your contribution!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```diff
- >>> configuration = NucleusXConfig()
+ >>> configuration = NucleusXConfig(decoder_layers=2)
```
Also, one more comment! The doctest runner is crashing on this file, and I suspect the reason is that it's running out of memory because you're initializing a 7B model in float32 and so using 28GB of memory, which is a lot for the doctest runner! Maybe change this line to initialize a much smaller model?
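The arithmetic behind that 28GB estimate is simply parameter count times bytes per parameter:

```python
# 7B parameters in float32 (4 bytes each) -> roughly 28 GB just for the weights
params = 7_000_000_000
bytes_per_param = 4  # float32; float16/bfloat16 would halve this
gb = params * bytes_per_param / 1e9
print(gb)  # 28.0
```

Initializing a config with only a couple of layers and a small hidden size keeps the doctest model well under the runner's memory budget.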
I was trying to find out why the doctest was failing for a long time, and this makes total sense!
@Rocketknight1 Thanks for reviewing this PR! I have gone through the comments and resolved them. There are also some other updates:
There are other minor changes, which can be found in the commit logs. As for the weight release, we are working hard to make that happen :) We'll ping here when the weights are ready for public release. Thanks again!
When use_cache=True, NucleusXMultiScaleRetention.parallel_forward is less efficient, because it must also compute past_key_values, which incurs an extra O(T^2) computation.
use_cache=True should therefore be set only when we want to do a recurrent forward following the parallel forward (e.g. during generation, we compute the prompt in parallel but generate new tokens in recurrent mode).
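The intended split can be illustrated with a toy single-head retention sketch in NumPy (decay and shapes are hypothetical): the parallel pass covers the prompt, the only cache generation needs is the running d×d state, and continuing recurrently from that state matches a full parallel pass over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 4, 0.9  # assumed head dim / decay

def parallel(Q, K, V):
    T = len(Q)
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (Q @ K.T * D) @ V

Q, K, V = (rng.standard_normal((8, d)) for _ in range(3))
prompt_len = 6

# The cache after the prompt: one running d x d state, built in O(T) time
S = np.zeros((d, d))
for t in range(prompt_len):
    S = gamma * S + np.outer(K[t], V[t])

# Generate the remaining steps recurrently from the cached state
steps = []
for t in range(prompt_len, 8):
    S = gamma * S + np.outer(K[t], V[t])
    steps.append(Q[t] @ S)

# Matches the last rows of a full parallel pass over all 8 positions
assert np.allclose(np.stack(steps), parallel(Q, K, V)[prompt_len:])
```

This is why use_cache=True only pays off right before a mode switch: the cache is just the compact state, and recomputing it inside every parallel call wastes the quadratic work.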
cc @gante to this bit - is our generation code ready to handle networks with multiple forward modes?
Also @syncdoth, while we're waiting for the checkpoints, there are some unrelated tests failing in this PR. If you pull the latest version from main and rebase your PR, that should fix them.
@Rocketknight1 @gante For your question in another comment about the generation code handling networks with multiple forward modes, this is my take: when we call prepare_inputs_for_generation and detect past_key_values, the prompt has already been computed, so the recurrent forward is better. Hence the forward_mode = "recurrent" line below.
Note that forward_mode is just a string (used like an enum) that the model takes as forward input to select the forward mode at each forward step!
@Rocketknight1 @syncdoth We can support multiple generation modes, but the implementation depends on a few factors! The audio models are the best examples. For instance:
- Whisper wraps `generate` and accepts additional flags. These flags trigger additional arguments for `generate` (e.g. a custom logits processor to generate timestamps) or postprocessing.
- Bark contains 3 internal models, and wraps `generate` to call `generate` on them in sequence.

Additionally, if `model.forward` accepts multiple modes, you can also prepare the flags in `model.prepare_inputs_for_generation`, as written above :)
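The wrap-`generate` pattern can be sketched as follows (all names here are hypothetical stand-ins, not the real Whisper or NucleusX implementation):

```python
class GenerateWrapperSketch:
    """Hypothetical sketch of the pattern above: wrap generate() so a
    model-specific flag turns into extra generation arguments."""

    def base_generate(self, prompt, **kwargs):
        # Stand-in for GenerationMixin.generate; just echoes what it received
        return {"prompt": prompt, **kwargs}

    def generate(self, prompt, forward_mode="parallel", **kwargs):
        # Translate the model-specific flag into a forward() kwarg that will
        # reach every decoding step
        kwargs["forward_mode"] = forward_mode
        return self.base_generate(prompt, **kwargs)
```

The wrapper keeps the public `generate` signature familiar while letting the model add its own knobs on top.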
This may be a beginner question, but should I rebase main and (force?) push, or merge main and push?
Probably the easiest way to do it is to pull the latest version of main, then rebase your branch onto main, and then force push.
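A minimal sketch of that workflow, demonstrated on throwaway local repos so it can run anywhere; in a real PR, `upstream` would be the main transformers repo, `origin` your fork, and the branch name here is hypothetical:

```shell
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-ins: an upstream repo whose main has moved on, and your fork of it
git init -q -b main "$work/upstream"
git -C "$work/upstream" commit -q --allow-empty -m "base commit on main"
git clone -q "$work/upstream" "$work/fork"
git -C "$work/upstream" commit -q --allow-empty -m "unrelated CI fix on main"

cd "$work/fork"
git checkout -q -b add-nucleusx
git commit -q --allow-empty -m "add NucleusX model"
git remote add upstream "$work/upstream"

# The actual workflow: fetch the latest main, rebase the PR branch onto it,
# then force-push the rewritten branch to your fork
git fetch -q upstream
git rebase -q upstream/main
git push -q --force-with-lease origin add-nucleusx
```

`--force-with-lease` is a safer alternative to a plain `--force`: it refuses to overwrite the remote branch if someone else pushed to it since your last fetch.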
(force-pushed c5f48f5 to 244e64c)
Hi @syncdoth, do you know what happened to Nucleus AI? The website is now down |
This is unrelated to this PR, but there's some maintenance going on with the website. Hang tight :)

btw @syncdoth if you're still getting test failures, try 'sync upstream' on the
(force-pushed 8663ad0 to 0d9ee02)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Don't stale, please! This looks quite close to being ready! (cc @syncdoth - let me know if you need any help with the last bit!)
We are on the verge of releasing the weights! There's been a bit of a delay in the schedule 🥲 The last bit is updating the weight links in the docs and writing the integration tests; we are working hard on it!
…tivation_fn=swish
Removes the accidentally added comma in tests/generation/test_utils.py
When loading the model weights in a dtype other than fp32, `.float()` statements may cause trouble. This commit makes tensor creation and float casting aware of the `dtype` of the weights.
Reason: since we are using GLU, sub-layernorm is not well-defined.
This follows the example of other models, such as LongT5, idefics, llama, etc.
This resolves the comments by @Rocketknight1
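The dtype-aware pattern described in the dtype commit above can be sketched as follows (a hypothetical helper, not the PR's actual code): derive new tensors from an existing parameter's device and dtype rather than hard-coding `.float()`.

```python
import torch

def decay_mask(length: int, gamma: float, like: torch.Tensor) -> torch.Tensor:
    # Build the causal decay mask on the reference tensor's device, then cast
    # to its dtype instead of forcing float32
    idx = torch.arange(length, device=like.device)
    exponents = (idx[:, None] - idx[None, :]).clamp(min=0)
    mask = torch.tril(gamma ** exponents)
    return mask.to(like.dtype)

weight = torch.zeros(2, 2, dtype=torch.float16)  # e.g. a weight loaded in fp16
mask = decay_mask(4, 0.9, weight)
assert mask.dtype == torch.float16
```

Keeping intermediate tensors in the weights' dtype avoids silent fp32 upcasts that both waste memory and break models loaded in fp16/bf16.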
0d9ee02 to
b624d05
Compare
Hi @Rocketknight1, I'm seeing a test failure related to document building, and to running the NucleusXForCausalLM.forward example. It seems that it might be due to
Does it require some tinkering to use? I dumped the source to the model folder, edited the config to treat it as

```
In [7]: print(tokenizer.decode(model.generate(**tokenizer("Hello my name is", return_tensors="pt").to("cuda"), max_new_tokens=20, do_sample=False, forward_mode="parallel").ravel()))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/fella/src/llama/text-generation-webui/models/NucleusAI_Nucleus-X/modeling_nucleus_x.py:370: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  cache = (current_kv, scale, torch.tensor(prev_seqlen + 1, dtype=torch.long))
<s> Hello my name is Tina and I am a 25 year old female. I am a very outgoing person
```

but recurrent, no:

```
In [8]: print(tokenizer.decode(model.generate(**tokenizer("Hello my name is", return_tensors="pt").to("cuda"), max_new_tokens=20, do_sample=False, forward_mode="recurrent").ravel()))
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/fella/src/llama/text-generation-webui/models/NucleusAI_Nucleus-X/modeling_nucleus_x.py:370: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  cache = (current_kv, scale, torch.tensor(prev_seqlen + 1, dtype=torch.long))
<s> Hello my name is the most of the world.
The first thing I noticed was the size of the room. It
```

(Even if I say in config.json to use the recurrent forward mode, a 16KB prompt fails to pass through model.generate unless I use forward_mode='recurrent')
Hi @syncdoth, sorry for the Christmas delay! You're correct, though - the issue is almost certainly caused by the docstring trying to load a model too big for the test runner. Is there any smaller checkpoint we can use? You could also try
Haha, please don't stale this! We are still working hard to put out the model. We are working on a small model to pass the PR requirements, but it has been a lower priority, unfortunately :( We will finish this by mid-February!
No worries 🤗
What does this PR do?
This PR adds a new model named NucleusX. This model is contributed by Sehyun Choi and NucleusAI. The model is based on the Retentive Network architecture, and the code is largely adapted from this repo, which in turn borrows core implementations from torchscale. We are planning to release our paper and weights soon.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
We kindly request the review of this new model from @ArthurZucker and @younesbelkada!
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.