add open-llama model with ckpt #22795
Conversation
The documentation is not available anymore as the PR was closed or merged.

cc @ArthurZucker and @younesbelkada

Please help me review this pull request. @ArthurZucker @younesbelkada
Hey! Thanks, will review now
ArthurZucker
left a comment
Thanks for working on this! Seems like the model is overall very similar, so it's missing a bunch of `# Copied from` statements here and there. Most importantly, I don't think we need a new tokenizer; it's still the Llama tokenizer.
Not convinced that you need a new configuration file either. Args can be added kind of on the fly and don't need to be in the default Llama config. WDYT?
I'm concerned that using the default LlamaConfig directly may result in missing parameters and cause errors.
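One way to reconcile the two concerns is a thin config subclass that pins explicit defaults for the new arguments. The following is a hypothetical sketch of that pattern (class and parameter names are illustrative, not the PR's actual code; the real `PretrainedConfig` in Transformers similarly stores extra kwargs as attributes):

```python
# Hypothetical sketch: a minimal base config that stores unknown kwargs
# "on the fly", plus a subclass that pins explicit defaults for the
# Open-Llama-specific options so they can never be missing at load time.
class LlamaConfig:
    def __init__(self, hidden_size=4096, num_hidden_layers=32, **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        for key, value in kwargs.items():  # extra args accepted on the fly
            setattr(self, key, value)


class OpenLlamaConfig(LlamaConfig):
    def __init__(
        self,
        use_memory_efficient_attention=True,
        hidden_dropout_prob=0.1,
        attention_dropout_prob=0.1,
        use_stable_embedding=True,
        shared_input_output_embedding=True,
        **kwargs,
    ):
        self.use_memory_efficient_attention = use_memory_efficient_attention
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_dropout_prob = attention_dropout_prob
        self.use_stable_embedding = use_stable_embedding
        self.shared_input_output_embedding = shared_input_output_embedding
        super().__init__(**kwargs)
```

With explicit defaults, `OpenLlamaConfig()` always exposes the new attributes, whereas a plain `LlamaConfig` only carries them if the caller happens to pass them.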
missing copied from statements
Sorry, I didn't quite understand how to add the `# Copied from` statements for this class; there are slight differences here.
Ok you can keep it as is!
ArthurZucker
left a comment
LGTM, waiting for @sgugger's review
Same comment here, is this not the same as in the llama folder?
Thank you for the reminder. This file is identical to the one in Llama, and since I trained directly with Transformers, there is no need for any conversion. I will delete it.
```python
loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    shift_logits = shift_logits.view(-1, self.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = loss_fct(shift_logits, shift_labels)
```
Suggested change:

```python
lm_loss = None
if labels is not None:
    # we are doing next-token prediction; shift prediction scores and input ids by one
    shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
```
We usually just use this, but I am guessing the point of the PR is fast / model parallelism, so ignore my comment if this doesn't work (we leave parallelism to accelerate).
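As an illustration of what the shift does (a toy sketch, not code from the PR): for next-token prediction, position i's logits are scored against the token at position i+1, so both tensors lose one time step before the cross-entropy:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 5
logits = torch.randn(2, 4, vocab_size)          # (batch, seq_len, vocab)
labels = torch.randint(0, vocab_size, (2, 4))   # (batch, seq_len)

# Shift so that tokens < n predict n: drop the last logit and the first label.
shift_logits = logits[..., :-1, :].contiguous()  # (2, 3, vocab)
shift_labels = labels[..., 1:].contiguous()      # (2, 3)

loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
```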
The model is mainly based on LLaMA with some modifications, incorporating memory-efficient attention from Xformers, stable embedding from BLOOM, and shared input-output embedding from PaLM.
The model is pre-trained on both Chinese and English, which gives it better performance on Chinese language tasks.
If you have them, would be cool to add the performance gains here!
This is a great suggestion, but I have not yet conducted a complete ablation experiment. I plan to add the results to the documentation gradually after running the experiments.
sgugger
left a comment
Very clean, thanks a lot for adding this! I have just a comment on the config and default checkpoint.
```python
warnings.warn(
    "Xformers is not installed correctly. If you want to use memory_efficient_attention to "
    "accelerate training, use the following command to install Xformers:\npip install xformers."
)
```
Should use our logger here with logger.warn (so move this after the logger is defined below).
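The pattern being suggested looks roughly like this (a sketch using the stdlib logger for self-containedness; Transformers has its own `logging.get_logger` wrapper that would be used in the actual file):

```python
import logging

# Define the module-level logger first, then warn through it.
logger = logging.getLogger(__name__)

try:
    from xformers import ops as xops  # optional dependency
except ImportError:
    xops = None
    logger.warning(
        "Xformers is not installed correctly. If you want to use "
        "memory_efficient_attention to accelerate training, install it with "
        "`pip install xformers`."
    )
```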
```python
"VisionEncoderDecoderConfig",
"VisionTextDualEncoderConfig",
"LlamaConfig",
"OpenLlamaConfig",
```
Should be removed as there is a checkpoint for OpenLlama.
```python
r"""
This is the configuration class to store the configuration of a [`OpenLlamaModel`]. It is used to instantiate an
Open-Llama model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Open-Llama-7B.
```
Put the full checkpoint name here and link to the Hub. Example we have for GPT-2:
a similar configuration to that of the [gpt2](https://huggingface.co/gpt2) architecture.
It wasn't there for Llama since there is no official checkpoint on the Hub.
Thank you for the review. The three issues mentioned have been fixed.
Thanks a lot for your contribution!

@s-JoL I noticed that the links pertaining to Open-LLaMA are currently leading to 404 errors. Could you please provide some information on what might have happened?

@s-JoL Hi, I can't find an Open-LLaMA checkpoint and I noticed you deleted your original repo. What happened? How can I try Open-LLaMA?
* update Open-Llama model
* update
* update format
* update doc
* update
* update stable embedding test
* update test case
* update format
* update readme
* fix typo
* update name
* remove tokenizer and update format
* remove convert_open_llama_weights_to_hf
* update warning and doc_string

---------

Co-authored-by: songliang.bayesian <songliang.bayesian@bytedance.com>
@heya5 Possibly due to some controversies surrounding this project, the original author has closed the original project.
This reverts commit c2c99dc.

This PR adds a new model called Open-Llama, which is based on Llama's implementation in Transformers.
In Open-Llama, memory-efficient attention has been added, resulting in a 30% improvement in training efficiency. Additionally, hidden dropout and attention dropout have been added for better generalization during training.
We have also added two optional features: stable embedding from BLOOM and shared input-output embeddings from PaLM, which have been tested and found to improve training stability and performance.
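These two options can be sketched as follows (an illustrative sketch, not the PR's exact modules): stable embedding is a token embedding followed by LayerNorm as in BLOOM, and weight sharing reuses the input embedding matrix to produce the output logits as in PaLM.

```python
import torch
import torch.nn as nn

class StableEmbedding(nn.Module):
    """Token embedding followed by LayerNorm (BLOOM's embedding-LayerNorm trick)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.norm(self.embed(input_ids))


vocab_size, hidden_size = 100, 32
emb = StableEmbedding(vocab_size, hidden_size)
input_ids = torch.randint(0, vocab_size, (2, 8))
hidden = emb(input_ids)                   # (2, 8, 32)

# Shared input-output embedding: reuse the same weight matrix for the LM head
# instead of learning a separate output projection.
logits = hidden @ emb.embed.weight.T      # (2, 8, 100)
```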
The following code snippet shows the implementation of memory-efficient attention:
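A sketch of that integration (illustrative only; the function name and the eager fallback are my assumptions, not the PR's exact code):

```python
import math
import torch
import torch.nn.functional as F

try:
    from xformers import ops as xops
except ImportError:
    xops = None  # xformers is optional; fall back to the standard implementation


def causal_attention(q, k, v, dropout_p=0.0):
    """q, k, v: (batch, seq_len, num_heads, head_dim)."""
    if xops is not None:
        # Memory-efficient kernel: never materializes the full attention matrix.
        return xops.memory_efficient_attention(
            q, k, v, attn_bias=xops.LowerTriangularMask(), p=dropout_p
        )
    # Eager fallback: standard scaled dot-product attention with a causal mask.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    causal = torch.triu(
        torch.ones(scores.shape[-2:], dtype=torch.bool, device=scores.device), 1
    )
    scores = scores.masked_fill(causal, float("-inf"))
    attn = F.dropout(scores.softmax(dim=-1), p=dropout_p)
    return (attn @ v).transpose(1, 2)
```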
At the same time, for maximum compatibility, we have made xformers an optional dependency so that the original implementation can still be used for training and inference if it is not installed.
We implemented pre-training of the Llama model based on transformers + accelerate, incorporating the modifications described above.
Open-Llama
The pre-trained model has already been open-sourced on s-JoL/Open-Llama-V1.
ref: #22386
cc: @sgugger