# disable rope scaling for training, add yarn during export #917
Base branch: haoguo/eagle-export
```diff
@@ -175,6 +175,15 @@ def _get_config_from_draft_or_base(key: str, model: nn.Module):
         if self.hf_quant_config is not None:
             template_config["quantization_config"] = self.hf_quant_config

+        # For long context quality, we disable rope scaling for training
+        # and set yarn during export for inference.
+        template_config["rope_scaling"] = {
+            "rope_type": "yarn",
+            "rope_theta": 10000,
+            "factor": 32.0,
```
**Contributor:** I'm not sure these are the best choices for rope theta and factor. I think these might depend on how long `max_position_embeddings` actually is. Some testing may be required. GPT-OSS uses rope theta 150k, for example. This may be some tradeoff between short-context and long-context accuracy.

**Contributor (Author):** theta=10k is the default from HF: ref. Actually my guess is that it should match the theta used in training. factor should be a tradeoff, I think.
```diff
+            "original_max_position_embeddings": model.eagle_train_length,
+        }
+
         return template_config

     def export_quant_config(self):
```
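As an aside on the `factor` value discussed above: a common way to pick the YaRN `factor` is the ratio of the desired inference context length to the training context length (`original_max_position_embeddings`). This is an illustrative sketch, not code from this PR; the concrete lengths in the example are assumptions.

```python
# Illustrative sketch (not from this PR): YaRN's "factor" is typically the
# ratio of the target inference context length to the context length the
# model was trained at ("original_max_position_embeddings").
def yarn_factor(target_max_positions: int, train_max_positions: int) -> float:
    """Scaling factor so rotary positions cover the target length."""
    assert target_max_positions >= train_max_positions
    return target_max_positions / train_max_positions

# e.g. factor = 32.0 corresponds to extending a 4k-token training
# length to a 128k-token inference window.
print(yarn_factor(131072, 4096))  # -> 32.0
```

Under this reading, hard-coding `factor: 32.0` only makes sense if the export-time target length is always 32x `model.eagle_train_length`, which is presumably the tradeoff the reviewer is pointing at.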
```diff
@@ -19,14 +19,7 @@
     "hidden_act": "silu",
     "torch_dtype": "bfloat16",
     "position_embedding_type": "rope",
-    "rope_scaling": {
-        "factor": 8.0,
-        "low_freq_factor": 1.0,
-        "high_freq_factor": 4.0,
-        "original_max_position_embeddings": 8192,
-        "rope_type": "llama3",
-    },
-    "rope_theta": 500000.0,
+    "rope_scaling": {"rope_type": "default", "rope_theta": 10000},
```
|
**Contributor:** I think you can go further and actually just set rope scaling to null. Not sure if there's a difference in HF.

**Contributor (Author):** Setting it to None triggers an error. We are using the Llama definition from transformers 5.0 and it requires a rope type. `"rope_type": "default"` here will use the traditional rope without scaling.

**Contributor (Author):** This is a transformers version difference. In transformers 4.x it's OK to leave it None. We are using transformers 5 here.
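For reference on what "traditional rope without scaling" means here: `rope_type: "default"` selects the standard unscaled rotary inverse-frequency schedule. A minimal sketch (illustrative only, not transformers code):

```python
# Minimal sketch of unscaled ("default") RoPE inverse frequencies:
# inv_freq[i] = 1 / theta^(2i / head_dim), with no yarn/llama3 rescaling.
def rope_inv_freq(head_dim: int, theta: float = 10000.0) -> list[float]:
    return [1.0 / (theta ** (2 * i / head_dim)) for i in range(head_dim // 2)]

freqs = rope_inv_freq(8, theta=10000.0)
print(freqs[0])  # -> 1.0 (the fastest-rotating dimension)
```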
```diff
     "num_hidden_layers": 1,
     "intermediate_size": 14336,
     "num_attention_heads": 32,
```

```diff
@@ -83,15 +76,9 @@
     "qk_rope_head_dim": 64,
     "rms_norm_eps": 0.00001,
     "rope_scaling": {
-        "beta_fast": 1.0,
-        "beta_slow": 1.0,
-        "factor": 64.0,
-        "mscale": 1.0,
-        "mscale_all_dim": 1.0,
-        "original_max_position_embeddings": 4096,
-        "type": "yarn",
+        "rope_type": "default",
+        "rope_theta": 10000,
     },
-    "rope_theta": 50000.0,
     "routed_scaling_factor": 2.827,
     "scoring_func": "sigmoid",
     "seq_aux": True,
```
**Contributor:** Pretty sure rope theta goes on the main config and not the rope scaling, and should be set the same for training/inference. Where did this template come from?

**Contributor (Author):** This is also a version difference. In transformers 4.x it stays outside, while in transformers 5 it has to be in the `rope_scaling` field.
Ref:
transformers 4.55: https://github.com/huggingface/transformers/blob/v4.55-release/src/transformers/modeling_rope_utils.py#L110
5.x:
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py#L634
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L111
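To make the layout difference discussed in this thread concrete, here is a sketch of migrating a config dict from the 4.x layout (top-level `rope_theta`) to the 5.x layout (`rope_theta` nested inside `rope_scaling`). The helper itself is hypothetical; only the key names follow the HF config convention.

```python
# Illustrative sketch (not part of the PR): move a top-level rope_theta
# into rope_scaling, converting a transformers-4.x-style config dict
# to the 5.x-style layout described in the review thread above.
def nest_rope_theta(config: dict) -> dict:
    config = dict(config)  # shallow copy; don't mutate the caller's dict
    scaling = dict(config.get("rope_scaling") or {"rope_type": "default"})
    if "rope_theta" in config:
        scaling["rope_theta"] = config.pop("rope_theta")
    config["rope_scaling"] = scaling
    return config

old_style = {"rope_theta": 50000.0, "rope_scaling": {"rope_type": "default"}}
print(nest_rope_theta(old_style))
# -> {'rope_scaling': {'rope_type': 'default', 'rope_theta': 50000.0}}
```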