[Fix] Deepseek V3 expert bias routing #41647
ArthurZucker
left a comment
Thanks for catching this! Can confirm, we previously gathered on the scores with the index.
ArthurZucker
left a comment
Can you just run `make fix-copies`? That will fix the dependent models.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
[For maintainers] Suggested jobs to run (before merge) run-slow: deepseek_v3, glm4_moe, glm4v_moe
Thanks @ArthurZucker, I just pushed the
They are the ones that inherit from deepseek!
* [Fix] Deepseek V3 expert bias routing
* [Fix] fix-copies
* [Fix] Run make style
What does this PR do?
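By chance we noticed that #40132 seems to have introduced a bug in the Deepseek V3 routing implementation. The Deepseek-V3 technical report explicitly states that the bias term is only used for routing, i.e. for selecting the experts, while the gating value that is multiplied with the expert output is still derived from the original affinity score.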
This was the case in transformers until #40132, which changed the routing code such that the gating values are now derived from the tensor with the added bias term. I wrote a quick fix for the Deepseek-V3 model in this PR, but I am not sure whether other models are also affected. Can you please have a look, @ArthurZucker, and confirm that this is indeed a bug?
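To make the intended behaviour concrete, here is a minimal pure-Python sketch for a single token. It is not the transformers implementation: the helper name `route_token` and its parameters are hypothetical, sigmoid scoring is assumed, and group-limited routing and the routed scaling factor are left out. It only illustrates that the bias affects which experts are chosen, while the gating weights are gathered from the original scores.

```python
import math


def route_token(logits, bias, top_k):
    """Hypothetical helper: choose experts with biased scores, weight them with unbiased scores."""
    # Original (unbiased) affinity score per expert; sigmoid scoring is an assumption here.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias is added only to decide which experts are selected.
    scores_for_choice = [s + b for s, b in zip(scores, bias)]
    # Indices of the top_k experts, ranked by the biased scores.
    topk_idx = sorted(
        range(len(scores)), key=lambda i: scores_for_choice[i], reverse=True
    )[:top_k]
    # Gating weights are gathered from the ORIGINAL scores, then normalized to sum to 1.
    picked = [scores[i] for i in topk_idx]
    total = sum(picked)
    weights = [p / total for p in picked]
    return topk_idx, weights


# Equal logits: the bias alone decides the selection, but must not skew the weights.
idx, weights = route_token(logits=[0.0, 0.0, 0.0, 0.0], bias=[0.0, 1.0, 0.0, 2.0], top_k=2)
print(idx, weights)
```

In this example the bias picks experts 3 and 1, yet both receive the same weight because their original scores are equal. Gathering from the biased tensor instead would give expert 3 a larger weight, which is the behaviour this PR reverts.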
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.