Update: Fast GeLU Approximation #744
Conversation
```diff
         - https://arxiv.org/pdf/1606.08415.pdf
     """
-    return x * mx.sigmoid(1.773 * x)
+    return x * mx.sigmoid(1.702 * x)
```
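For reference, here is a minimal sketch of the two forms in question, the exact GELU and the sigmoid-based fast approximation from the linked paper (arXiv 1606.08415). The function names are illustrative, not part of the MLX API:

```python
import math
import mlx.core as mx

def gelu_exact(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF written via erf.
    return x * 0.5 * (1 + mx.erf(x / math.sqrt(2.0)))

def gelu_fast(x):
    # Sigmoid approximation from Hendrycks & Gimpel (arXiv:1606.08415):
    # GELU(x) ~ x * sigmoid(1.702 * x)
    return x * mx.sigmoid(1.702 * x)
```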
IIRC @angeloskath added this, maybe you had another implementation in mind?
@nkasmanoff could you check the tests that failed?
Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
@awni looks like they passed now? I committed your suggestion and the tests re-ran, so I couldn't see what the last failure was.
So I can add some context to this and then we can choose what to do. The […]

Now having said that, and given the fact that using […], wdyt?
My impression is we want the 1.702 for the fast approximation, only to ensure consistency with the MLX adaptations of models made in transformers.

My only concern is that if we keep 1.773, the vision encoder in LLaVA has seemingly worse performance when asked about images, and it also fails the tests @mzbac set up against the Transformers implementation of LLaVA.
Yeah I agree, I think we should change it. Just to be clear though, one could always write a simple one-line activation function; there is no need to use the built-in one.
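For example, something along these lines (name hypothetical) keeps the old behaviour for anyone who still wants the 1.773 constant:

```python
import mlx.core as mx

# Hypothetical one-line activation for models tuned against the old constant.
def gelu_sigmoid_1773(x):
    return x * mx.sigmoid(1.773 * x)
```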
I think in its current form it's a bit of a trap, since it sounds like the fast GELU that has become somewhat standard. I would probably change it and encourage people to use the regular one.

I've been under the impression the sigmoid approximation is the same as the tanh one, just implemented with a sigmoid, but I don't think I ever verified it. Where did that one come from?
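One way to check is to compare the two forms numerically. A rough sketch (grid and range chosen arbitrarily, helper names are not MLX API):

```python
import math
import mlx.core as mx

def gelu_tanh(x):
    # Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
    return 0.5 * x * (1 + mx.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def gelu_sigmoid(x, c):
    # Sigmoid form with a configurable constant (1.702 or 1.773).
    return x * mx.sigmoid(c * x)

x = mx.linspace(-6, 6, 2001)
for c in (1.702, 1.773):
    err = mx.max(mx.abs(gelu_sigmoid(x, c) - gelu_tanh(x))).item()
    print(f"c = {c}: max |sigmoid - tanh| on [-6, 6] = {err:.4f}")
```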

Proposed changes
Update the fast approximation for GeLU activation.
Checklist

Put an `x` in the boxes that apply.

- [ ] I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes