Phi-2 #97
Conversation
@jbarrow If you set your permissions, I can push some updates in a commit to your fork. Basically I added the cache and cleaned up the code a bit (used our built-in "new" GELU). I still am not having success getting reasonable outputs from the model, even in fp32. Also, I wasn't able to reproduce the final layer outputs you are getting in the .txt files. Maybe you could share more details on how you got those... Still seems like there is a bug somewhere.
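For context, the cache in question is the standard key/value cache for autoregressive decoding: keep the keys and values from earlier positions so each new step only has to project the newest token. A minimal MLX-style sketch of the idea (the function name, shapes, and return convention are illustrative assumptions, not the actual commit):

```python
import math
import mlx.core as mx

def attend_with_cache(queries, keys, values, cache=None):
    # Hypothetical helper: concatenate previously computed keys/values along
    # the sequence axis so a decoding step only projects the newest token.
    if cache is not None:
        key_cache, value_cache = cache
        keys = mx.concatenate([key_cache, keys], axis=2)        # (B, n_heads, T, head_dim)
        values = mx.concatenate([value_cache, values], axis=2)
    scale = 1.0 / math.sqrt(queries.shape[-1])
    scores = (queries * scale) @ keys.transpose(0, 1, 3, 2)
    weights = mx.softmax(scores, axis=-1)
    # Return the updated cache so the caller can pass it to the next step.
    return weights @ values, (keys, values)
```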
I have "Allow edits by maintainers" checked -- is there any other permission I'm missing? To get the outputs, I just ran `print(model(**tokens))`. As for the "new" GELU, is it the …
That's what I meant. I must have been typing the wrong thing last night 😪 Hey, I just pushed the commit; it does switch to …
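For reference, the "new" GELU referred to here is the tanh approximation that Hugging Face's `NewGELUActivation` computes; it is simple enough to write directly. A minimal sketch in MLX (assuming `mlx.core` as `mx`):

```python
import math
import mlx.core as mx

def new_gelu(x: mx.array) -> mx.array:
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + mx.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```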
So, I believe I've identified the source of the error, but the correction will require more time than I have this morning (I'll get back to it this evening). I believe the rotary embedding implementations are different between MLX and the Phi-2 repo. The Phi-2 implementation is here: https://huggingface.co/microsoft/phi-2/blob/main/modeling_phi.py#L171 I was looking at the weights from the attention heads in the first attention layer, and the differences pop up pre-/post-rotary embedding. The outputs below are the first 5 values at each of the 23 token positions for the input prompt.
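To make the possible mismatch concrete: the two common RoPE conventions pair dimensions differently (consecutive even/odd pairs vs. first half/second half), and applying one where the other is expected silently changes the attention scores. A small numpy sketch, purely for illustration (the rotary dimension of 32 is an assumption; the 23 positions match the prompt length mentioned above):

```python
import numpy as np

def rope_interleaved(x, base=10000.0):
    # "Traditional" RoPE: rotate consecutive pairs (x[..., 0::2], x[..., 1::2]).
    T, D = x.shape
    inv_freq = 1.0 / (base ** (np.arange(0, D, 2) / D))
    theta = np.outer(np.arange(T), inv_freq)            # (T, D/2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_half_split(x, base=10000.0):
    # "Rotate-half" RoPE (GPT-NeoX style): pair x[..., :D/2] with x[..., D/2:].
    T, D = x.shape
    inv_freq = 1.0 / (base ** (np.arange(0, D, 2) / D))
    theta = np.outer(np.arange(T), inv_freq)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, : D // 2], x[:, D // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Same input, two conventions: the results differ, which is exactly the kind
# of divergence that shows up post-rotary-embedding in the per-layer dumps.
x = np.random.randn(23, 32).astype(np.float32)   # 23 tokens, assumed rotary_dim = 32
print(np.abs(rope_interleaved(x) - rope_half_split(x)).max())
```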
awni left a comment
This is looking really good, and works really well for me, merging!!
How long did conversion of the weights take? It's going slowly for me (keeps fluctuating, but up to a 1 hr estimate).
Maybe what's slow is your download time? You might need a faster internet connection :) The conversion itself should be fast once the model is downloaded (<< 1 minute). But let me know if you run into trouble there.
That might be the download time? It's a 5GB download from the Hugging Face hub. Once the weights are cached, it takes maybe 20s for conversion for me?
I have phi-2 downloading from Hugging Face, just letting it run; my internet is very slow -- I'm on Starlink and currently on a Google Meet call hammering my network 😂 It's getting there, not a big deal.

Currently this is a draft -- the `generate` function does not work and has dropped all caching. I'll get back to this tomorrow before/after work, but feel free to modify and make any necessary changes!

The Phi-2 transformer model is really interesting; it required a few modifications:

- a `NewGELUActivation` function
- a `ParallelBlock`, which uses a single `LayerNorm` and combines residuals, attention outputs, and ff outputs at the end
- `RoPE` positional embeddings

Loading the model requires breaking the `Wqkv` matrices into a `Wq`, `Wk`, and `Wv` matrix (though it's possible to reimplement the attention without doing that, I suppose). Loading the model at all requires `einops` installed.

I've tested that the forward pass of the model lines up with the 🤗 implementation. But for generate, I need to put in the kv-caching (which might mean removing the `TransformerDecoder` implementation altogether, which would mean updating the `convert.py` script).

I think it's close to being there, but will require (a) some care to get generation working right, and (b) some care to get fast inference on a MacBook. But very excited to run a good model locally on a MacBook Air. 😄
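For reference, a minimal sketch of the `Wqkv` split described above, assuming the fused projection simply stacks the query, key, and value weights along the output dimension (the actual checkpoint layout, e.g. any per-head interleaving, should be verified against the 🤗 weights before relying on this):

```python
import numpy as np

def split_wqkv(wqkv: np.ndarray, wqkv_bias: np.ndarray):
    # Assumes wqkv has shape (3 * model_dim, model_dim) in PyTorch's (out, in)
    # layout, with Q, K, and V stacked in that order along the output axis.
    wq, wk, wv = np.split(wqkv, 3, axis=0)
    bq, bk, bv = np.split(wqkv_bias, 3, axis=0)
    return (wq, bq), (wk, bk), (wv, bv)
```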