models : move the token embedding norms to the first layer#20943
Conversation
JohannesGaessler left a comment
I think it's fine to move the embedding norm to the first layer and even preferable as it will probably be marginally faster. However, I am concerned that this is indicative of a deeper issue since I don't understand why the tensor -> backend assignment would be different in the first place.
Are you referring to the GET_ROWS issue that I mentioned in #20503 (comment)? I am still not sure, but it seems like we are taking different paths in the logic for assigning the tensors depending on whether we read them from a file or not. Investigating.
No, I mean that if we save and then load the model again, we would expect the tensor assignments to remain the same (since my understanding is that this is changing somehow?). To me it would suggest that …
What I am seeing is that saving the model and then loading it works exactly like a normal model of the given architecture. The tensors are being assigned in the same way - according to the … The difference comes from the in-memory model - in this case, all tensors are always put on the device. It's as if the …
Commits:
* models : move the token embedding norms to the first layer
* cont : fix LLM_TENSOR_CONV1D + fix il indexing
Overview
We were keeping the token embedding norms on the input layer buffers. This results in the associated operations being performed on the CPU:
With this change, we assign the norms to the first layer buffer, so they can be offloaded to the GPU when possible:
Additional information
Keeping the norms on the input buffers is causing some of the roundtrip discrepancies found in #20503 (comment).