I was training a RetNet model using your codebase,
but I found that the word embedding layers are never explicitly initialized.
As a result, the initial loss was far too high (the 7B model's initial loss was 3000+).
I think we need to add word embedding initialization to this codebase.
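
A minimal sketch of one possible fix, assuming the embedding table is a plain PyTorch `nn.Embedding` (which draws its weights from N(0, 1) by default). The helper name `init_word_embedding`, the sizes, and the choice of `std = embed_dim ** -0.5` are illustrative assumptions, not something taken from the codebase:

```python
import torch
import torch.nn as nn


def init_word_embedding(embed: nn.Embedding, embed_dim: int) -> nn.Embedding:
    """Re-initialize an embedding table with std = embed_dim ** -0.5.

    The scaling is an assumption for illustration; the maintainers may
    prefer a different scheme.
    """
    nn.init.normal_(embed.weight, mean=0.0, std=embed_dim ** -0.5)
    if embed.padding_idx is not None:
        # Keep the padding row at zero, as nn.Embedding does by default.
        with torch.no_grad():
            embed.weight[embed.padding_idx].zero_()
    return embed


torch.manual_seed(0)
vocab_size, embed_dim = 1000, 256  # small hypothetical sizes, not the 7B config
embed = nn.Embedding(vocab_size, embed_dim, padding_idx=1)

default_std = embed.weight.std().item()  # PyTorch default: roughly 1.0
init_word_embedding(embed, embed_dim)
scaled_std = embed.weight.std().item()   # roughly embed_dim ** -0.5

print(f"default std: {default_std:.4f}, scaled std: {scaled_std:.4f}")
```

With the scaled initialization the embedding weights start much smaller than the default, which should bring the initial loss down, although I have not verified this on the 7B configuration.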