I want to use BackPACK to compute per-sample gradients and am trying to understand the challenges of using it with a custom model built from PyTorch nn layers. For example, an architecture like this one: https://github.com/codingchild2424/MonaCoBERT/blob/master/src/models/monacobert.py
Some of the basic layers used for computing attention:
self.query = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.key = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.value = nn.Linear(hidden_size, self.all_head_size, bias=False)
The model also has a trainable nn.Parameter:
self.gammas = nn.Parameter(torch.zeros(self.num_attention_heads, 1, 1))
And some convolutional layers.
What challenges might I face when using a model like this, and what are potential solutions? Is LayerNorm supported yet?
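For context, here is a minimal sketch of the quantity I mean by "per-sample gradients" for one of the projection layers quoted above, computed naively with one backward pass per sample (the 512 -> 256 dimensions are assumed from the comments; my understanding is that BackPACK's BatchGrad extension produces the same quantity in a single pass):

```python
import torch
import torch.nn as nn

# Dimensions assumed from the comments on the quoted layers (512 -> 256).
hidden_size, all_head_size = 512, 256
query = nn.Linear(hidden_size, all_head_size, bias=False)

# Naive reference: one backward pass per sample in the batch.
X = torch.randn(4, hidden_size)
per_sample = []
for i in range(X.shape[0]):
    query.zero_grad()
    query(X[i : i + 1]).sum().backward()
    per_sample.append(query.weight.grad.clone())

# One weight gradient per sample: shape (batch, out_features, in_features).
grads = torch.stack(per_sample)
```

This loop is only a correctness reference; it is far too slow for real training, which is why a vectorized single-pass approach matters.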