I want to use BackPACK to compute per-sample gradients and am trying to understand the challenges of using it with a custom model built from PyTorch nn layers. For example, an architecture like this one: https://github.com/codingchild2424/MonaCoBERT/blob/master/src/models/monacobert.py
Some of the basic layers used for computing attention:
self.query = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.key = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.value = nn.Linear(hidden_size, self.all_head_size, bias=False)
The model also has a trainable nn.Parameter:
self.gammas = nn.Parameter(torch.zeros(self.num_attention_heads, 1, 1))
And some convolutional layers.
What challenges might I face when using a model like this, and what are potential solutions? Is LayerNorm supported yet?
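For context, here is a minimal sketch of the quantity I mean by "per-sample gradients" for one of the projection layers quoted above, computed naively with one backward pass per sample (the 512 -> 256 dimensions are assumed from the comments; my understanding is that BackPACK's BatchGrad extension produces the same quantity in a single pass):

```python
import torch
import torch.nn as nn

# Dimensions assumed from the comments on the quoted layers (512 -> 256).
hidden_size, all_head_size = 512, 256
query = nn.Linear(hidden_size, all_head_size, bias=False)

# Naive reference: one backward pass per sample in the batch.
X = torch.randn(4, hidden_size)
per_sample = []
for i in range(X.shape[0]):
    query.zero_grad()
    query(X[i : i + 1]).sum().backward()
    per_sample.append(query.weight.grad.clone())

# One weight gradient per sample: shape (batch, out_features, in_features).
grads = torch.stack(per_sample)
```

This loop is only a correctness reference; it is far too slow for real training, which is why a vectorized single-pass approach matters.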