BackPACK with simple attention and additional layers #326

@nhianK

Description

I want to use BackPACK to compute per-sample gradients, and I am trying to understand the challenges of applying it to a custom model built from PyTorch nn layers. For example, an architecture like this one: https://github.com/codingchild2424/MonaCoBERT/blob/master/src/models/monacobert.py
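
For context, this is the basic per-sample gradient workflow I have in mind, as a minimal sketch on layers BackPACK supports out of the box (the toy model and shapes below are made up):

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

# Toy model built only from standard, supported layers.
model = extend(nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1)))
lossfunc = extend(nn.MSELoss())

X, y = torch.randn(8, 512), torch.randn(8, 1)
loss = lossfunc(model(X), y)

with backpack(BatchGrad()):
    loss.backward()

# Each parameter now carries per-sample gradients in .grad_batch,
# with shape [batch_size, *param.shape].
for name, p in model.named_parameters():
    print(name, p.grad_batch.shape)
```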

Some of the basic layers used for computing attention:

```python
self.query = nn.Linear(hidden_size, self.all_head_size, bias=False)  # 512 -> 256
self.key = nn.Linear(hidden_size, self.all_head_size, bias=False)    # 512 -> 256
self.value = nn.Linear(hidden_size, self.all_head_size, bias=False)
```

The model also has a trainable nn.Parameter:

```python
self.gammas = nn.Parameter(torch.zeros(self.num_attention_heads, 1, 1))
```

And some convolutional layers.
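
My worry is that self.gammas is used in plain tensor arithmetic inside the attention computation rather than through an nn module, so I assume BackPACK's module-level extension would not produce a .grad_batch for it. The only fallback I can think of is a per-example loop (my own sketch, not a BackPACK feature):

```python
import torch

def per_sample_grad(model, lossfunc, X, y, param):
    """Per-sample gradients for one parameter via a batch loop: slow but layer-agnostic."""
    grads = []
    for i in range(X.shape[0]):
        model.zero_grad()
        loss_i = lossfunc(model(X[i : i + 1]), y[i : i + 1])
        loss_i.backward()
        grads.append(param.grad.detach().clone())
    return torch.stack(grads)  # shape: [batch_size, *param.shape]
```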

What challenges might I face with a model like this, and what are potential solutions? Is LayerNorm supported yet?
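
One way I thought of probing the LayerNorm question empirically (a sketch; I am not sure whether BackPACK raises for unsupported layers or just leaves .grad_batch missing):

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

tiny = extend(nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4)))
loss = extend(nn.MSELoss())(tiny(torch.randn(3, 4)), torch.randn(3, 4))

try:
    with backpack(BatchGrad()):
        loss.backward()
    # If this runs, check which parameters actually got per-sample gradients.
    for name, p in tiny.named_parameters():
        print(name, hasattr(p, "grad_batch"))
except NotImplementedError as err:
    print("BatchGrad does not cover this layer:", err)
```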
