Description
Residual Attention: A Simple but Effective Method for Multi-Label Recognition
According to the paper, the base_logit (denoted as g in the paper) should be computed as the global feature vector, obtained by averaging the features over all spatial locations. This is stated in the equation:

$g = \frac{1}{49} \sum_{k=1}^{49} x_k$

Here, $x_k$ represents the feature at location $k$, and we sum over all locations (49 in this case) and then take the average. This operation is class-agnostic: it is not specific to any class and is the same for all classes. The global feature vector g represents the overall content of the image, irrespective of specific classes, and serves as a baseline representation of the image content.
In my implementation, I have this:
```python
def forward(self, x):
    B, _, H, W = x.size()  # batch size, _, height, width

    # Compute class-specific attention scores
    logits = self.classifier(x)          # size: (B, C, H, W)
    logits = logits.view(B, self.C, -1)  # size: (B, C, H*W)

    # Flatten the spatial dimensions of the features
    x_flatten = x.view(B, self.d, -1)    # size: (B, d, H*W)

    # Compute the global feature vector
    g = torch.mean(x_flatten, dim=2)     # size: (B, d)
```

I am computing base_logit (or g) as per the paper's method. The original implementation seems to compute something different for base_logit, which doesn't align with the paper's description: it computes the average class-specific score for each class across all spatial locations, which is not what g represents according to the paper.
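As a side note, the flatten-then-mean used above is just global average pooling; a minimal sketch (with arbitrary sizes `d=8`, `H=W=7`) confirming the two are numerically identical:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d, H, W = 2, 8, 7, 7
x = torch.randn(B, d, H, W)

# g as the spatial mean of the features, as in the paper's equation
g = x.view(B, d, -1).mean(dim=2)                 # size: (B, d)

# the same quantity via global average pooling
g_pool = F.adaptive_avg_pool2d(x, 1).flatten(1)  # size: (B, d)

print(torch.allclose(g, g_pool))
```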
This is the original implementation:
```python
def forward(self, x):
    # x: (B, d, H, W)
    # normalize the classifier weights, giving score: (B, C, H, W)
    score = self.head(x) / torch.norm(self.head.weight, dim=1, keepdim=True).transpose(0, 1)
    score = score.flatten(2)               # size: (B, C, H*W)
    base_logit = torch.mean(score, dim=2)  # size: (B, C)
```

Is there a reason for this?
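One thing worth noting: if the head is linear with no bias (a 1×1 convolution here), averaging the per-location scores gives the same result as applying the head to the averaged feature g, since the mean commutes with a linear map (and the per-class weight normalization is a constant per class, so it commutes too). A quick sanity check of that equivalence, with arbitrary sizes (`d=4`, `C=3`) and a bias-free 1×1 conv standing in for the head:

```python
import torch

torch.manual_seed(0)
B, d, H, W, C = 2, 4, 7, 7, 3
x = torch.randn(B, d, H, W)
head = torch.nn.Conv2d(d, C, kernel_size=1, bias=False)

# classify then average: per-location scores, mean over spatial positions
score = head(x).flatten(2)          # size: (B, C, H*W)
base_logit = score.mean(dim=2)      # size: (B, C)

# average then classify: global feature g, then the same linear weights
g = x.flatten(2).mean(dim=2)                            # size: (B, d)
g_logit = g @ head.weight.squeeze(-1).squeeze(-1).t()   # size: (B, C)

print(torch.allclose(base_logit, g_logit, atol=1e-6))
```

So the two formulations may differ only in where the averaging happens, not in the logit they produce.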