Description
It seems that CLIP-SF and BLIP-SF have not been trained with the fusion weights w1, w2, w3, and w4.
In UniIR/src/models/uniir_clip/clip_scorefusion/clip_sf.py, the encoding code is as follows:
    def encode_text(self, text_tensor):
        return self.clip_model.encode_text(text_tensor)

    def encode_image(self, image_tensor):
        return self.clip_model.encode_image(image_tensor)

    def fuse_embeddings(self, img_emb, txt_emb):
        fused_emb = img_emb + txt_emb
        return fused_emb

    def encode_multimodal_input(self, txt_tensor, img_tensor, txt_mask, img_mask):
        """
        :param txt_tensor:
        :param img_tensor:
        :param txt_mask: expected shape: [batch_size, 1]
        :param img_mask: expected shape: [batch_size, 1]
        :return:
        """
        txt_emb = self.encode_text(txt_tensor) * txt_mask.unsqueeze(-1)
        img_emb = self.encode_image(img_tensor) * img_mask.unsqueeze(-1)
        return self.fuse_embeddings(txt_emb, img_emb)  # shape: [batch_size, embed_dim]
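As shown above, fuse_embeddings is a plain sum of the two embeddings, with no learnable weights involved. For reference, below is a minimal sketch of what score-level fusion with trainable w1, w2, w3, w4 could look like, assuming the weights scale the four text/image similarity terms between a query and a candidate; the class name WeightedScoreFusion and its interface are hypothetical and not code from the UniIR repo:

```python
import torch
import torch.nn as nn


class WeightedScoreFusion(nn.Module):
    """Hypothetical sketch: score-level fusion with learnable weights w1..w4.

    Not code from clip_sf.py; it only illustrates where trainable fusion
    weights would enter the query-candidate score.
    """

    def __init__(self):
        super().__init__()
        # Learnable fusion weights. With all four fixed at 1, the score below
        # equals the dot product of the summed embeddings, i.e. the behaviour
        # of fuse_embeddings as quoted above.
        self.weights = nn.Parameter(torch.ones(4))

    def forward(self, q_txt, q_img, c_txt, c_img):
        # All embeddings: [batch_size, embed_dim]; each score term: [batch_size].
        s_tt = (q_txt * c_txt).sum(dim=-1)
        s_ti = (q_txt * c_img).sum(dim=-1)
        s_it = (q_img * c_txt).sum(dim=-1)
        s_ii = (q_img * c_img).sum(dim=-1)
        w1, w2, w3, w4 = self.weights
        return w1 * s_tt + w2 * s_ti + w3 * s_it + w4 * s_ii
```

Note that with w1 = w2 = w3 = w4 = 1, this score reduces to (q_txt + q_img) · (c_txt + c_img), which is exactly the dot product of the embeddings produced by fuse_embeddings above, so if the weights were never trained the two formulations would be indistinguishable.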