Some questions about the paper.

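Hello, author:
I read your paper a few days ago and found it very clearly written. My background is in NLP, so I ran into a few questions while reading. If you have time, could you help answer them? Thank you very much.
1. Do Q, K, and V represent different things in the Scale Path and the Strip Path? For example, in the Scale Path the visual feature Vi is Q, while in the Strip Path the text feature L is Q. Also, why does the Strip Path apply QKV attention twice, with the modalities assigned differently each time? My understanding is that cross-modal interaction usually uses the image (noise) as Q and the text as K and V, although the reverse also exists, so I am not sure about this point.
2. Section 3.3 of the paper says: "In our ClipSAM framework, SAM is configured to produce three masks with varying confidence scores for each box." Why are there three masks? My understanding is that the image produced by the trained UMCI is binarized, and the points and boxes are then selected from it. Does "mask" here refer to the regions with value 1, i.e. are the three regions with the highest confidence selected? I did not quite follow this part.
3. Is the choice of two kernels in the Scale Path mainly meant to keep it consistent with the Strip Path? Thanks again.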
![image](https://github.com/Lszcoding/ClipSAM/assets/72113285/c00ea04e-31b3-4445-9d81-1927bfc39741)
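
To make question 1 concrete, the two Q/K/V arrangements it mentions can be illustrated with a minimal single-head cross-attention sketch. This is not the paper's implementation: the function names, the toy feature values, and the absence of learned projections are all assumptions made for illustration. It only shows that the output has one row per query token, so the modality used as Q decides which feature map gets updated.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention without learned projections (an assumption).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: 4 visual tokens and 2 text tokens, dimension 3.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
text = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# Visual features as Q, text as K and V: one output row per visual token.
vis_as_q = cross_attention(visual, text, text)
# Text features as Q, visual as K and V: one output row per text token.
txt_as_q = cross_attention(text, visual, visual)
```

With visual features as Q the result keeps the visual token count, and with text as Q it keeps the text token count, which is the difference between the two arrangements that the question asks about.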

