Hi, thank you for your great work on TimeMarker — it’s a very inspiring paper!
I noticed that the paper adopts temporal separator tokens to distinguish between frames. I’m curious:
Have you experimented with directly adding learnable temporal embeddings to each frame instead?
If so, how does it compare with the separator tokens in terms of performance and temporal coherence?
I’m wondering whether direct temporal embeddings might offer advantages, or whether the separator token provides better flexibility or generalization.
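To make the comparison concrete, here is a rough sketch of the two options as I understand them. It is plain Python on nested lists, not the TimeMarker code: the function names, the `("sep", t)` marker, and the per-index embedding table are all my own assumptions for illustration, and the table values are fixed numbers standing in for learned parameters.

```python
# Hypothetical illustration only; none of these names come from the paper or its code.
# A "frame" is a list of visual tokens, and a token is a list of floats.

def add_temporal_embeddings(frames, temporal_table):
    """Option A: add the embedding for frame index t to every token of frame t.

    The sequence length is unchanged; temporal information is mixed into
    the visual features themselves.
    """
    out = []
    for t, frame in enumerate(frames):
        emb = temporal_table[t]  # one vector per frame index (learned in practice)
        out.append([[x + e for x, e in zip(token, emb)] for token in frame])
    return out


def interleave_separators(frames):
    """Option B: leave visual tokens untouched and insert a separator before each frame.

    The sequence grows by one item per frame; temporal information lives in
    the separator items rather than in the visual features.
    """
    seq = []
    for t, frame in enumerate(frames):
        seq.append(("sep", t))  # placeholder for a temporal separator token
        seq.extend(("vis", token) for token in frame)
    return seq


# Two frames, two tokens per frame, feature dimension 2.
frames = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
temporal_table = [[0.1, 0.1], [0.2, 0.2]]

with_embeddings = add_temporal_embeddings(frames, temporal_table)
with_separators = interleave_separators(frames)

print(sum(len(f) for f in with_embeddings))  # 4 tokens, same as the input
print(len(with_separators))                  # 6 items: 4 visual tokens + 2 separators
```

One difference this makes visible: the embedding table in option A needs an entry for every frame index it will ever see, whereas option B only adds items to the sequence, which is part of why I am asking about generalization.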
Looking forward to your insights!