Is your feature request related to a problem? Please describe.
Add transformer architectures that can be used to process multi-modal data (such as vision and language). The transformer block can also be used to build cross-attention modules.
Describe the solution you'd like
The architecture can be imported from Hugging Face (for example, via its `transformers` library).
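To make the cross-attention use case more concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention, where language tokens act as queries and vision tokens supply the keys and values. It is an illustration only, not the proposed implementation: the names `cross_attention`, `text_tokens`, and `image_patches` are hypothetical, and the learned query/key/value projections, multiple heads, and residual/normalization layers that a real transformer block would have are deliberately omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    queries come from one modality (e.g. language tokens);
    keys and values come from the other (e.g. vision tokens).
    No learned projections are applied in this sketch.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(value_dim)
        ]
        output.append(attended)
    return output


# Toy inputs (made up for illustration): 3 language tokens and 2 vision tokens, dim 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0]]

# Each language token attends over the vision tokens.
fused = cross_attention(text_tokens, image_patches, image_patches)
```

The output has one row per query token, so `fused` keeps the language sequence length while mixing in vision features. A framework-based block would typically add learned projections and multiple heads on top of this core computation.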