Add multi-modality transformers 

**Is your feature request related to a problem? Please describe.**
Add transformers that can process multi-modal data (e.g. vision and language). The transformer block can also be used to build cross-attention modules.
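
To make the cross-attention use case concrete, here is a minimal sketch in PyTorch. It is not a proposed final design: the class name `CrossAttentionBlock`, the pre-norm layout, and the `dim`, `num_heads`, and `mlp_ratio` parameters are all assumptions made for illustration. Tokens from one modality (e.g. text) act as queries and attend to tokens from another modality (e.g. image patches).

```python
import torch
from torch import nn


class CrossAttentionBlock(nn.Module):
    """Hypothetical pre-norm block: `x` attends to `context` from another modality."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4) -> None:
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries come from x; keys and values come from the other modality.
        q = self.norm_q(x)
        kv = self.norm_kv(context)
        attended, _ = self.attn(q, kv, kv, need_weights=False)
        x = x + attended                        # residual around attention
        return x + self.mlp(self.norm_mlp(x))   # residual around feed-forward


# Example: 12 text tokens attend to 49 image-patch tokens (both of width 64).
text_tokens = torch.randn(2, 12, 64)
image_tokens = torch.randn(2, 49, 64)
block = CrossAttentionBlock(dim=64, num_heads=4)
fused = block(text_tokens, image_tokens)
```

The output keeps the shape of the query sequence, so blocks like this can be stacked or interleaved with ordinary self-attention blocks.
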

**Describe the solution you'd like**
The architecture can be imported from the Hugging Face `transformers` library.
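
As a rough sketch of what that import could look like, the snippet below builds a small, randomly initialised BERT from the `transformers` package and feeds it vision tokens through its cross-attention layers. The choice of BERT and all sizes here are assumptions for illustration only; the request does not name a specific architecture. Note that BERT only allows `add_cross_attention=True` together with `is_decoder=True`, which also applies a causal mask to the text self-attention, so a dedicated multi-modal block may still be preferable.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny config with random weights: nothing is downloaded.
# All sizes are illustrative assumptions.
config = BertConfig(
    vocab_size=100,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    is_decoder=True,            # BERT requires this to enable cross-attention
    add_cross_attention=True,   # adds a cross-attention sub-layer to each layer
    use_cache=False,
)
model = BertModel(config).eval()

token_ids = torch.randint(0, config.vocab_size, (2, 12))   # text token ids
vision_tokens = torch.randn(2, 49, config.hidden_size)     # e.g. image patch features

with torch.no_grad():
    out = model(input_ids=token_ids, encoder_hidden_states=vision_tokens)

fused = out.last_hidden_state  # text tokens conditioned on the vision tokens
```

The vision features must already be projected to `hidden_size`, since BERT's cross-attention keys and values are computed from inputs of that width.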