Is your feature request related to a problem? Please describe.
Create needed infrastructures to add transformer-based multi-modality(e.g. vision and language) pipeline.
Describe the solution you'd like
The pipeline is targeted for classification applications and supports loading pre-trained weights (e.g. BERT).