Add transformer-based multi-modality pipelines

**Is your feature request related to a problem? Please describe.**
Create the infrastructure needed to add transformer-based multi-modality (e.g. vision and language) pipelines.

**Describe the solution you'd like**
The pipeline targets classification applications and supports loading pre-trained weights (e.g. BERT).
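
To make the request more concrete, here is a minimal sketch of what such a pipeline could look like in PyTorch. It is not an existing API: `MultiModalClassifier`, `load_matching_weights`, and all sizes (including the 30522-token vocabulary, chosen to match `bert-base-uncased`) are hypothetical names and assumptions for illustration. The sketch assumes image features are already extracted (e.g. region or patch features) and concatenates them with text tokens into a single transformer stream.

```python
import torch
from torch import nn


class MultiModalClassifier(nn.Module):
    """Hypothetical single-stream vision-and-language classifier (illustration only)."""

    def __init__(self, vocab_size=30522, img_feat_dim=2048, hidden=64,
                 heads=4, layers=2, num_classes=2, max_tokens=32):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)  # text token embeddings
        self.pos_embed = nn.Embedding(max_tokens, hidden)    # text position embeddings
        self.img_proj = nn.Linear(img_feat_dim, hidden)      # project image features
        self.type_embed = nn.Embedding(2, hidden)            # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(hidden, num_classes)           # classification head

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, n_tokens); img_feats: (batch, n_regions, img_feat_dim)
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        text = self.token_embed(token_ids) + self.pos_embed(positions) + self.type_embed.weight[0]
        image = self.img_proj(img_feats) + self.type_embed.weight[1]
        fused = self.encoder(torch.cat([text, image], dim=1))
        # Classify from the first position (a [CLS]-style token is assumed there).
        return self.head(fused[:, 0])


def load_matching_weights(model, pretrained_state):
    """Copy only tensors whose name and shape match; return the loaded keys."""
    own = model.state_dict()
    matched = {k: v for k, v in pretrained_state.items()
               if k in own and own[k].shape == v.shape}
    model.load_state_dict(matched, strict=False)
    return sorted(matched)


model = MultiModalClassifier()
token_ids = torch.randint(0, 30522, (2, 16))  # 2 samples, 16 text tokens each
img_feats = torch.randn(2, 4, 2048)           # 2 samples, 4 image regions each
logits = model(token_ids, img_feats)          # shape: (2, num_classes)

# Stand-in for a pre-trained checkpoint: same backbone, different head size.
donor = MultiModalClassifier(num_classes=5)
loaded_keys = load_matching_weights(model, donor.state_dict())
```

Loading by name and shape lets the backbone reuse pre-trained tensors while the task-specific head stays freshly initialized. A real BERT checkpoint uses different parameter names, so an actual implementation would also need a key-remapping step that this sketch omits.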