Add `ToDevice` transform to execute transform logic on GPU

**Is your feature request related to a problem? Please describe.**
With a `ToDevice` transform placed after `ToTensor` or `EnsureType`, we could move the data to the GPU and leverage `CacheDataset` to cache the result, avoiding a repeated CPU -> GPU copy in every epoch. It would also allow subsequent transforms to run on the GPU for acceleration.
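
To make the request concrete, here is a minimal sketch of what such a transform might look like. The `ToDevice` class below is hypothetical (it is the feature being requested, not an existing API), and the plain Python list only stands in for the caching that `CacheDataset` would do. It assumes the input has already been converted to a `torch.Tensor`.

```python
import torch


class ToDevice:
    """Hypothetical sketch of the proposed transform: move a tensor to a device."""

    def __init__(self, device):
        self.device = torch.device(device)

    def __call__(self, img):
        if not isinstance(img, torch.Tensor):
            raise TypeError("ToDevice expects a torch.Tensor; convert the data first.")
        return img.to(self.device)


# Fall back to CPU so the sketch still runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
to_device = ToDevice(device)

# Stand-in for a cache: apply the deterministic transform once and keep the results.
raw = [torch.rand(1, 8, 8) for _ in range(4)]
cached = [to_device(x) for x in raw]

for epoch in range(2):
    for item in cached:
        # The item is already on the target device, so no per-epoch copy is needed.
        _ = item * 2.0
```

Because the device copy happens before the cache is filled, any transform applied after the cached step would receive data that is already on the GPU.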