TensorPermutation

A CUDA implementation of tensor permutation. Its memory accesses are coalesced (global memory) and free of bank conflicts (shared memory).
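
To illustrate what coalesced and bank-conflict-free access mean in practice, here is a minimal sketch of the simplest permutation, a 2D transpose. It is not this repository's kernel: the names `transposeTiled`, `transposeMatchesReference`, and `TILE` are hypothetical, and the sketch assumes row-major `float` data and a 32x32 thread block. Each block stages a tile in shared memory so that both the global read and the global write run along the fastest-varying dimension, and the tile is padded by one column so that column-wise shared-memory reads fall into distinct banks.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge, chosen to match the warp size (assumption)

// Transposes a rows x cols row-major matrix: out[c * rows + r] = in[r * cols + c].
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // The extra column shifts each tile row by one bank, so a warp that reads
    // a tile column touches 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int c = blockIdx.x * TILE + threadIdx.x;  // input column
    int r = blockIdx.y * TILE + threadIdx.y;  // input row
    if (r < rows && c < cols)
        tile[threadIdx.y][threadIdx.x] = in[r * cols + c];  // coalesced read

    __syncthreads();

    // Swap the block indices so threadIdx.x runs along the output's
    // fastest-varying dimension.
    int outCol = blockIdx.y * TILE + threadIdx.x;  // equals an input row index
    int outRow = blockIdx.x * TILE + threadIdx.y;  // equals an input column index
    if (outRow < cols && outCol < rows)
        out[outRow * rows + outCol] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Host helper: runs the kernel and compares the result with a CPU reference.
bool transposeMatchesReference(int rows, int cols)
{
    std::vector<float> hIn(size_t(rows) * cols), hOut(size_t(rows) * cols, -1.0f);
    for (size_t i = 0; i < hIn.size(); ++i) hIn[i] = float(i);

    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = hIn.size() * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return false;

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (hOut[size_t(c) * rows + r] != hIn[size_t(r) * cols + c]) return false;
    return true;
}
```

Without the staging tile, either the read or the write would stride through global memory by a full row per thread. Without the `+ 1` padding, the threads of a warp reading `tile[threadIdx.x][threadIdx.y]` would all hit the same bank. A general N-dimensional permutation needs more index arithmetic, but the same two ideas would plausibly carry over.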