This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Description
There is currently no efficient way to load a vector of narrow elements and extend it to a vector of wider elements, e.g. load 4 uint16_t values and extend them to a 4 x uint32_t vector. To simulate such an operation with the current API, we'd need to load the values as a 64-bit scalar (potentially split across two registers on 32-bit architectures), transfer it to a SIMD register (expensive!), and then use shuffles to move the elements into their proper places. With native SIMD ISAs, it can be implemented more efficiently:
PMOVZXWD xmm, [mem] on x86 with SSE4.1
MOVQ xmm, [mem] + PXOR xmm0, xmm0 + PUNPCKLWD xmm, xmm0 on SSE2
VLD1.16 {dX}, [rAddr] + VMOVL.U16 qX, dX on ARMv7+NEON
LD1 {Vx.4H}, [xAddr] + UXTL Vx.4S, Vx.4H on ARM64
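
For reference, the intended semantics and the SSE2 sequence listed above can be sketched in C roughly as follows. This is only an illustration, not a proposed API: the function names `load_extend_u16x4_scalar` and `load_extend_u16x4_sse2` are hypothetical, and the SSE2 path is compiled only when the compiler defines `__SSE2__`.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference (hypothetical helper): load 4 uint16_t values from
   possibly unaligned memory and zero-extend each one to 32 bits. */
static inline void load_extend_u16x4_scalar(const void *src, uint32_t out[4]) {
    uint16_t narrow[4];
    memcpy(narrow, src, sizeof narrow);   /* unaligned-safe 64-bit load */
    for (int i = 0; i < 4; ++i) {
        out[i] = (uint32_t)narrow[i];     /* zero extension */
    }
}

#if defined(__SSE2__)
/* SSE2 sketch (hypothetical helper) of MOVQ + PXOR + PUNPCKLWD. */
static inline void load_extend_u16x4_sse2(const void *src, uint32_t out[4]) {
    __m128i narrow = _mm_loadl_epi64((const __m128i *)src); /* MOVQ xmm, [mem] */
    __m128i zero   = _mm_setzero_si128();                   /* PXOR xmm0, xmm0 */
    __m128i wide   = _mm_unpacklo_epi16(narrow, zero);      /* PUNPCKLWD */
    _mm_storeu_si128((__m128i *)out, wide);                 /* 4 x uint32_t */
}
#endif
```

Interleaving the low four 16-bit lanes with zeros places each value in the low half of a 32-bit lane, which on little-endian x86 is exactly a zero extension. With SSE4.1 the load and the extension fold into the single PMOVZXWD instruction.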