#1579 notes some unfinished business:
Generation for the Simd and m128i (etc.) types should be equivalent, but the implementations differ: the Simd impls currently use fill to avoid additional unsafe code.
Notice from the above that u32x4, u16x8 and u8x16 are the same size as u128 and m128i, yet cost roughly twice as much to generate here. This suggests the fill code may be sub-optimal.
Additionally, the m128i impl performed even worse when transmuting a u128 value (~4.3 ns, or +130%), which, as far as I can tell, is purely because the u128 value is returned via rax/rdx while the __m128i value is returned via rdx/r10 (with rax holding the struct return address). I don't understand this.
Optimizing Fill for such cases may not be possible without specialization, and even then it's unclear whether we'd want to, given the implied value-breaking changes.
Optimizing the SIMD impls would require either specialization or replacing the generic Simd<$ty, LANES> impls with a (large) number of specific impls.