The Little Engineer That Could: ARM SIMD

My Pyramid Building Simulator is coming along nicely. Check out the gameplay video I made for it.

My Core i5-4570 and NVIDIA GTX 750 Ti run this simulation at an easy 100fps at 1920x1200 pixels. It always leaves me wondering: could it ever be done on mobile, on 64 bit iOS or 64 bit Android? If it's possible, it will require some aggressive optimization, as the current code already uses AVX SIMD.

But it's an itch I have to scratch: can I do the same with NEON on 64 bit ARM? So let's dive into that world. I've never done assembly, intrinsics or SIMD on ARM before, so it's all new to me. I've found a developer who went an interesting route: translating x86 SSE2 to ARM NEON using a translation layer. But I think it pays off more to use ARM NEON intrinsics directly.

So what have I been able to find out so far?

This ARM intrinsic reference is a great resource.
An old blog post at hilbert-space.de warns that intrinsics can be much slower than hand-written assembly. I believe this is no longer true: it was mainly an issue with older GCC versions, and compilers have matured since.
ARM does have the notion of 16 bit floats, but unfortunately it seems to be a storage format only, and not suitable for calculations. This is a pity, as it seems to rule out 8-way floating point SIMD on ARM. I may be mistaken, but it looks like you can't do better than 4-way floating point SIMD on ARM, which is a far cry from the x86 world, where 8xSIMD (AVX/AVX2) and 16xSIMD (AVX-512) are possible.
NEON intrinsics look like vXXXq_FMT, where v signifies the vector nature, q means 128 bits, and FMT specifies integer/float and element width. For instance, vmulq_f32() multiplies 128 bit vectors containing 32 bit floats, so this would be 4xSIMD (128 / 32 = 4 lanes).
For conditional moving of values (which in x86 parlance is vblendvps, or fsel in PowerPC speak) you would use Bitwise Select, vbslq, in the ARM NEON world. In NEON, this intrinsic is actually much more natural than its x86 counterpart, as it has a more logical operand ordering. It follows the same ordering as the ?: operator in C: vbslq( condition, iftrue, iffalse ).
Writing NEON intrinsics is actually quite enjoyable compared to the x86 world, where the MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512 transitions left the set of intrinsics quite convoluted. The naming scheme in ARM NEON is much cleaner, and you need to consult the reference less often, as it contains few surprises in naming. The comparison intrinsics are also easier, as there is no third operand as is the case in x86. The type of condition is specified in the name of the intrinsic instead.
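Ugh, classic (ARMv7) NEON doesn't have square root or exact reciprocals. Only an estimate for the reciprocal, and an estimate for the reciprocal of the square root. It looks like pmeerw has a solution to this. The 64 bit instruction set does appear to add full-precision vector square root and divide (vsqrtq_f32, vdivq_f32), so this may matter less there.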
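
To make the naming scheme, the comparison intrinsics and Bitwise Select a bit more concrete, here is a minimal sketch of a branch-free conditional. The function name scale_where_greater and its signature are made up for illustration and do not come from the simulator; the length is assumed to be a multiple of 4. The scalar branch is only there so the snippet also builds on a non-ARM machine.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (a[i] > b[i]) ? a[i] * scale : b[i].
   Assumes n is a multiple of 4 (one 128 bit vector = 4 floats). */
void scale_where_greater(const float *a, const float *b, float scale,
                         float *out, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t vscale = vdupq_n_f32(scale);   /* broadcast scale to 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* The condition is part of the name: cgt = compare greater-than.
           Each lane of the result is all ones where a > b, else all zeros. */
        uint32x4_t gt = vcgtq_f32(va, vb);
        float32x4_t scaled = vmulq_f32(va, vscale);
        /* Same operand order as ?: in C: (condition, iftrue, iffalse). */
        vst1q_f32(out + i, vbslq_f32(gt, scaled, vb));
    }
#else
    /* Scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n; ++i)
        out[i] = (a[i] > b[i]) ? a[i] * scale : b[i];
#endif
}
```

Both sides of the conditional are computed for every lane and the mask picks between them afterwards, which is the usual trade-off when replacing a branch with a select.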
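
The reciprocal estimate can be turned into something usable with a couple of Newton-Raphson steps, for which NEON provides a dedicated step intrinsic. The sketch below is the textbook refinement, not necessarily the approach pmeerw takes. The function name reciprocal_f32 is hypothetical, the length is assumed to be a multiple of 4, and the scalar branch only exists so the snippet builds on non-ARM machines.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] ~= 1 / x[i].
   Assumes n is a multiple of 4 and that no x[i] is zero. */
void reciprocal_f32(const float *x, float *out, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t v = vld1q_f32(x + i);
        float32x4_t r = vrecpeq_f32(v);        /* low-precision initial estimate */
        /* vrecpsq_f32(v, r) computes (2 - v*r), so each line below is one
           Newton-Raphson step: r = r * (2 - v*r). */
        r = vmulq_f32(r, vrecpsq_f32(v, r));
        r = vmulq_f32(r, vrecpsq_f32(v, r));
        vst1q_f32(out + i, r);
    }
#else
    /* Scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n; ++i)
        out[i] = 1.0f / x[i];
#endif
}
```

Each step roughly doubles the number of correct bits, so the number of steps is a speed versus accuracy knob. The reciprocal square root follows the same pattern with vrsqrteq_f32 for the estimate and vrsqrtsq_f32 for the step.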

I will update this posting as I learn more about ARM SIMD.


Wednesday, July 29, 2015
