## Thursday, July 30, 2015

### Solving quadratic equations on ARM NEON.

So two years ago, I wrote a function that solves quadratic equations, eight at a time, using AVX. Lately, I have been looking into ARM NEON to see what performance can be had on mobile devices. This is what I came up with. The square root implementation is available from pmeerw's blog.

```c
/*
* solve aX^2 + bX + c = 0
* solves 4 instances at the same time, using NEON SIMD without any branching to avoid stalls.
* returns two solutions per equation in root0 and root1.
* returns FLT_UNDF if there is no solution due to discriminant being negative.
* For sqrtv() see: https://pmeerw.net/blog/programming/neon1.html
* I've put the reciprocal of (2*a) in the argument list as this one is fixed in my particular problem.
*/

static void solve_quadratics4
(
float32x4_t a,
float32x4_t twoa_recip,
float32x4_t b,
float32x4_t c,
float32x4_t* __restrict root0,
float32x4_t* __restrict root1
)
{
const float32x4_t four4 = vdupq_n_f32( 4.0f );
const float32x4_t undf4 = vdupq_n_f32( FLT_UNDF );
const float32x4_t minb = vnegq_f32( b );                        // -b
const float32x4_t bb = vmulq_f32( b, b );                       // b*b
const float32x4_t foura = vmulq_f32( four4, a );                // 4*a
const float32x4_t fourac = vmulq_f32( foura, c );               // 4*a*c
const float32x4_t det = vsubq_f32( bb, fourac );                // b*b - 4*a*c
// Solutions are only real if the discriminant is non-negative: 4*a*c <= b*b.
const uint32x4_t  dvalid = vcleq_f32( fourac, bb );
const float32x4_t sr = sqrtv( det );                            // approximation of sqrt( b*b - 4*a*c )
float32x4_t r0 = vaddq_f32( minb, sr );                         // -b + sqrt( b*b - 4*a*c )
float32x4_t r1 = vsubq_f32( minb, sr );                         // -b - sqrt( b*b - 4*a*c )
r0 = vmulq_f32( r0, twoa_recip );                               // ( -b + sqrt( b*b - 4*a*c ) ) / (2*a)
r1 = vmulq_f32( r1, twoa_recip );                               // ( -b - sqrt( b*b - 4*a*c ) ) / (2*a)
// Mark lanes with a negative discriminant as having no solution.
*root0 = vbslq_f32( dvalid, r0, undf4 );
*root1 = vbslq_f32( dvalid, r1, undf4 );
}
```

I benchmarked this code on a Samsung Galaxy Note 4 (Android) against a scalar version. The speed-up I measured was 3.7X, which I think is pretty good.

So, yeah, writing intrinsics is totally justified. Doubly so, because I tried to have the compiler auto-vectorize, but whichever flag I tried, the results hardly differed: -fno-vectorize, -ftree-vectorize and -fslp-vectorize-aggressive all showed the same performance.

## Wednesday, July 29, 2015

### ARM SIMD

My Pyramid Building Simulator is coming along nicely. Check out the game play video I made for it.

So my Core i5-4570 and NVIDIA GTX 750 Ti run this simulation at an easy 100fps at 1920x1200 pixels. It always leaves me wondering: could it ever be done on mobile, on 64 bit iOS or 64 bit Android? If it's possible, it will require some aggressive optimization, as the current code already uses AVX SIMD.

But it's an itch I have to scratch: can I do the same on 64 bit ARM NEON? So let's dive into that world. I've never done assembly, intrinsics or SIMD on ARM before, so it's all new to me. I found a developer who went an interesting route: translating x86 SSE2 to ARM NEON using a translation layer. But I think it pays more to apply ARM NEON intrinsics directly.

So what have I been able to find out so far?

• This ARM intrinsic reference is a great resource.
• An old blog post at hilbert-space.de warns that intrinsics can be much slower than hand-written assembly. I believe this is no longer true, as compilers have matured; it was mainly an issue with older gcc versions.
• ARM does have the notion of 16 bit floats, but unfortunately, it seems to be a storage format only, and not suitable for calculations. This is a pity, as it seems to rule out 8-way floating point SIMD on ARM. I may be mistaken, but it looks like you can't do better than 4-way floating point SIMD on ARM, which is a far cry from the x86 world where 8xSIMD (AVX/AVX2) and 16xSIMD (AVX-512) is possible.
• NEON Intrinsics look like vXXXq_FMT where v signifies the vector nature, q means 128 bits, and FMT specifies integer/float and width. So for instance: vmulq_f32() that multiplies 128 bit vectors containing 32 bit floats, so this would be 4xSIMD.
• For conditionally moving values (which in x86 parlance is vblendps, or fsel in PowerPC speak) you use Bitwise Select, vbslq, in the ARM NEON world. In NEON, this intrinsic is actually much more natural than the x86 counterpart, as it has a more logical operand ordering. It follows the same ordering as the ?: operator in C: vbslq( condition, iftrue, iffalse ).
• Writing NEON intrinsics is actually quite enjoyable compared to the x86 world, where the MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512 transitions left the set of intrinsics quite convoluted. The naming scheme in ARM NEON is much cleaner, and requires less use of references as it contains few surprises in naming. The comparison intrinsics are also easier, as there is no third operand as is the case in x86. The type of condition is specified in the name of the intrinsic instead.
• Ugh, NEON doesn't have square root or division. It only offers an estimate of the reciprocal, and an estimate of the reciprocal square root. It looks like pmeerw has a solution for this.
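To make the Bitwise Select semantics concrete, here is a plain-C emulation of what vbslq does to a single 32-bit lane (pure illustration, not NEON code; the helper name is mine). NEON comparisons set all bits of a lane to 1 or 0, so per lane this behaves exactly like the ?: operator:

```c
#include <stdint.h>
#include <string.h>

/* Scalar emulation of one lane of NEON vbslq: for each bit position,
 * take the bit from 'iftrue' where 'mask' is 1, else from 'iffalse'.
 * With an all-ones or all-zeros mask (as produced by NEON comparisons)
 * this is a per-lane conditional move. */
static float bsl_lane( uint32_t mask, float iftrue, float iffalse )
{
    uint32_t t, f, r;
    float    out;
    memcpy( &t, &iftrue,  sizeof t );
    memcpy( &f, &iffalse, sizeof f );
    r = ( mask & t ) | ( ~mask & f );   /* bitwise select */
    memcpy( &out, &r, sizeof out );
    return out;
}
```

With mask 0xFFFFFFFF the result is iftrue; with mask 0 it is iffalse, matching vbslq( condition, iftrue, iffalse ).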
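The missing square root boils down to refining the reciprocal-square-root estimate with Newton-Raphson and then multiplying by x, since sqrt(x) = x * (1/sqrt(x)). A scalar sketch of that idea (the classic bit-trick guess stands in for vrsqrteq_f32 here; on NEON, vrsqrtsq_f32 supplies the (3 - x*y*y)/2 refinement factor):

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of computing sqrt(x) from a reciprocal-square-root
 * estimate, as NEON requires. The rough initial guess plays the role
 * of vrsqrteq_f32; each refinement line plays the role of one
 * vrsqrtsq_f32 + multiply step. Not valid for x <= 0. */
static float sqrt_via_rsqrt( float x )
{
    uint32_t i;
    float    y;
    memcpy( &i, &x, sizeof i );
    i = 0x5f3759df - ( i >> 1 );            /* rough 1/sqrt(x) estimate */
    memcpy( &y, &i, sizeof y );
    y = y * ( 3.0f - x * y * y ) * 0.5f;    /* Newton-Raphson step */
    y = y * ( 3.0f - x * y * y ) * 0.5f;    /* second step for precision */
    return x * y;                           /* sqrt(x) = x * (1/sqrt(x)) */
}
```

Two refinement steps already bring the result close to full single precision for well-scaled inputs; how many steps are worth it is a speed/accuracy trade-off.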