Two years ago, I wrote a function that solves quadratic equations, 8 at a time, using AVX. Lately, I have been looking into ARM NEON to see what performance can be had on mobile devices. This is what I came up with. The square root implementation is available from pmeerw (see the link in the code comment).
#include <arm_neon.h>

/*
 * solve aX^2 + bX + c = 0
 * solves 4 instances at the same time, using NEON SIMD without any branching to avoid stalls.
 * returns two solutions per equation in root0 and root1.
 * returns FLT_UNDF if there is no solution due to discriminant being negative.
 * For sqrtv() see: https://pmeerw.net/blog/programming/neon1.html
 * I've put the reciprocal of (2*a) in the argument list as this one is fixed in my particular problem.
 */
inline void evaluate_quadratic4
(
	float32x4_t a,
	float32x4_t twoa_recip,
	float32x4_t b,
	float32x4_t c,
	float32x4_t* __restrict root0,
	float32x4_t* __restrict root1
)
{
	const float32x4_t four4  = vdupq_n_f32( 4.0f );
	const float32x4_t undf4  = vdupq_n_f32( FLT_UNDF );
	const float32x4_t minb   = vnegq_f32( b );           // -b
	const float32x4_t bb     = vmulq_f32( b, b );        // b*b
	const float32x4_t foura  = vmulq_f32( four4, a );    // 4*a
	const float32x4_t fourac = vmulq_f32( foura, c );    // 4*a*c
	const float32x4_t det    = vsubq_f32( bb, fourac );  // discriminant: b*b - 4*a*c
	// Only lanes with a non-negative discriminant (4*a*c <= b*b) have real roots.
	const uint32x4_t dvalid  = vcleq_f32( fourac, bb );
	const float32x4_t sr     = sqrtv( det );             // approximation of sqrt( b*b - 4*a*c )
	float32x4_t r0 = vaddq_f32( minb, sr );              // -b + sqrt( b*b - 4*a*c )
	float32x4_t r1 = vsubq_f32( minb, sr );              // -b - sqrt( b*b - 4*a*c )
	r0 = vmulq_f32( r0, twoa_recip );                    // ( -b + sqrt( b*b - 4*a*c ) ) / (2*a)
	r1 = vmulq_f32( r1, twoa_recip );                    // ( -b - sqrt( b*b - 4*a*c ) ) / (2*a)
	// Replace the results of lanes with a negative discriminant by FLT_UNDF.
	*root0 = vbslq_f32( dvalid, r0, undf4 );
	*root1 = vbslq_f32( dvalid, r1, undf4 );
}
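The branch-free part of the function above comes down to two steps: vcleq_f32 produces a per-lane mask that is all ones or all zeros, and vbslq_f32 picks bits from one of two inputs according to that mask. A portable one-lane model in plain C may make this easier to follow. The names lane_cle and lane_bsl are made up for this sketch and are not part of NEON.

```c
#include <stdint.h>
#include <string.h>

/* One-lane model of vcleq_f32: all ones if x <= y, otherwise all zeros.
 * (x <= y) is 0 or 1, so 0 - 1 wraps around to 0xFFFFFFFF. */
static inline uint32_t lane_cle(float x, float y)
{
    return (uint32_t)0 - (uint32_t)(x <= y);
}

/* One-lane model of vbslq_f32: take bits from if_set where the mask bit
 * is 1, and from if_clear where it is 0. No branch is involved. */
static inline float lane_bsl(uint32_t mask, float if_set, float if_clear)
{
    uint32_t s, c, r;
    float out;
    memcpy(&s, &if_set, sizeof s);     /* reinterpret float bits as integer */
    memcpy(&c, &if_clear, sizeof c);
    r = (mask & s) | (~mask & c);      /* bitwise select */
    memcpy(&out, &r, sizeof out);
    return out;
}
```

With these, lane_bsl( lane_cle( fourac, bb ), r0, undf ) mirrors what a single lane of the final vbslq_f32 call does: the root is computed unconditionally and simply overwritten when the discriminant is negative.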
I benchmarked this code on a Galaxy Note 4 running Android, against a scalar version. The speed-up I measured was 3.7x, which I think is pretty good.
So, yeah, writing intrinsics is totally justified. Doubly so, because I tried to have the compiler auto-vectorize, but whichever flag I tried, the results hardly differed: -fno-vectorize, -ftree-vectorize and -fslp-vectorize-aggressive all showed the same performance.