Thursday, July 30, 2015

Solving quadratic equations on ARM NEON.

So two years ago, I've been coding a function that solves quadratic equations, 8 at a time using AVX. Lately, I have been looking into ARM NEON to see what performance can be had on mobile devices. This is what I came up with. The square root implementation is available pmeerw.

/*
 * solve aX^2 + bX + c = 0
 * solves 4 instances at the same time, using NEON SIMD without any branching to avoid stalls.
 * returns two solutions per equation in root0 and root1.
 * returns FLT_UNDF if there is no solution due to discriminant being negative.
 * For sqrtv() see: https://pmeerw.net/blog/programming/neon1.html
 * I've put the reciprocal of (2*a) in the argument list as this one is fixed in my particular problem.
 */

inline void evaluate_quadratic4
(
        float32x4_t a,
        float32x4_t twoa_recip,
        float32x4_t b,
        float32x4_t c,
        float32x4_t* __restrict root0,
        float32x4_t* __restrict root1
)
{
        const float32x4_t four4 = vdupq_n_f32( 4.0f );
        const float32x4_t undf4 = vdupq_n_f32( FLT_UNDF );
        const float32x4_t minb = vnegq_f32( b );                        // -b
        const float32x4_t bb = vmulq_f32( b, b );                       // b*b
        const float32x4_t foura = vmulq_f32( four4, a );                // 4*a
        const float32x4_t fourac = vmulq_f32( foura, c );               // 4*a*c
        const float32x4_t det = vsubq_f32( bb, fourac );                // b*b - 4*a*c
        // We want only positive roots!
        const uint32x4_t  dvalid = vcleq_f32( fourac, bb );
        const float32x4_t sr = sqrtv( det );                            // approximation of sqrt( b*b - 4*a*c )
        float32x4_t r0 = vaddq_f32( minb, sr );                         // -b + sqrt( b*b - 4*a*c )
        float32x4_t r1 = vsubq_f32( minb, sr );                         // -b - sqrt( b*b - 4*a*c )
        r0 = vmulq_f32( r0, twoa_recip );                               // ( -b + sqrt( b*b - 4*a*c ) ) / (2*a)
        r1 = vmulq_f32( r1, twoa_recip );                               // ( -b - sqrt( b*b - 4*a*c ) ) / (2*a)
        // Filter out negative roots.
        *root0 = vbslq_f32( dvalid, r0, undf4 );
        *root1 = vbslq_f32( dvalid, r1, undf4 );
}

I benched this code on Android Galaxy Note4, against a scalar version. The speed-up I measured was 3.7X which I think is pretty good.

So, yeah, writing intrinsics is totally justified. Doubly so, because I tried to have the compiler auto vectorize, but whatever flag I tried, results hardly differed: -fno-vectorize, -ftree-vectorize and -fslp-vectorize-aggressive all showed the same performance.

No comments:

Post a Comment