## How to Find a Fast Floating-Point atan2 Approximation

#### Context

Over a short period of time, I came across nearly identical approximations of the two parameter arctangent function, atan2, developed by different companies, in different countries, and even in different decades. Fascinated with how the coefficients used in these approximations were derived, I set out to find them.

This atan2 implementation is based around a rational approximation of arctangent on the domain -1 to 1:

$atan(z) \approx \dfrac{z}{1.0 + 0.28z^2}$

But looking at the arctangent, it also resembles a cubic polynomial. So in addition to deriving the rational approximation, I set out to find a polynomial version too. It turns out there is a 3rd order polynomial that is a close match to the rational approximation:

$atan(z) \approx 0.9724x - 0.1919x^3$

The rest of this post will cover how to reproduce the polynomial coefficients, how to implement them in the atan2 context, and wrapping up with some other thoughts.

(Note: I find it odd that this 3rd order polynomial did not come up in my research. I believe this is because the rational approximation was developed for fixed-point math and was adopted for floating-point later, but I did not find evidence of this [3].)

#### Arctangent Approximation

The Remez Method is an iterative, minimax algorithm that for a given function converges to a polynomial (or rational approximation) which minimizes the max error. Conveniently, the Boost libraries includes a robust implementation in the math::tools namespace with the class remez_minimax [4]. The class’s numerator() and denominator() contain the coefficients at the end of the iterative process.

Although I planned to target single precision floats, it was necessary to use double precision to get useful coefficients. Below is the code necessary to initialize the remez_minimax class for atan and run iterations to find the coefficients for the polynomial approximation. At a certain point, max_change stays constant or becomes periodic and the process can stop.

math::tools::remez_minimax<double> approximate_atan(
[](const double& x) { return atan(x); }, // double precision required
3, 0, // 3rd degree polynomial
-1, 1, // -1 to 1 range
false, 0, 0, 64); // other params

do
{
approximate_atan.iterate();
} while (approximate_atan.max_change() > 0.00001);

const math::tools::polynomial<double> coefficients = approximate_atan.numerator();
const double max_error = approximate_atan.max_error();


After running this, the coefficients of {0.0, 0.97239411, 0.0, -0.19194795} were isolated, and the max error was just slightly less than 0.005 radians, which was comparable to rational version.

I confirmed the results twice:

1. On Wolfram Alpha with input: sup(abs(atan(x) – 0.9724*x + 0.1919(x^3)) from -1 to 1
2. Testing every single precision floating point value between -1 and 1 compared against the atan in MSVC.

The other metric I looked at was average error. Although the max error was comparable, the average error was 40% higher. If the max error is acceptable for any value (about 0.28 degrees), then it is likely also acceptable for all values.

#### Implementing atan2

Set with a simple approximation of arctangent over the range [-1,1], the next step is integrating it into an atan2 implementation. To handle values outside the [-1,1] range, the following arctan identity is useful:

$arctan(y/x) = \pi/2 - arctan(x/y), \text{if} \left|y/x\right| > 1.$

To handle the different permutations of signs of y and x, I followed the definition of atan2 from Wikipedia [1]. There are comments to call out each of the 7 cases. The two extra cases handle the the y term being greater than x in which the above identity is needed.

atan

// Polynomial approximating arctangenet on the range -1,1.
// Max error < 0.005 (or 0.29 degrees)
float ApproxAtan(float z)
{
const float n1 = 0.97239411f;
const float n2 = -0.19194795f;

return (n1 + n2 * z * z) * z;
}


atan2

float ApproxAtan2(float y, float x)
{
if (x != 0.0f)
{
if (fabsf(x) > fabsf(y))
{
const float z = y / x;
if (x > 0.0)
{
// atan2(y,x) = atan(y/x) if x > 0
return ApproxAtan(z);
}
else if (y >= 0.0)
{
// atan2(y,x) = atan(y/x) + PI if x < 0, y >= 0
return ApproxAtan(z) + PI;
}
else
{
// atan2(y,x) = atan(y/x) - PI if x < 0, y < 0
return ApproxAtan(z) - PI;
}
}
else // Use property atan(y/x) = PI/2 - atan(x/y) if |y/x| > 1.
{
const float z = x / y;
if (y > 0.0)
{
// atan2(y,x) = PI/2 - atan(x/y) if |y/x| > 1, y > 0
return -ApproxAtan(z) + PI_2;
}
else
{
// atan2(y,x) = -PI/2 - atan(x/y) if |y/x| > 1, y < 0
return -ApproxAtan(z) - PI_2;
}
}
}
else
{
if (y > 0.0f) // x = 0, y > 0
{
return PI_2;
}
else if (y < 0.0f) // x = 0, y < 0
{
return -PI_2;
}
}

return 0.0f; // x,y = 0. Could return NaN instead.
}


Notice that only finite floating point values are handled. Permutations of +/- infinity and +/- 0 all have defined values under the IEEE standard. I decided to not include these for this this function.

#### Looking for a Bit More Speed

The goal of the approximation is to be faster than the precise implementation and with the math simplified (at least as much as I could show), the remaining optimization targeted the branching and reducing the amount of floating point comparisons.

It is possible to eliminate the three branches that determined the +/-PI or +/-PI/2 offsets to atan with bit manipulation instead. The nice thing here, and it is even suggested by the disassembly, was that some of the logic could be performed on the ALU and FPU in parallel. Overall this made an impact of about 5%, but at the cost of being harder to read.

float ApproxAtan2(float y, float x)
{
const float n1 = 0.97239411f;
const float n2 = -0.19194795f;
float result = 0.0f;

if (x != 0.0f)
{
const union { float flVal; uint32 nVal; } tYSign = { y };
const union { float flVal; uint32 nVal; } tXSign = { x };

if (fabsf(x) >= fabsf(y))
{
union { float flVal; uint32 nVal; } tOffset = { PI };

// Add or subtract PI based on y's sign.
tOffset.nVal |= tYSign.nVal & 0x80000000u;
// No offset if x is positive, so multiply by 0 or based on x's sign.
tOffset.nVal *= tXSign.nVal >> 31;
result = tOffset.flVal;

const float z = y / x;
result += (n1 + n2 * z * z) * z;
}
else // Use atan(y/x) = pi/2 - atan(x/y) if |y/x| > 1.
{
union { float flVal; uint32 nVal; } tOffset = { PI_2 };

// Add or subtract PI/2 based on y's sign.
tOffset.nVal |= tYSign.nVal & 0x80000000u;
result = tOffset.flVal;

const float z = x / y;
result -= (n1 + n2 * z * z) * z;
}
}
else if (y > 0.0f)
{
result = PI_2;
}
else if (y < 0.0f)
{
result = -PI_2;
}

return result;
}


Example disassembly of ALU and FPU operations being interleaved:

...
tOffset.nVal |= tYSign.nVal & 0x80000000u;
and         ecx, 80000000h
or ecx, dword ptr[tOffset]
result = tOffset.flVal;

float z = y / x;
divss       xmm4, xmm5
tOffset.nVal *= tXSign.nVal >> 31;
shr         eax, 1Fh
imul        ecx, eax
result += (n1 + n2 * z * z) * z;
...


#### SIMD Optimization

For work, I re-wrote ApproxAtan2 using SSE2 intrinsics to calculate 4 atan2 results simultaneously. Taking out any overhead of packing the 4 sets of x and y values, it ran about 40% faster than 4 separate calls to ApproxAtan2 run in serial. The function ended up being 38 intrinsic instructions which intuitively feels a bit more than necessary, but I limited my time to look at it. I might go back and try writing it again with the goals of being: more readable, use less instructions, and allow myself to use newer instruction sets.

#### Results

The tests I ran were a number of different inputs representing some different contexts with about 1 billion pairs of values. The values below are in seconds and have had the baseline, or time to generate inputs, removed. (I only ran one comparable test of the SIMD version vs the polynomial approximation.)

The tests represents different inputs of what I expected in typical geometric applications. I also included a contrived test to demonstrate how the DIVSS operation on i7 can have dramatically different performance based on the input. The polynomial approximation with reduced branching seems to be 25% to 50% faster.

Test std::atan2 rational approx polynomial approx polynomial SIMD
1 100.3 24.1 18.4 46.0 (11.5)
2 70.1 14.4 7.4 na
3 70.5 14.3 8.1 na
4 8.3 51.2 50.8 na
1. 1 billion values between -1 and 1 generated from sinf and cosf
2. 1 billion values between -1 and 1 roughly evenly spaced
3. 1 billion values between -1e6 and 1e6
4. How to get poor performance or small number divided by large number: 1 billion values where y = 0.5 and x = 1e20

#### Limitations of Remez Minimax

The Boost class for the Remez minimax algorithm is quite nice in that you can test a number of polynomial and rational combinations to find the best set of coefficients for the required performance. I have tried this for arccos and tanh as well. However the algorithm only minimizes the error for polynomial and rational approximations when there are other types of function approximations. Modern FPU architectures can perform sqrt comparably fast to division for example. Recursive functions or continued fractions are also often decent approximations. In the case of arccos there are long known approximations using both sqrt and recursion which are superior to polynomial and rational functions found using the Remez method. There are also cases where the Remez algorithm can get stuck or become unstable, although there are work arounds.