# Complex arithmetic on a CPU

Complex numbers have a wide range of applications, including the previously discussed Fourier transform. This article isn't meant to be an introduction to complex numbers, but rather a discussion of their various representations. I won't go deep into cubic equations and the problems with the formulas discovered for them, but a quick rundown doesn't hurt:

- First, the need arose to express the square roots of negative numbers, which were previously thought not to exist. These became the **imaginary numbers**, and squaring one yields a negative real number. It is also worth mentioning that an imaginary number corresponds to each negative real number and vice versa.
- We wanted to keep these new numbers as compatible with our old systems as we could, so we made some rearrangements.
- Finally, we could combine real and imaginary numbers to get the **complex numbers**.

Now instead of keeping track of infinitely many imaginary numbers, we could reduce all of them to one, the simplest. Thus we defined j (or i, as you prefer) as the imaginary unit with j² = −1; every imaginary number is then a real multiple of j.

Just to be clear, the reals are a subset of the complex numbers, so 1, 2 and 3 are complex as well.

## Basic operations

Complex numbers are defined by their real and imaginary parts. We can also represent them in a coordinate system, where the x axis represents the real part and the y axis the imaginary part.

```
struct Complex {
    float real;
    float imag;
};

Complex c = {3, 4};
```

**Addition** between two complex numbers is simple: just add their components independently, as you would with vectors.

```
Complex c_add(Complex c1, Complex c2) {
    return { c1.real + c2.real, c1.imag + c2.imag };
}
```

**Subtraction** is pretty much the same:

```
Complex c_sub(Complex c1, Complex c2) {
    return { c1.real - c2.real, c1.imag - c2.imag };
}
```

Things get slightly more complicated with **multiplication**. The trick here is that we have to multiply the two imaginary parts, which yields a real number but with a negative sign: (a + bj)(c + dj) = (ac − bd) + (ad + bc)j. This is also much harder to interpret geometrically, so I didn't include that representation *yet*.

```
Complex c_mul(Complex c1, Complex c2) {
    return { c1.real*c2.real - c1.imag*c2.imag,
             c1.real*c2.imag + c2.real*c1.imag };
}
```

Before we continue, take a look at how many real operations these functions cost. Addition and subtraction each needed two real additions or subtractions, while multiplication took four real multiplications, one addition and one subtraction.

We can do better in terms of multiplications if we use **Gauss' complex multiplication algorithm**. With the introduction of three intermediate variables, we can reduce the number of multiplications needed to three:

```
Complex c_mul_gauss(Complex c1, Complex c2) {
    float k1 = c2.real * (c1.real + c1.imag);
    float k2 = c1.real * (c2.imag - c2.real);
    float k3 = c1.imag * (c2.real + c2.imag);
    return { k1 - k3, k1 + k2 };
}
```

There are a few different ways to express this due to the symmetry, but we can't go below three multiplications. On the other hand, we now had to do three additions and two subtractions, for a total of eight operations instead of six. Was this tradeoff really worth it?

It depends on your hardware, and you should take a look at the timings each operation takes (example). But as a general rule of thumb, we can make the following observations: addition and subtraction cost the same, multiplication is slower, and division is slower still.

Operation | Relative cost |
---|---|
`+`, `-` | cheapest, equal |
`*` | slower |
`/` | slowest |

Also, from now on I'll count subtractions as additions to make things easier.

Finally we need to do some work to get the **division** of (a + bj) by (c + dj) done.

After expanding the fraction with the complex conjugate (c − dj), the imaginary terms in the denominator cancel each other:

(a + bj)/(c + dj) = ((a + bj)(c − dj)) / ((c + dj)(c − dj)) = ((a + bj)(c − dj)) / (c² + d²)

Rearranging the terms:

((ac + bd) + (bc − ad)j) / (c² + d²)

And in code form:

```
Complex c_div(Complex c1, Complex c2) {
    float dsqr = c2.real * c2.real + c2.imag * c2.imag;
    return { (c1.real*c2.real + c1.imag*c2.imag) / dsqr,
             (c1.imag*c2.real - c1.real*c2.imag) / dsqr };
}
```

This method uses three additions, six multiplications and two divisions. This is not the fastest method; can we find some speedups here as well?

As a side note, two of the six multiplications are squarings, which can be a little bit faster than regular multiplications.

Only a bit. Hidden inside complex division there is essentially a complex multiplication (the numerator is multiplied by the conjugate), and we can bring its four multiplications down to three:

```
Complex c_div_fast(Complex c1, Complex c2) {
    float k1 = c1.real * (c2.real + c2.imag);
    float k2 = c2.real * (c1.real + c1.imag);
    float k3 = c2.imag * (c1.real - c1.imag);
    float dsqr = c2.real * c2.real + c2.imag * c2.imag;
    return { (k1 - k3) / dsqr,
             (k2 - k1) / dsqr };
}
```

This method does six additions, five multiplications and two divisions.

## Polar form

We can also rewrite our complex numbers into polar coordinates. Instead of representing them with real and imaginary values, we can define a length *r* and an angle *θ* for each complex number.

We can present this new form with sines and cosines:

a + bj = r(cos θ + j sin θ)

Where *r* and *θ* can be obtained from (using atan2 here):

r = √(a² + b²),  θ = atan2(b, a)

And we can use Euler's formula, r(cos θ + j sin θ) = r·e^(jθ), to simplify our notation; this allows for the use of the nice exponential identities.

You are probably already familiar with this. The emphasis in this article is on how this form can be used to accelerate our computations. The good news is that complex multiplication becomes really simple; it only takes one real multiplication and one addition:

r1·e^(jθ1) · r2·e^(jθ2) = (r1·r2)·e^(j(θ1 + θ2))

The division simplifies as well, to one real division and one subtraction:

(r1·e^(jθ1)) / (r2·e^(jθ2)) = (r1/r2)·e^(j(θ1 − θ2))

Cool, now what about addition? The bad news is that you can't really do it without converting back to rectangular form. And this conversion really takes its toll: **two squarings**, a **square root** and an **arctangent** one way; **sines**, **cosines** and **multiplications** on the way back. These operations are usually much slower than the regular arithmetic ones, so in practice, where you have to do additions as well, this form is rarely used.

If you happen to have a use case where lots of repeated complex multiplications take place without any additions meanwhile, this might be useful, but otherwise you are better off with the regular form.

## SIMD

There are other things to consider when evaluating performance, and one of them is parallelization. For addition and subtraction it is easy to see that the real and imaginary parts are independent from each other, so they can be calculated concurrently. For multiplication this is not so trivial. The naive version with four multiplications has a concurrency diagram with depth two: the four products can be computed in parallel, followed by one addition and one subtraction.

Whereas the similar diagram for Gauss' method is one step deeper: first the three inner additions and subtractions, then the three multiplications, and finally the combining addition and subtraction.

In conclusion: if we have four multiplier units available, we can do a complex multiplication in only two steps. We can use the **single instruction, multiple data** (SIMD) extensions in x86 to accomplish something like that. The following code loads the four floats into a single 128-bit register specifically designed for vector operations (4 floats of 32 bits each make 128 bits), then shuffles them into two registers. Both contain the original complex numbers, but in a redundant manner; the variable names *aabb* and *cdcd* indicate how they are stored. Of course this shuffling is unnecessary if you can load the values into the registers this way in the first place.

The key step is the multiplication, for which SSE provides a single instruction that performs elementwise multiplication. The remaining addition and subtraction are done by hand, which is pretty ugly, but SIMD operations were not designed with horizontal (within-register) operations in mind.

```
#include <x86intrin.h>

int main() {
    float a = 1, b = 2, c = 3, d = 4;
    // Load the values into a 128 bit register;
    // _mm_set_ps accepts the parameters in reverse order
    __m128 data = _mm_set_ps(d, c, b, a);
    // See https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_shuffle_ps
    __m128 aabb = _mm_shuffle_ps(data, data, 0x50);
    __m128 cdcd = _mm_shuffle_ps(data, data, 0xEE);
    // The multiplication is done as a single operation
    __m128 mulres = _mm_mul_ps(aabb, cdcd);
    float result[4];
    // Retrieve the results
    _mm_storeu_ps(result, mulres);
    float re = result[0] - result[3];
    float im = result[1] + result[2];
}
```

We can eliminate these manual operations if we use the ADDSUBPS instruction, which subtracts and adds elements in an alternating manner along the vector. To fully utilize the 128-bit registers, we can process two complex multiplications instead of just one. This requires an additional SIMD multiplication, but hey, this tradeoff seems nice:

```
#include <x86intrin.h>

int main() {
    float a1 = 1, b1 = 2, c1 = 3, d1 = 4;
    float a2 = 5, b2 = 6, c2 = 7, d2 = 8;
    __m128 abab = _mm_set_ps(b2, a2, b1, a1);
    __m128 cdcd = _mm_set_ps(d2, c2, d1, c1);
    // Select and duplicate the c's
    __m128 cccc = _mm_moveldup_ps(cdcd);
    // Same for the d's
    __m128 dddd = _mm_movehdup_ps(cdcd);
    cccc = _mm_mul_ps(abab, cccc);
    __m128 baba = _mm_shuffle_ps(abab, abab, 0xB1);
    dddd = _mm_mul_ps(baba, dddd);
    // Subtract and add in an alternating way
    __m128 res = _mm_addsub_ps(cccc, dddd);
}
```

And finally, we can simply use the raw parallel capability of the registers and treat them as if they were regular floats, processing four complex multiplications at the same time the naive way:

```
#include <x86intrin.h>

int main() {
    __m128 a = _mm_set_ps(1, 2, 3, 4);
    __m128 b = _mm_set_ps(5, 6, 7, 8);
    __m128 c = _mm_set_ps(9, 10, 11, 12);
    __m128 d = _mm_set_ps(13, 14, 15, 16);
    __m128 m1 = _mm_mul_ps(a, c);
    __m128 m2 = _mm_mul_ps(b, d);
    __m128 m3 = _mm_mul_ps(a, d);
    __m128 m4 = _mm_mul_ps(b, c);
    __m128 res_real = _mm_sub_ps(m1, m2);
    __m128 res_imag = _mm_add_ps(m3, m4);
}
```

I then benchmarked them on my Intel Core i5-4590S for 100 million iterations, and the results were somewhat interesting:

Method | 100M iterations | 400M numbers |
---|---|---|
Manual add | 1.642s | 6.502s |
Two at the same time | 1.912s | 3.845s |
Four at the same time | 2.377s | 2.377s |

From the first two columns, the first method seems the fastest. BUT if we factor in that the last method processed four times as many numbers as the first one, it paints a much different picture. It seems no clever trickery can accelerate the process beyond this. I also tried Gauss' method and the FMA instruction set for even fewer instructions, but they yielded the same results as the naive method.

## Endnote

This introduction is far from complete; there are a lot of topics not covered here, complex numeral systems for example. You can also play around with these methods, and who knows, maybe you'll find an even more optimized complex multiplication or division method.

## Sources

- Aleksandr Cariow, An algorithm for dividing two complex numbers (arXiv)
- Intel® 64 and IA-32 Architectures Optimization Reference Manual