Discrete and Fast Fourier transforms Part 2: FFT

In the previous part, we showed how we could build signals from sinusoids, and how to analyse built signals to get their components back. Inverting the matrix with Gaussian elimination turned out to be really slow, but we could avoid that using the easily invertible DFT matrix. Much faster, right? Multiplying an n by n matrix with an n-long vector is still an O(n^2) operation. With a 1 second long signal and a sample rate of 44100 Hz, that means 1 944 810 000, almost two billion steps.
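As a reminder, this is all that matrix-vector product does; a minimal sketch in C (the function name dft_naive is mine, and it assumes the samples already sit in a double complex array):

#include <complex.h>
#include <math.h>

#define PI 3.14159265358979323846

// Naive DFT: each output is the dot product of one row of the DFT matrix
// with the input, n*n complex multiplications in total
void dft_naive(const double complex in[], double complex out[], int n) {
	for (int k = 0; k < n; k++) {
		out[k] = 0;
		for (int j = 0; j < n; j++)
			out[k] += in[j] * cexp(-I * 2 * PI * j * k / n);
	}
}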

Going faster

Take a look at the DFT matrix defined before. Now, instead of the powers of W, the angle in radians is shown for each element.

Complex functions are often plotted with domain coloring, but that is not the case now. From zero to π, the whole range of hues is covered with a bright saturation. From π to 2π (π included, but 2π, a.k.a. 0, excluded) the same range is covered, but with a slightly lower saturation. As a result, points at opposite sides of the circle will share the same hue and differ only in saturation.

With this way of coloring, you might notice some sort of symmetry in the DFT matrices. For example, the even-sized ones are made up of four similarly colored parts. But the symmetry we are going to exploit is a bit more subtle, so we need to cover some elements first.

If we cover or ignore every even column, the remaining elements show a vertical symmetry: the top and the bottom are identical, which means that the submatrix formed by the top five rows is exactly the same as the one formed by the bottom five rows in this size-10 DFT case.

Alternatively, we can cover every odd column instead. In that case the symmetry is a bit weaker: the elements at the top and at the bottom are shown with colors from opposite sides of the color wheel, bright colors turn into dull ones and vice versa. You might also say that the difference of their angles is π.

When we apply the Discrete Fourier Transform to a signal, we multiply it by the DFT matrix. For this purpose, we can split that matrix into two smaller ones. The original multiplication with a size-6 DFT matrix:

The amplitudes were marked with C, so we can use A and B for the two pieces after splitting (C is the sum of A and B):

What did we gain from that? Nothing yet: matrix multiplication is just regular multiplication and summation, and we could have done this splitting for any matrix we wanted to. The gains come now.
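Written out as a formula, here is the even/odd split that the pictures encode, with W = e^{-2πi/n} as before; E_k and O_k are my shorthand for the outputs of the two half-size DFTs over the even- and odd-indexed samples:

X_k = \sum_{j=0}^{n-1} x_j W^{jk}
    = \sum_{m=0}^{n/2-1} x_{2m} W^{2mk} + W^k \sum_{m=0}^{n/2-1} x_{2m+1} W^{2mk}
    = E_k + W^k O_k

X_{k+n/2} = E_k - W^k O_k \qquad \text{(since } W^{n/2} = -1 \text{ and } W^n = 1\text{)}

This pair of identities is the whole trick: two half-size DFTs, plus one multiplication and one addition or subtraction per output.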

How much speed did we gain from all of that? A regular DFT of size 10 does a matrix-vector multiplication, which takes a hundred (complex!) multiplications and 90 additions. This brand new method does two size-5 DFTs (2*25 multiplications and 2*20 additions) and, to combine the partial results, five multiplications with the W^k factors, five additions and five subtractions (which cost the same as additions). We roughly halved our computations.

Can we go further? Not really. Could we go further? I deliberately chose size-6 and size-10 DFTs as our examples, since they are divisible by two, but only once. In case we have a DFT with a size divisible by four, nothing prevents us from doing the same speedup again, and so on with 8 or 16. This is called the radix-2 FFT, and as the name suggests, it works well with sizes that are powers of 2.

Radix two FFT

So we finally reached the Fast Fourier Transform. On a more complex example with more than one speedup possibility, this is how it looks (FFT 8):

The colors indicate the various DFTs shown earlier. This structure is called the butterfly diagram due to the resemblance of these X-like structures to butterflies. There are two things worth mentioning here.

  1. The colored lines represent vectors, except in the leftmost column. If you have ever seen a butterfly diagram for a size-8 FFT, you probably noticed that it had way more crossing lines. We can expand our vectors into regular lines, but in my opinion that does not look much cleaner.
  2. The input is scrambled at the beginning. At the first (or last, depending on your viewpoint; the one where we combined two size-4 FFTs into one size-8) operation, we split our input data into odd and even parts. Then we split again, at which point we had groups of numbers of the form 4k, 4k+2, 4k+1 and 4k+3. Not a nice order, but what is worse, at the end our input is completely scrambled, seemingly without any structure: 0, 4, 2, 6, 1, 5, 3, 7.

This is the butterfly diagram if we expand all vectors. Now we have to decide how to deal with this unnatural order of inputs.

  1. We don't care. If we use an out-of-place algorithm, meaning we don't let the output overwrite the input, then we can just allocate a new buffer for the frequency components and access our input in whatever order we want.
  2. But sometimes we are constrained by limited memory, and having an algorithm which can work in place is always a good thing. We can find some efficient way to scramble our input before applying the FFT itself.
  3. Lastly, we can leave our input unscrambled but still use an in-place algorithm, in which case the output will be scrambled. Maybe that's not a problem for you. This is called decimation in frequency, whereas the earlier method is named decimation in time.

The third option is the easiest; you can compare the differences on the butterfly diagrams.

This scrambling is a permutation, and we could use some permutation matrix, but that would be a terrible idea. We worked so hard to eliminate matrix multiplication, and now, this close to the end, we would introduce a new one... no. It turns out this permutation is not random: it is the bit-reversed order of the indices. This might come as a surprise, but take a closer look at the table below and see how each bit corresponds to one round of the even/odd splitting.

Number   Bits   Rev. bits   New value
0        000    000         0
1        001    100         4
2        010    010         2
3        011    110         6
4        100    001         1
5        101    101         5
6        110    011         3
7        111    111         7

How can we apply this? The FFT is a widely used algorithm, and many solutions have been developed over time, ranging from naïve bit reversal through bitwise hacks and precomputed tables to memory that can be addressed by bit-reversed addresses directly. In most applications the size of the input and output is a fixed constant, and a lot of these techniques depend on this (for example, a single-byte bit-reversal table only works with FFTs of size 256).
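As a taste of the table-based approach, here is a minimal sketch of such a 256-entry table; the construction below is my own simple version (real implementations often precompute it at compile time), and the fft() example that follows sticks to the naive per-index method:

#include <stdint.h>

// rev8[i] is i with its 8 bits reversed, so it only covers size-256 FFTs
static uint8_t rev8[256];

static void init_rev8(void) {
	for (int i = 0; i < 256; i++) {
		uint8_t r = 0;
		for (int b = 0; b < 8; b++) // collect the bits of i in reverse order
			r = (r << 1) | ((i >> b) & 1);
		rev8[i] = r;
	}
}

After init_rev8() runs once, the scrambled position of index i is simply rev8[i].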

So here is an example in C to demonstrate all of this in place. This snippet uses the built-in complex numbers of C (complex.h). Also note how the actual FFT code (in fft_req()) is only about a dozen lines; the rest is reordering the input in bit-reversed order. You may find more examples of this at rosettacode.org.

#include <complex.h>
#include <math.h>

#define PI 3.14159265358979323846

void fft_req(double complex array[], int n) {
	if (n == 1) return;
	
	fft_req(array, n/2);
	fft_req(array + n/2, n/2); // n/2 offset
	
	// Combine the two half-size results with the twiddle factors
	for (int i = 0; i < n/2; i++) {
		double complex c1 = array[i];
		double complex c2 = array[i + n/2] * cexp(-I * PI * 2 * i / n);
		
		array[i] = c1 + c2;
		array[i + n/2] = c1 - c2;
	}
}

void fft(double complex array[], int n) { // n is the length of the array, a power of two
	int bits = 0; // log2(n)
	for (int n_ = n; n_ != 1; n_ >>= 1)
		bits++;
	
	// Naive bit reversal permutation
	for (int i = 0; i < n; i++) {
		int rev_i = i; // i, but with its bits reversed
		int s = bits - 1;
		
		for (int v = i >> 1; v != 0; v >>= 1) {
			rev_i <<= 1;
			rev_i |= v & 1;
			s--;
		}
		rev_i <<= s;
		rev_i &= n - 1;
		
		// Only swap when needed
		if (i < rev_i) {
			double complex temp = array[i];
			array[i] = array[rev_i];
			array[rev_i] = temp;
		}
	}
	
	fft_req(array, n);
}
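A quick usage sketch, appended to the snippet above; the sample values are made up for illustration:

#include <stdio.h>

int main(void) {
	double complex signal[8] = {1, 2, 3, 4, 4, 3, 2, 1}; // arbitrary real samples
	
	fft(signal, 8); // transforms in place
	
	for (int k = 0; k < 8; k++) // print the complex frequency components
		printf("X[%d] = %f%+fi\n", k, creal(signal[k]), cimag(signal[k]));
	return 0;
}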

This algorithm uses a divide-and-conquer approach in a recursive manner. If you are familiar with sorting algorithms, this technique probably rings a bell: both quicksort and the FFT halve the array and reduce the problem to two smaller ones. They share a similar complexity as well: the FFT is an O(n log n) operation, where log indicates the base-2 logarithm. That means our original problem with 44100 samples could be solved in roughly 680 396 steps. Well, that's feasible... or it would be, if only 44100 happened to be a power of 2, which it sadly isn't.

The Cooley-Tukey FFT

The question now is: how can we accelerate the Discrete Fourier Transform for sizes that are not powers of two? The most trivial answer is to pad it: add a zero point to the left or to the right of a length-7 sequence. As you can see, the interpolated function - the one that we obtained from the Fourier coefficients - doesn't really change much. Also note that because of the periodicity, adding the padding zero to the beginning or to the end differs only by a circular shift, so the magnitudes of the coefficients come out the same.

Of course, that's not a complete and perfect solution. For seven points it is good enough, but for nine? Should we pad with seven zeros to reach sixteen? Or discard a sample to get down to eight? If we could split our even-sized data into two parts, why can't we use the same principle and divide the samples by three or five or seven? Well, the algorithm named after James Cooley and John Tukey does exactly this. Actually, the radix-2 method is just a special case of it, but by far the most popular one; if you look up the Cooley-Tukey algorithm on the internet, you will find dozens of the butterfly diagrams shown earlier.

The product of the two smallest primes which are not two is 15, so we are going to use this size as our example. As you can see below, the DFT matrix of that size is getting pretty large - no trivial symmetries in sight, but the coloring suggests some kind of structure. To construct an efficient algorithm, we are going to do the same four steps as in the radix-2 case.

Only one thing remains to understand the whole Cooley-Tukey algorithm. It is only a matter of notation, no speedup can be gained from it, but we can apply the twiddle factors in a more sophisticated manner: in the form of 5 by 5 DFTs. After doing all the DFT3s in the previous image but before applying the twiddle factors, our columns look like this:

We can make the following observation: if we take every third row starting from the beginning (in natural order: the first, fourth, seventh, tenth and thirteenth) and look at their corresponding twiddle factors, they form a perfect 5 by 5 DFT matrix. We can generalize this further for rows 3k+1 and 3k+2, with additional twiddle factors multiplied elementwise into the resulting length-5 vector. The twiddle factors of the twiddle factors...

So we did DFT3s 5 times and DFT5s 3 times. Even if we had used a different splitting at the beginning (3 instead of 5), we would have gotten the same results. In the general case this algorithm requires the prime factorization of the size; for example, for a sample of size 60, we would do splits of 2, 2, 3 and 5 (in any order).
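To make this concrete, here is a minimal recursive sketch of the mixed-radix idea; the name fft_mixed and the structure (split by the smallest prime factor, naive DFT fallback for prime sizes, a scratch buffer instead of clever in-place tricks) are my choices for illustration, not an optimized implementation:

#include <complex.h>
#include <math.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

void fft_mixed(double complex array[], int n) {
	if (n == 1) return;
	
	// Find the smallest prime factor p of n
	int p = 2;
	while (n % p != 0) p++;
	
	int m = n / p;
	double complex *scratch = malloc(n * sizeof *scratch);
	
	if (p == n) {
		// n is prime: fall back to the naive O(n^2) DFT
		for (int k = 0; k < n; k++) {
			scratch[k] = 0;
			for (int j = 0; j < n; j++)
				scratch[k] += array[j] * cexp(-I * 2 * PI * j * k / n);
		}
		for (int k = 0; k < n; k++) array[k] = scratch[k];
		free(scratch);
		return;
	}
	
	// Decimation in time: group the samples by index mod p,
	// then transform each length-m group recursively
	for (int r = 0; r < p; r++) {
		for (int t = 0; t < m; t++)
			scratch[r*m + t] = array[p*t + r];
		fft_mixed(scratch + r*m, m);
	}
	
	// Combine with the twiddle factors:
	// X[k] = sum over r of W_n^(r*k) * F_r[k mod m]
	for (int k = 0; k < n; k++) {
		array[k] = 0;
		for (int r = 0; r < p; r++)
			array[k] += scratch[r*m + k % m] * cexp(-I * 2 * PI * r * k / n);
	}
	free(scratch);
}

For a power of two this degenerates into the radix-2 recursion shown earlier, only without the in-place bit-reversal trick; for a prime size it is no faster than the plain DFT, which is why real libraries add further techniques (such as Rader's or Bluestein's algorithm) for that case.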

Summary

We covered discrete and fast Fourier transforms and the most basic algorithms, but there is still room to improve. Depending on your use case, you can look into how these algorithms work on special sizes, how they can use the cache optimally and avoid misses, how to reduce the number of required multiplications further, how to implement all of this in hardware with high efficiency, how to run them in a highly parallelised environment and so on. For now, only a few not strictly related miscellaneous topics remain.

Miscellaneous topics

ifft

With regular DFTs, applying the inverse transformation was as easy as the forward one: a matrix multiplication. With these accelerations, can we still do the inverse FFT in O(n log n) time? The answer is simply yes; only the twiddle factors need to be modified. In the radix-2 case this is as simple as changing this line from this:

double complex c2 = array[i + n/2] * cexp(-I * PI * 2 * i / n);

to this (no minus sign):

double complex c2 = array[i + n/2] * cexp(I * PI * 2 * i / n);

Also, since we said in the last part that we haven't done any normalization in the forward transform, we have to do it now: divide the result by the length of the sample array!
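Putting the pieces together: instead of maintaining a second, sign-flipped copy of the recursion, you can also reuse fft() unchanged through the well-known conjugation identity IDFT(x) = conj(DFT(conj(x))) / n. A minimal sketch building on the earlier C snippet (the name ifft is mine):

void ifft(double complex array[], int n) {
	// Conjugate the input, run the unmodified forward FFT...
	for (int i = 0; i < n; i++)
		array[i] = conj(array[i]);
	
	fft(array, n);
	
	// ...then conjugate again and divide by the length to normalize
	for (int i = 0; i < n; i++)
		array[i] = conj(array[i]) / n;
}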

A concrete example

You saw that all the Discrete Fourier Transform does is transform an input array into an output array. What if you have a sound file and you want to work with its spectrum? If the wav file has 56000 samples and a sample rate of 8000 Hz, what frequency will the output array's 18457th element correspond to?

Pretty simple, actually. If the sample rate were 1, it would be 18456 / 56000 = 0.330 Hz; note how I subtracted one from 18457, because the indexing starts from zero. If the sample rate is higher, then our measured frequencies scale up as well, so the final answer is 18456 / 56000 * 8000 ≈ 2637 Hz.
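The same computation as a tiny C helper (my naming, for illustration):

// Frequency in Hz of the k-th element of the output array,
// given n samples recorded at the given sample rate
double bin_frequency(int k, int n, double sample_rate) {
	return k * sample_rate / n;
}

// bin_frequency(18456, 56000, 8000) returns 2636.57..., which rounds to 2637 Hz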

For a Python example, let's suppose you have a wav file:

From freesounds.org; made by night_soundscapes; licensed under CC BY-NC 3.0. source

And the full code example is below. We can use numpy's fftfreq() function to generate all the frequencies, including the redundant ones, and rfftfreq() to do the same for the first half only. For real-valued input, the second half of the spectrum mirrors the first (it holds the complex conjugates), so rfft() simply doesn't compute it.

import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt

sample_rate, data = wavfile.read(
	"321232__night-soundscapes__horror-train-effect.wav")

# Convert it to mono by averaging left and right channels
data = data.sum(axis=1) / 2

spectrum = np.fft.rfft(data)
frequencies = np.fft.rfftfreq(len(data), 1/sample_rate)  # Time between samples

# Plot the spectrum finally
plt.plot(frequencies, np.abs(spectrum))
plt.show()

And on the plotted spectrum you can see the frequencies, ranging from 0 up to half the sample rate (around 22 kHz):

Relation to regular Fourier transforms

At the very beginning of the first part I said that this article is an introduction for those who have no knowledge of regular Fourier transforms, but chances are high that you have already seen one, or at least its definition (maybe with a slightly different notation).
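For reference, it usually looks something like this (this is one common convention; the sign and the placement of the 2π factor vary between fields):

\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt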

Some parts seem familiar, for example the exponential, even though I haven't mentioned integrals at all yet. What happens if we expand our DFT matrix to infinity? Note that this is not a precise operation; its only purpose is visual intuition.

As you can see, we could grow our original DFT matrix in two directions: one resulted in the Discrete-Time Fourier Transform, while with the other we got the coefficients of the Fourier series, and when we combined them, we finally got the Fourier transform. Intuitively the summations turned into integrals, but there are lots of unexplained details here. How do we stretch a matrix to infinity? From 0 to ∞? From -∞ to ∞? Between two given numbers but with infinite granularity? These concepts also have deeper connections, like periodicity, which we haven't talked about, so the image above should only serve as a visual, intuitive helper.

Sources

  1. Allen B. Downey: Think DSP; Green Tea Press (link)
  2. Ronald N. Bracewell: The Fourier Transform and its Applications third edition
  3. katjaas.nl
  4. ECSE-4530 Lectures by Rich Radke
  5. Gauss and the history of the fast Fourier transform