I’m currently studying fast numeric code at university, where I’ve been confronted with the problem of optimizing the Walsh-Hadamard transform using SIMD intrinsics. To be honest, I don’t know a lot about the maths behind it, but apparently it has a lot of applications, specifically in cryptography, signal processing and quantum computing.
Since it was so much fun to apply my fresh knowledge about intrinsics to a real-world problem, I thought it might be worth sharing this experience here on my blog. Be aware though that I’m not an intrinsics guru, which is why there is certainly more optimized code available.^{1}
If you have never heard of intrinsics, you should definitely read this article that shows how much performance can be gained in various programming languages when the right instructions are used.
Also note that in the following, we will measure performance in flops per cycle. That is, given a program, we count how many floating-point operations are needed to run the program, and measure how many cycles the program took to complete. Luckily for me, this whole environment was provided by my university, but in theory it could be done locally, too. You’ll just have to find a way to accurately measure the cycle count, which is a bit more tricky than it sounds.
For the expert readers, the measured program will be compiled using GCC 8.3.1 with the flags `-O3 -fno-tree-vectorize -mavx2 -mfma`.
The processor is an Intel Xeon Silver 4210.
Even though Intel® AVX-512 instructions are supported on this chip, we’ll be limiting ourselves to Intel® AVX2.
Understanding the problem
Alright, so what is the task? We’re given a vector \(x\) with eight doubles, and want to transform it using the Walsh-Hadamard matrix \(H_8\). \[ H_8 \cdot x = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} \]
We can see right away that there sure is some pattern to the signs of the matrix. This is no coincidence. The Walsh-Hadamard transform can be defined as \[ H_m = \begin{bmatrix} H_{m-1} & H_{m-1} \\ H_{m-1} & -H_{m-1} \end{bmatrix} \] where \(H_0 = 1\) and \(m > 0\). This means the matrix will always be square, and all four quarters are identical, except that the lower right quarter is negated. Note that with this indexing, the 8x8 matrix above, which this post calls \(H_8\) after its size, corresponds to \(m = 3\). Of course, we will exploit this property later in our code.
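To make the recursion concrete, here is a minimal sketch in C that builds the matrix by repeated doubling. The function name `build_hadamard` and the row-major `int` layout are assumptions made for this illustration; they are not taken from the assignment code.

```c
#include <assert.h>
#include <stddef.h>

/* Fill h (row-major, n x n with n = 2^m) with the Walsh-Hadamard matrix.
 * Hypothetical helper for illustration: it applies the recursion
 * [[H, H], [H, -H]] starting from the 1x1 matrix (1). */
void build_hadamard(int *h, int m)
{
    assert(h && m >= 0);
    size_t n = (size_t)1 << m;
    h[0] = 1; /* base case */
    for (size_t k = 1; k < n; k *= 2) {
        /* The top-left k x k block is complete; copy it into the other quarters. */
        for (size_t i = 0; i < k; i++) {
            for (size_t j = 0; j < k; j++) {
                int v = h[i * n + j];
                h[i * n + (j + k)] = v;        /* upper right quarter */
                h[(i + k) * n + j] = v;        /* lower left quarter */
                h[(i + k) * n + (j + k)] = -v; /* lower right quarter, negated */
            }
        }
    }
}
```

Calling `build_hadamard(h, 3)` on an `int h[64]` buffer produces the 8x8 matrix used throughout this post.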
So, let’s try to write a first implementation without any intrinsics at all.
First off, let me define exactly what the input `A` and the expected output `C` are.
We want to apply the transform to all column vectors of `A`, where `A` is stored in column-major order, which means the columns are contiguous in memory.
This means that with `A[0]` we can access the first column, which consists of 8 doubles.
`C` is a matrix of the same dimensions as `A`, and each of its columns is the transform of the corresponding column of `A`.
In essence, we are thereby calculating \(C = H_8 \cdot A\).
The following defines help us access the matrix sizes in code.


Without intrinsics, we only need a simple loop that iterates over all columns and applies the transform like so.
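Since only the idea matters here, the following is a minimal sketch of what such a scalar loop could look like, not necessarily the exact code that was measured. The function name `wht8_scalar`, the macro `ROWS`, and the signature are assumptions; the eight sums are simply the rows of \(H_8\) written out term by term.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 8 /* assumed name: number of doubles per column */

/* Scalar reference: C = H_8 * A for column-major A and C with `cols` columns.
 * Not safe for in-place use, because c[0] is written before a[0] is last read. */
void wht8_scalar(const double *A, double *C, size_t cols)
{
    assert(A != C);
    for (size_t i = 0; i < cols; i++) {
        const double *a = A + i * ROWS;
        double *c = C + i * ROWS;
        c[0] = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
        c[1] = a[0] - a[1] + a[2] - a[3] + a[4] - a[5] + a[6] - a[7];
        c[2] = a[0] + a[1] - a[2] - a[3] + a[4] + a[5] - a[6] - a[7];
        c[3] = a[0] - a[1] - a[2] + a[3] + a[4] - a[5] - a[6] + a[7];
        c[4] = a[0] + a[1] + a[2] + a[3] - a[4] - a[5] - a[6] - a[7];
        c[5] = a[0] - a[1] + a[2] - a[3] - a[4] + a[5] - a[6] + a[7];
        c[6] = a[0] + a[1] - a[2] - a[3] - a[4] - a[5] + a[6] + a[7];
        c[7] = a[0] - a[1] - a[2] + a[3] - a[4] + a[5] + a[6] - a[7];
    }
}
```

Written this way, each column costs 8 · 7 = 56 additions and subtractions.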


For this code, the performance-measuring black box outputs 0.6 flops per cycle. What does this mean? Roughly speaking, it means that on average, that many floating-point operations complete every cycle.
For comparison, the processor on which the measurement was made has a throughput of 2 flops per cycle for the vectorized addition, `vaddsd`.
You can check this yourself in the instruction tables maintained by Agner Fog, which are an excellent resource for optimizing code at the assembly level.
We know the CPU would be able to process two floating-point additions per cycle, but our code makes only partial use of that capability.
Now why do I compare the result to a vectorized addition, even though we are passing the `-fno-tree-vectorize` flag to GCC?
If you write a simple program that adds two doubles and compile it with the flag, you can see that the `vaddsd` instruction is indeed used in the assembly.
The suffix `sd` here means that it’s a scalar double-precision operation, so only one double addition is being performed.
I guess GCC does not count using scalar SIMD instructions as vectorization, which makes sense.
Introducing vector intrinsics
So what can we do to improve? If you look at the loop above, you see that in each row, we want to sum all components of the row. The only difference is that different signs are being used.
But we’ve already heard that there’s a pattern somewhere.
So if you look closely at the quarters on the left of the transform matrix, they appear to be exactly the same.
Comparing them with the loop in `wht_composed_novectors()`, we can clearly see that the operands are also the same.
Hence, we could calculate the results of the sums in the upper left quarter, and use these for the calculations in the lower left quarter.
As for the right quarters, things are very similar. The only difference is the signs. So if we calculate the sums in the upper right quarter, we can negate the results and use them for the sums in the lower right quarter.
In total, we can now save calculating the upper four values of `c`, and replace it with a single subtraction!
Let’s suppose for a moment that we found a simple way to calculate the lower four values. Then our source code could look like this.


`aamm` and `amam` are constants that we’ll use later on.
What’s important is the overall concept on the higher level.
First, we load the lower and the higher four doubles of our input column.
Next, we apply the `wht4x4()` function, which returns the four signed sums of the doubles in each of the two loaded vectors.
Finally, using these sums, we can easily calculate all entries of the output vector using a single addition and a single subtraction.
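The high-level idea can be checked without any intrinsics. The sketch below is a scalar model of it, with hypothetical names `wht4_scalar` and `wht8_from_halves`: it transforms the two halves of a column with the 4-point transform and then combines them with one addition and one subtraction per entry.

```c
#include <assert.h>

/* 4-point transform, written out term by term (scalar stand-in for wht4x4()). */
static void wht4_scalar(const double a[4], double out[4])
{
    out[0] = a[0] + a[1] + a[2] + a[3];
    out[1] = a[0] - a[1] + a[2] - a[3];
    out[2] = a[0] + a[1] - a[2] - a[3];
    out[3] = a[0] - a[1] - a[2] + a[3];
}

/* 8-point transform of one column, built from two 4-point transforms. */
void wht8_from_halves(const double x[8], double c[8])
{
    double lo[4], hi[4];
    assert(x && c);
    wht4_scalar(x, lo);     /* sums over x[0..3] */
    wht4_scalar(x + 4, hi); /* sums over x[4..7] */
    for (int k = 0; k < 4; k++) {
        c[k] = lo[k] + hi[k];     /* left and right quarters have equal signs */
        c[k + 4] = lo[k] - hi[k]; /* lower right quarter is negated */
    }
}
```

In this model a column needs 2 · 12 + 8 = 32 scalar operations instead of the 56 needed when every row is summed independently, and the inner loop maps directly onto one vector addition and one vector subtraction.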
Now, how does `wht4x4()` work?
I already knew what I was looking for at this point, but it was quite tricky to choose the right instructions.
When browsing the Internet I found this answer on StackOverflow, which did not calculate the correct results.
However, after some on-paper debugging I fixed the errors and was left with the following piece of code.


Let’s approach this in a more abstract way.
We have a tuple \(a = (a_1, a_0, a_3, a_2)\) as input, and want to calculate \[a' = (a_0 + a_1 + a_2 + a_3, a_0 - a_1 + a_2 - a_3, a_0 + a_1 - a_2 - a_3, a_0 - a_1 - a_2 + a_3).\]
As a reminder, this is what is needed to calculate the upper left quarter in the loop of the simple implementation.
`wht4x4()` calculates \(a'\) in four steps.
First, we perform a permutation, where we swap the two entries within the first half of the vector and, separately, the two entries within the second half. This is needed because we need to somehow “move” an \(a_1\) to the beginning of the vector, and an \(a_0\) to the second entry of the vector. The same holds for \(a_2\) and \(a_3\) respectively.
In the second step, two arithmetic operations are performed in a single FMA instruction. FMA performs a multiplication followed by an addition in a single instruction. It costs only about as much as a multiplication on its own, so it can give really good performance boosts.
What does this FMA do then?
It takes the vector we got from the first step, multiplies it with `amam`, and adds it to the input vector.
`amam` contains the values \((1, -1, 1, -1)\), so the step results in the vector \[(a_0 + a_1, a_0 - a_1, a_2 + a_3, a_2 - a_3).\]
With that, we’re practically already halfway done with calculating \(a'\).
In the next step, we take the vector that was just computed, and swap the lower half with the upper half, which gives \[(a_2 + a_3, a_2 - a_3, a_0 + a_1, a_0 - a_1).\]
Finally, we use another FMA to combine the two vectors from the previous steps.
This time we make use of the constant `aamm`, which contains the values \((1, 1, -1, -1)\).
And there we have \(a'\).
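The four steps can be sketched with AVX2 and FMA intrinsics as follows. This is an illustrative reconstruction, not the original code: the names `wht4x4_sketch` and `wht4_avx2` are hypothetical, lanes are assumed to be in memory order \((a_0, a_1, a_2, a_3)\), and the constants are built with `_mm256_setr_pd` so that they read in that same order. With this lane order it is the unpermuted vector that gets multiplied by the sign constant in each FMA. The `target` attribute is a GCC/Clang feature that lets the snippet compile even without `-mavx2 -mfma`; it still needs a CPU that supports both.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang only: enables AVX2 and FMA for the annotated functions. */
#define AVX2_FMA __attribute__((target("avx2,fma")))

/* Returns H_4 * a for a = (a0, a1, a2, a3), lanes listed in memory order. */
static AVX2_FMA __m256d wht4x4_sketch(__m256d a, __m256d amam, __m256d aamm)
{
    /* Step 1: swap neighbours inside each 128-bit half: (a1, a0, a3, a2). */
    __m256d pairs = _mm256_permute_pd(a, 0x5);
    /* Step 2: a * (1,-1,1,-1) + pairs = (a0+a1, a0-a1, a2+a3, a2-a3). */
    __m256d b = _mm256_fmadd_pd(a, amam, pairs);
    /* Step 3: swap the two 128-bit halves: (a2+a3, a2-a3, a0+a1, a0-a1). */
    __m256d halves = _mm256_permute2f128_pd(b, b, 0x01);
    /* Step 4: b * (1,1,-1,-1) + halves gives the four signed sums. */
    return _mm256_fmadd_pd(b, aamm, halves);
}

/* Pointer-based wrapper so that callers do not need AVX types. */
AVX2_FMA void wht4_avx2(const double *in, double *out)
{
    const __m256d amam = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
    const __m256d aamm = _mm256_setr_pd(1.0, 1.0, -1.0, -1.0);
    assert(in && out);
    _mm256_storeu_pd(out, wht4x4_sketch(_mm256_loadu_pd(in), amam, aamm));
}
```

The unaligned load and store are a conservative choice for the sketch; aligned variants would require 32-byte aligned buffers.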
Note that the constants `amam` and `aamm` are only defined once in the beginning.
The compiler might optimize this on its own, but you never know.
Further, instead of computing the addresses of the columns `a` and `c` by multiplying with the loop variable `i`, we explicitly increment them by the column size in each iteration.
This is called strength reduction.
Usually, compilers will do these kinds of optimizations on their own, but in complex cases they might not.
This is why I opted to explicitly tell the compiler to use additions here.
So what does our black box say after all these little optimizations? We’ve reached 4.9 flops per cycle. That’s a speedup of over 8x!
Composed or decomposed?
Is there anything left to optimize that’s “easily” approachable?
So far I’ve only told half the story, because in fact I read the assignment the wrong way. We were supposed to calculate the transform in three steps.
The matrix can actually be decomposed so that \(H_8 = T_3 \cdot T_2 \cdot T_1\). In full verbosity, here are the complete matrices for that.
\[ T_1 = \begin{bmatrix} 1 & & & & 1 & & & \\ & 1 & & & & 1 & & \\ & & 1 & & & & 1 & \\ & & & 1 & & & & 1 \\ 1 & & & & -1 & & & \\ & 1 & & & & -1 & & \\ & & 1 & & & & -1 & \\ & & & 1 & & & & -1 \end{bmatrix} \quad T_2 = \begin{bmatrix} 1 & & 1 & & & & & \\ & 1 & & 1 & & & & \\ 1 & & -1 & & & & & \\ & 1 & & -1 & & & & \\ & & & & 1 & & 1 & \\ & & & & & 1 & & 1 \\ & & & & 1 & & -1 & \\ & & & & & 1 & & -1 \end{bmatrix} \quad T_3 = \begin{bmatrix} 1 & 1 & & & & & & \\ 1 & -1 & & & & & & \\ & & 1 & 1 & & & & \\ & & 1 & -1 & & & & \\ & & & & 1 & 1 & & \\ & & & & 1 & -1 & & \\ & & & & & & 1 & 1 \\ & & & & & & 1 & -1 \end{bmatrix} \]
Note that the blank entries are zeros; it’s just a lot more readable this way. If you multiply these matrices by hand as given above, you will get exactly the \(H_8\) which was introduced in the beginning.
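Instead of multiplying by hand, the factorization can also be checked with a few lines of C. The sketch below uses hypothetical helper names and builds each factor as a butterfly stage with stride 4, 2, or 1, multiplies them, and compares the product against the well-known closed form for the entries of this matrix, \((-1)^{\mathrm{popcount}(i \,\&\, j)}\).

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Butterfly stage for a given stride: 4 gives T_1, 2 gives T_2, 1 gives T_3. */
static void stage_matrix(int t[N][N], int stride)
{
    assert(stride == 1 || stride == 2 || stride == 4);
    memset(t, 0, sizeof(int[N][N]));
    for (int i = 0; i < N; i++) {
        int partner = i ^ stride;        /* the other input of this butterfly */
        t[i][partner] = 1;
        t[i][i] = (i & stride) ? -1 : 1; /* "minus" rows have the stride bit set */
    }
}

/* out = a * b for N x N integer matrices. */
static void matmul(int out[N][N], int a[N][N], int b[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int s = 0;
            for (int k = 0; k < N; k++) {
                s += a[i][k] * b[k][j];
            }
            out[i][j] = s;
        }
    }
}

/* Returns 1 if T_3 * T_2 * T_1 matches H_8 entry by entry, 0 otherwise. */
int factorization_holds(void)
{
    int t1[N][N], t2[N][N], t3[N][N], t21[N][N], h[N][N];
    stage_matrix(t1, 4);
    stage_matrix(t2, 2);
    stage_matrix(t3, 1);
    matmul(t21, t2, t1);
    matmul(h, t3, t21);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int bits = i & j, sign = 1;
            while (bits) {        /* flip the sign once per common set bit */
                sign = -sign;
                bits &= bits - 1;
            }
            if (h[i][j] != sign) {
                return 0;
            }
        }
    }
    return 1;
}
```

Because each factor touches a different bit of the row index, the three stages can be applied in any order, which is why the same building blocks show up in both SIMD versions.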
So let’s try to implement this. This time we have three different transformations. Luckily, we’ll encounter the same patterns as in the first SIMD implementation.


Woah, that looks complicated! Except that it’s not! Let me explain what this does.
Again, first we define two constants as before. Then, the loop will increment in steps of four this time, because we will process four rows at once. The reason for this will become clear in just a moment.
In the loop, we first load the four rows from `A`.
On each row, we apply the three different transformations \(T_1\), \(T_2\), and \(T_3\) in that order.
Afterwards, all four vectors are stored into `C`.
The pattern is similar to our last implementation, except that this time we process more rows in parallel, and the transformation is split into three steps.
Note that in the code above, we need to take into account that the number of rows in our input might not be divisible by four. I omitted the code for handling this in the snippet, but I’ll reference the full code at the end of the post.
Three missing pieces
Now, the only thing left to understand is how each transformation is computed. Let’s start with the first transformation.
\(T_1\) is relatively simple.
If you’re not that familiar with matrix multiplication, here is what the transformation does for \(x \in \mathbb{R}^8\).
\[
T_1 \cdot x =
\begin{bmatrix}
x_0 + x_4 \\
x_1 + x_5 \\
x_2 + x_6 \\
x_3 + x_7 \\
x_0 - x_4 \\
x_1 - x_5 \\
x_2 - x_6 \\
x_3 - x_7
\end{bmatrix}
\]
More informally speaking, the lower half of the output is just the higher four doubles added to the lower ones, and the upper half is the higher four doubles subtracted from the lower ones.
This is exactly what `_mm256_add_pd()` and `_mm256_sub_pd()` do!
For both functions I introduced a wrapper with an appropriate name.
Since the functions will be inlined, the performance is not affected by this.


Next up is the second transformation. Again, let me show you what we need to calculate. \[ T_2 \cdot x = \begin{bmatrix} x_0 + x_2 \\ x_1 + x_3 \\ x_0 - x_2 \\ x_1 - x_3 \\ x_4 + x_6 \\ x_5 + x_7 \\ x_4 - x_6 \\ x_5 - x_7 \end{bmatrix} \]
Do you see how the first half of that vector is arithmetically independent from the second half, and their arithmetic patterns are the same? This means we can write a single function that operates only on one half.
We can use the same permutation that we used to swap the halves of the vector in the first SIMD implementation.
This is then followed by the very same FMA instruction we also used before.
In fact, this is functionally half of the `wht4x4()` function we discussed above.


Lastly, with the third transformation it is kind of the same deal.
This is just the other half of the `wht4x4()` function.
For completeness, here is the calculation in vector form.
\[
T_3 \cdot x =
\begin{bmatrix}
x_0 + x_1 \\
x_0 - x_1 \\
x_2 + x_3 \\
x_2 - x_3 \\
x_4 + x_5 \\
x_4 - x_5 \\
x_6 + x_7 \\
x_6 - x_7
\end{bmatrix}
\]
Again, both halves can be calculated independently.
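Putting the three pieces together, a decomposed transform could look roughly like the sketch below. It is an illustrative reconstruction under several assumptions: the names `t2_half`, `t3_half`, and `wht8_decomposed_sketch` are hypothetical, lanes are taken in memory order, and for brevity it handles one column (two vectors) per iteration rather than four vectors at once. The `target` attribute is a GCC/Clang feature that stands in for `-mavx2 -mfma`.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* GCC/Clang only: enables AVX2 and FMA for the annotated functions. */
#define AVX2_FMA __attribute__((target("avx2,fma")))

/* T_2 on one half: (x0+x2, x1+x3, x0-x2, x1-x3). */
static AVX2_FMA __m256d t2_half(__m256d x, __m256d aamm)
{
    __m256d halves = _mm256_permute2f128_pd(x, x, 0x01); /* (x2, x3, x0, x1) */
    return _mm256_fmadd_pd(x, aamm, halves);
}

/* T_3 on one half: (x0+x1, x0-x1, x2+x3, x2-x3). */
static AVX2_FMA __m256d t3_half(__m256d x, __m256d amam)
{
    __m256d pairs = _mm256_permute_pd(x, 0x5); /* (x1, x0, x3, x2) */
    return _mm256_fmadd_pd(x, amam, pairs);
}

/* C = T_3 * T_2 * T_1 * A for column-major A and C with `cols` columns. */
AVX2_FMA void wht8_decomposed_sketch(const double *A, double *C, size_t cols)
{
    const __m256d amam = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
    const __m256d aamm = _mm256_setr_pd(1.0, 1.0, -1.0, -1.0);
    assert(A && C);
    /* Pointers advance by one column per iteration (no index multiplication). */
    for (size_t i = 0; i < cols; i++, A += 8, C += 8) {
        __m256d lo = _mm256_loadu_pd(A);
        __m256d hi = _mm256_loadu_pd(A + 4);
        __m256d sum = _mm256_add_pd(lo, hi);  /* T_1, rows 0..3 */
        __m256d diff = _mm256_sub_pd(lo, hi); /* T_1, rows 4..7 */
        sum = t3_half(t2_half(sum, aamm), amam);
        diff = t3_half(t2_half(diff, aamm), amam);
        _mm256_storeu_pd(C, sum);
        _mm256_storeu_pd(C + 4, diff);
    }
}
```

Processing several columns per iteration, as described above, gives the CPU more independent FMA chains to overlap; this sketch leaves that out to keep the three stages easy to follow.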


So did this pay off? Oh yes, it did! With a performance of 5.6 flops per cycle, we achieved a speedup of over 9x compared to the simple version from the beginning.
Conclusion
This is where I left it. I’m pretty sure there are plenty of things one could have done better, but I was satisfied with the performance. You can find the complete source code on GitHub as a Gist.
For a nice overview, I visualized the difference in performance for all three approaches we discussed. I’ve also added a fourth measurement where I decomposed the vectors, but didn’t apply any vectorization. Interestingly, it gives a nice speedup as well.