Gray Codes Addition — Harold Lucal's Addition
Disclaimer: this article is far from a theoretical one; it is more of an optimization story, mostly taken from the corresponding Q&A on Code Review that I wrote a while ago.
Enough with the trivial and simple operations: here comes addition. Apparently, adding two Gray codes without converting the operands to two's complement integers, adding those integers and converting the result back to a Gray code is a problem that hasn't been researched much, probably because of the efficiency of the \(to\_gray\) and \(from\_gray\) operations, which allow us to perform the addition in \(O(\log{n})\) time on usual architectures.
In this article, we present and optimize an addition algorithm described by R. W. Doran in The Gray Code and attributed to Harold Lucal. The paper gives the algorithm as pseudocode; rather than quoting it verbatim, here are the per-bit relations it boils down to, using the notation of the rest of this article (\(A\) and \(B\) are the Gray code operands, \(S\) the Gray code of the sum, \(E\) and \(F\) two parity-like state bits initialized to the parities of \(A\) and \(B\)):
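\[ S_i = A_i \oplus B_i \oplus (E_i \land F_i) \]
\[ E_{i+1} = (E_i \land \lnot F_i) \oplus A_i \]
\[ F_{i+1} = (\lnot E_i \land F_i) \oplus B_i \]

The bits are processed from the least significant one upwards.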
This algorithm runs in \(O(n)\) time, where \(n\) is the number of bits in the Gray code. Here is a naive C++ translation of the algorithm above, implemented using the cpp-gray library:
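The translation in the article relies on the cpp-gray `gray_code` class template; the sketch below is a close equivalent that works on plain `unsigned` values whose bits are interpreted as a Gray code, with a hand-rolled `parity` helper standing in for whatever the library uses to initialize the two parity bits:

```cpp
#include <limits>

// Parity of a value: true if the number of set bits is odd
bool parity(unsigned x)
{
    bool p = false;
    while (x) {
        p = !p;
        x &= x - 1u; // clear the lowest set bit
    }
    return p;
}

// Naive translation of Lucal's addition: lhs and rhs are bit patterns
// interpreted as Gray codes, the result is the Gray code of their sum
unsigned gray_add(unsigned lhs, unsigned rhs)
{
    unsigned res = 0u;
    bool lhs_p = parity(lhs); // E in the pseudocode
    bool rhs_p = parity(rhs); // F in the pseudocode

    for (unsigned i = 0u ; i < std::numeric_limits<unsigned>::digits ; ++i) {
        // ith bits of lhs and rhs
        bool lhs_i = (lhs >> i) & 1u;
        bool rhs_i = (rhs >> i) & 1u;

        // Copies needed because lhs_p and rhs_p are overwritten below
        bool lhs_p_cpy = lhs_p;
        bool rhs_p_cpy = rhs_p;

        // ith bit of the result
        bool res_i = lhs_i ^ rhs_i ^ (lhs_p_cpy & rhs_p_cpy);
        res |= static_cast<unsigned>(res_i) << i;

        // Update the parity-like state bits
        lhs_p = (lhs_p_cpy & !rhs_p_cpy) ^ lhs_i;
        rhs_p = (rhs_p_cpy & !lhs_p_cpy) ^ rhs_i;
    }
    return res;
}
```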
The rest of this article is composed of subsections describing various incremental benchmark-driven optimizations to this algorithm.
Loop-invariant code motion
This is the name of an optimization performed by most compilers: when they detect code that does not actually depend on the state of the loop, they move it out of the loop. The invariant is not as obvious as usual here, but let's have a look at the line that sets the \(i^{th}\) bit of `res`: `lhs_i` and `rhs_i` correspond to the \(i^{th}\) bits of `lhs` and `rhs`. These bits do not depend on the loop (the loop does not change them) and can all be computed at once by performing `lhs ^ rhs` before running the loop. Therefore, we can simplify the function by moving their computation out of the loop. If we store them directly in `res`, we can even simplify the assignment to `res` to a mere `res ^= res_i << i`:
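A sketch of the function after that transformation, with the same plain-`unsigned` conventions as above:

```cpp
unsigned res = lhs ^ rhs; // all the lhs_i ^ rhs_i terms, computed at once
bool lhs_p = parity(lhs);
bool rhs_p = parity(rhs);

for (unsigned i = 0u ; i < std::numeric_limits<unsigned>::digits ; ++i) {
    bool lhs_i = (lhs >> i) & 1u;
    bool rhs_i = (rhs >> i) & 1u;

    bool lhs_p_cpy = lhs_p;
    bool rhs_p_cpy = rhs_p;

    // Only the "carry" part of the sum bit remains to be added
    bool res_i = lhs_p_cpy & rhs_p_cpy;
    res ^= static_cast<unsigned>(res_i) << i;

    lhs_p = (lhs_p_cpy & !rhs_p_cpy) ^ lhs_i;
    rhs_p = (rhs_p_cpy & !lhs_p_cpy) ^ rhs_i;
}
```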
Unfortunately, `lhs_i` and `rhs_i` are also used to compute the new values of `lhs_p` and `rhs_p`, so we can't remove them from the body of the loop. If we are to optimize the algorithm even more, we will need something else, for example…
Getting rid of the temporaries
The part of the Gray addition which is likely to be the slowest is the body of the `for` loop. To find a possible optimization, I decided to check whether it was possible to get rid of `lhs_p_cpy` and `rhs_p_cpy`, which are copies of `lhs_p` and `rhs_p` required to compute the new values of `lhs_p` and `rhs_p`. While it is easy to get rid of `rhs_p_cpy` by replacing it with `rhs_p` directly, getting rid of `lhs_p_cpy` is harder because `lhs_p` has already been updated in the meantime. Without any trickery, the following loop body is the best we can get:
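With the same conventions, such a loop body could look like this:

```cpp
bool lhs_i = (lhs >> i) & 1u;
bool rhs_i = (rhs >> i) & 1u;

// Old value of lhs_p, still needed to update rhs_p afterwards
bool lhs_p_cpy = lhs_p;

bool res_i = lhs_p & rhs_p;
res ^= static_cast<unsigned>(res_i) << i;

lhs_p = (lhs_p & !rhs_p) ^ lhs_i;     // rhs_p has not been updated yet
rhs_p = (rhs_p & !lhs_p_cpy) ^ rhs_i; // needs the old value of lhs_p
```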
Therefore, I decided to build some truth tables. In the following table, \(lhs_{old}\) and \(rhs_{old}\) are possible values of `lhs_p` and `rhs_p` before the update, while \(lhs_{new}\) and \(rhs_{new}\) are the values of the same variables after the update. I did not take any interest in `lhs_i` and `rhs_i` since neither of them is used to compute both \(lhs_{new}\) and \(rhs_{new}\); therefore, I just ignore them in the rest of this reflection.
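Keeping only the \(lhs_{old} \land \lnot rhs_{old}\) and \(\lnot lhs_{old} \land rhs_{old}\) parts of the updates (the trailing \(\oplus lhs_i\) and \(\oplus rhs_i\) terms are precisely what we ignore here), the table looks like this:

| \(lhs_{old}\) | \(rhs_{old}\) | \(lhs_{new}\) | \(rhs_{new}\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |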
We can easily replace `lhs_p_cpy` with \(lhs_{old}\) in the computation of \(lhs_{new}\), but we can't use it to compute \(rhs_{new}\) since the update has already occurred. My first thought to get rid of `lhs_p_cpy` was: "can we compute \(rhs_{new}\) with only \(rhs_{old}\) and \(lhs_{new}\)?" Looking at the truth table above, it appears that it is not possible. Trying to compute \(rhs_{new}\) before \(lhs_{new}\) doesn't solve the problem either.
However, we have already computed another value in the loop: `res_i`. Injecting \(res_i\) into the previous truth table yields the following table:
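With \(res_i = lhs_{old} \land rhs_{old}\) added as a column, the table becomes:

| \(lhs_{old}\) | \(rhs_{old}\) | \(res_i\) | \(lhs_{new}\) | \(rhs_{new}\) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 |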
We can infer from this "truth table" that \(rhs_{old} \land \lnot lhs_{old} = rhs_{old} \land \lnot res_i\). If we inject this new discovery back into the code, we can use it to totally get rid of `lhs_p_cpy`:
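Indeed, \(rhs_{old} \land \lnot(lhs_{old} \land rhs_{old}) = rhs_{old} \land (\lnot lhs_{old} \lor \lnot rhs_{old}) = rhs_{old} \land \lnot lhs_{old}\). The loop body, sketched as before, becomes:

```cpp
bool lhs_i = (lhs >> i) & 1u;
bool rhs_i = (rhs >> i) & 1u;

bool res_i = lhs_p & rhs_p;
res ^= static_cast<unsigned>(res_i) << i;

lhs_p = (lhs_p & !rhs_p) ^ lhs_i;
rhs_p = (rhs_p & !res_i) ^ rhs_i; // res_i plays the role of the old lhs_p
```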
By the way, `res_i` can also be used to compute \(lhs_{new}\), so that the code looks more symmetrical (even though it does not improve the performance):
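A sketch of that symmetrical loop body:

```cpp
bool lhs_i = (lhs >> i) & 1u;
bool rhs_i = (rhs >> i) & 1u;

bool res_i = lhs_p & rhs_p;
res ^= static_cast<unsigned>(res_i) << i;

lhs_p = (lhs_p & !res_i) ^ lhs_i; // lhs_p & !rhs_p == lhs_p & !res_i
rhs_p = (rhs_p & !res_i) ^ rhs_i;
```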
With all optimizations turned on, this code is a little bit faster than the previous version, but it is still far from being as efficient as the version that converts the Gray code operands to two's complement integers, performs an integer addition, then converts the result back to a Gray code. Additional twisting will be needed to optimize this algorithm even more.
Find a better end condition
Currently, the algorithm loops over every bit in `res` before ending. It might be possible to end the loop sooner, so let's have a look at a new, huge truth table that we will use for the rest of the article. For the sake of brevity, I used the names from the original pseudocode for the columns; I hope that the parallel with the variable names in the code isn't too hard to make (\(A = lhs_i\), \(B = rhs_i\), \(E_{old} = lhs_p\), \(F_{old} = rhs_p\), \(S = A \oplus B \oplus res_i\), \(E_{new} = lhs_p'\), \(F_{new} = rhs_p'\)).
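Assuming, as above, \(S = A \oplus B \oplus (E_{old} \land F_{old})\), \(E_{new} = (E_{old} \land \lnot F_{old}) \oplus A\) and \(F_{new} = (\lnot E_{old} \land F_{old}) \oplus B\), the table is the following:

| \(A\) | \(B\) | \(E_{old}\) | \(F_{old}\) | \(S\) | \(E_{new}\) | \(F_{new}\) |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |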
Looking at the very first line of that table, we can see that when \(A = B = E_{old} = F_{old} = 0\), we have \(S = E_{new} = F_{new} = 0\). Since our result was filled with \(A \oplus B\) right from the start, this means that we can stop the loop as soon as the aforementioned condition is met (everything equals \(0\)), since the following bits of the result will always be \(0\). An easy way to do so is to "consume" `lhs` and `rhs` in the code with right shifts so that they eventually reach \(0\), and to change the loop condition to use bitwise ORs (here, only the `for` loop is shown for the sake of brevity):
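Sketched with the same conventions as before, the loop becomes:

```cpp
for (unsigned i = 0u ; lhs | rhs | lhs_p | rhs_p ; ++i) {
    bool lhs_i = lhs & 1u;
    bool rhs_i = rhs & 1u;

    bool res_i = lhs_p & rhs_p;
    res ^= static_cast<unsigned>(res_i) << i;

    lhs_p = (lhs_p & !res_i) ^ lhs_i;
    rhs_p = (rhs_p & !res_i) ^ rhs_i;

    // Consume the operands so that they eventually reach 0
    lhs >>= 1;
    rhs >>= 1;
}
```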
While it might reduce the running time of the algorithm for small values (a higher most significant set bit means a bigger value in Gray code), `lhs | rhs | lhs_p | rhs_p` is still a bunch of operations by itself for a loop condition. Our goal is to stop iterating when all of these variables are \(0\). One thing we can do is find, before the loop, which of `lhs` and `rhs` has the highest set bit and loop until that one is consumed. Since both `lhs` and `rhs` are consumed at the same speed, when the one with the highest set bit is consumed, the other one has already been consumed, so we only have to check whether the bigger of the two has reached \(0\). We can easily find which one has the highest set bit by checking which one has the biggest value when considering the underlying bits as a two's complement integer. Here is the algorithm once modified:
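A sketch of the whole function at this stage, with `parity` being the same helper as in the first sketch:

```cpp
#include <utility>

unsigned gray_add(unsigned lhs, unsigned rhs)
{
    // Make sure lhs holds the value with the highest set bit
    if (lhs < rhs) {
        std::swap(lhs, rhs);
    }

    unsigned res = lhs ^ rhs;
    bool lhs_p = parity(lhs);
    bool rhs_p = parity(rhs);

    // rhs is consumed no later than lhs, so it can be dropped from the condition
    for (unsigned i = 0u ; lhs | lhs_p | rhs_p ; ++i) {
        bool lhs_i = lhs & 1u;
        bool rhs_i = rhs & 1u;

        bool res_i = lhs_p & rhs_p;
        res ^= static_cast<unsigned>(res_i) << i;

        lhs_p = (lhs_p & !res_i) ^ lhs_i;
        rhs_p = (rhs_p & !res_i) ^ rhs_i;

        lhs >>= 1;
        rhs >>= 1;
    }
    return res;
}
```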
Have a better look at the truth table
From the big truth table we built, we can see that when \(A = B = 0\), the only case where \(S = 1\) is when \(E_{old} = F_{old} = 1\). However, when \(A = B = 0\), we can also see that \(E_{new}\) and \(F_{new}\) can never be \(1\) at the same time. That means that once `lhs` and `rhs` are consumed, `res` can change only one more time, and only if `lhs_p & rhs_p` is `true`. Therefore, we can extract the `lhs_p | rhs_p` part from the loop condition and apply exactly one `res` assignment after the loop (I used a `while` loop since the `for` loop was becoming a bit messy for the job):
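Sketched as before, the main loop and the post-loop assignment become:

```cpp
unsigned i = 0u;
while (lhs) {
    bool lhs_i = lhs & 1u;
    bool rhs_i = rhs & 1u;

    bool res_i = lhs_p & rhs_p;
    res ^= static_cast<unsigned>(res_i) << i;

    lhs_p = (lhs_p & !res_i) ^ lhs_i;
    rhs_p = (rhs_p & !res_i) ^ rhs_i;

    lhs >>= 1;
    rhs >>= 1;
    ++i;
}
// Once lhs and rhs are consumed, res can change at most once more
res ^= static_cast<unsigned>(lhs_p & rhs_p) << i;
```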
Sometimes, more is less
Currently, we use the fact that at some point both `lhs` and `rhs` will eventually become \(0\), and that only one modification of `res` can occur right after this point. However, we don't use the fact that at some point `rhs` is \(0\) while `lhs` might not be totally consumed yet. We can actually write two loops instead of one: one loop that runs until the smaller of `lhs.value` and `rhs.value` is \(0\), and another one that runs until the other one reaches \(0\). The second loop can be simplified since we know that `rhs` (the smaller value after the `std::swap`) is \(0\) during its iterations:
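A sketch of the two loops and of the final assignment, with the same conventions:

```cpp
unsigned i = 0u;

// First loop: until the smaller operand is consumed
while (rhs) {
    bool lhs_i = lhs & 1u;
    bool rhs_i = rhs & 1u;

    bool res_i = lhs_p & rhs_p;
    res ^= static_cast<unsigned>(res_i) << i;

    lhs_p = (lhs_p & !res_i) ^ lhs_i;
    rhs_p = (rhs_p & !res_i) ^ rhs_i;

    lhs >>= 1;
    rhs >>= 1;
    ++i;
}

// Second loop: rhs is 0, so rhs_i disappears from the updates
while (lhs) {
    bool lhs_i = lhs & 1u;

    bool res_i = lhs_p & rhs_p;
    res ^= static_cast<unsigned>(res_i) << i;

    lhs_p = (lhs_p & !res_i) ^ lhs_i;
    rhs_p = rhs_p & !res_i;

    lhs >>= 1;
    ++i;
}

res ^= static_cast<unsigned>(lhs_p & rhs_p) << i;
```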
Rethink the logic
The second loop can be rethought as "what happens when `res_i` is \(0\), and what happens when it is \(1\)?". In terms of code, with some simplifications applied, it yields the following loop:
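A sketch of that branching version of the second loop (the post-loop assignment from the previous version is still needed after it):

```cpp
// i, lhs, lhs_p and rhs_p come from the first loop
while (lhs) {
    bool lhs_i = lhs & 1u;

    if (lhs_p && rhs_p) {
        // res_i == 1: flip the result bit, rhs_p always drops to false
        res ^= 1u << i;
        lhs_p = lhs_i;
        rhs_p = false;
    } else {
        // res_i == 0: rhs_p is left untouched
        lhs_p ^= lhs_i;
    }

    lhs >>= 1;
    ++i;
}
```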
While this version is actually generally slower than the previous one because of branching, it makes it possible to notice some interesting things: one of the branches always sets `rhs_p` to `false` and cannot be executed when `rhs_p == false`, while the other branch does not update `rhs_p` at all. This means that the loop can be understood as "do stuff while `rhs_p` is `true`", so we can rewrite the original second loop (not the new transformed second loop) as follows:
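A sketch of the rewritten second loop:

```cpp
// i, lhs, lhs_p and rhs_p come from the first loop
while (rhs_p) {
    bool lhs_i = lhs & 1u;

    // rhs_p is true here, so res_i == lhs_p
    res ^= static_cast<unsigned>(lhs_p) << i;

    rhs_p = !lhs_p;
    lhs_p = lhs_i;

    lhs >>= 1;
    ++i;
}
```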
Many simplifications have been applied since we know that `rhs_p == true` during this loop. We end up with a second loop that is almost free compared to the first one, which means that when we add a number with a big highest set bit to one with a small highest set bit, this version could perform significantly faster than the previous ones. Also note that since this last loop ends when `rhs_p == false`, `rhs_p & lhs_p` is always `false` in the post-loop operation, so we can get rid of it altogether.
Now, we can observe that the loop updates `res` with `lhs_p` until `rhs_p` is `false`. However, `rhs_p` is set to `not lhs_p`, which means that `res` is updated with `lhs_p` while `lhs_p` is \(0\), except for the last iteration. In other words, `res` is meaningfully updated only once, since when we reach `lhs_p == 1` we leave the loop right after the next iteration. So we can tweak the loop to kick the `res` update out of it:
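A sketch of the tweaked second loop, assuming the sum does not overflow:

```cpp
// Second loop: rhs is consumed, res can change at most once more,
// and only if rhs_p is still true at this point
if (rhs_p) {
    // The only iteration that would modify res is the one where
    // lhs_p is true, so just look for it
    while (!lhs_p) {
        lhs_p = lhs & 1u;
        lhs >>= 1;
        ++i;
    }
    res ^= 1u << i;
}
```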
Combined one after the other, these optimizations leave us with the following algorithm:
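A sketch assembling the previous snippets, still on plain `unsigned` values with the `parity` helper, and with no handling of overflow:

```cpp
#include <utility>

unsigned gray_add(unsigned lhs, unsigned rhs)
{
    // Make sure lhs holds the value with the highest set bit
    if (lhs < rhs) {
        std::swap(lhs, rhs);
    }

    unsigned res = lhs ^ rhs;
    bool lhs_p = parity(lhs);
    bool rhs_p = parity(rhs);
    unsigned i = 0u;

    // First loop: until the smaller operand is consumed
    while (rhs) {
        bool lhs_i = lhs & 1u;
        bool rhs_i = rhs & 1u;

        bool res_i = lhs_p & rhs_p;
        res ^= static_cast<unsigned>(res_i) << i;

        lhs_p = (lhs_p & !res_i) ^ lhs_i;
        rhs_p = (rhs_p & !res_i) ^ rhs_i;

        lhs >>= 1;
        rhs >>= 1;
        ++i;
    }

    // Second loop: rhs is 0, res can change at most once more,
    // and only if rhs_p is still true
    if (rhs_p) {
        while (!lhs_p) {
            lhs_p = lhs & 1u;
            lhs >>= 1;
            ++i;
        }
        res ^= 1u << i;
    }
    return res;
}
```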
While this is the most versatile and optimized version of the algorithm presented in this article, it is still an order of magnitude slower than converting the Gray code operands to two's complement integers, performing a regular addition, and converting the result back to a Gray code. Which isn't surprising either, since we are comparing an \(O(n)\) algorithm to an \(O(\log{n})\) one in the end.
It’s also worth noting that this algorithm does not provide a circular behaviour on overflow.