Serpent S Boxes as Boolean Functions
The fastest implementations of Serpent depend on finding efficient boolean function decompositions for the eight Serpent S boxes and their inverses. One way of looking for such expressions is to perform a recursive search of all combinations of primitive boolean terms up to some limit. Here is the program I use to do this - it works simply by looking for the minimum number of terms required to express the S box in question.
Sam Simpson has been running this program as a background task on a cluster of high performance servers. After a search involving around 1000 machine hours we have found the S box decompositions given here.
Although these are the S boxes we have found with the minimum number of terms, it is not always true that these expansions give the best results. On the Pentium Pro and Pentium II dependency chains, out of order execution and cache line relationships between temporary variables means that the results can vary considerably as small changes are made in the way the S box functions are actually calculated.
The functions given above are those directly output by the analyser. On any particular machine it will be desirable to experiment with the order of terms (where the is quite a lot of flexibility) and with the reuse of the temporary variables used during function evaluation (the raw analyser output does not reuse temporaries).
Pentium II/III without MMX
Dag Arne Osvik has recently used a new search technique to find S box boolean functions tuned for the Pentium architecture. I have coded Serpent in assembler using these functions and obtained the following performance results (using my AES2 testing environment). The testing was undertaken on a 600MHz PIII system but the speed figures are given for the 200MHz reference platform.
Serpent maps well onto the Pentium MMX registers since the bit-slice technique it uses allows two blocks to be processed in parallel by placing the corresponding 32-bit words of each block into the upper and lower 32-bit words of each MMX register respectively. This technique is illustrated by this assembler implementation using the NASM assembler with a Microsoft Visual C/C++ test harness. This implementation uses a pre-computed key schedule and is tuned for bulk encryption with the following results (cycles for two 128-bit blocks with the resulting speed based on the 200MHz reference platform):
There is a penalty compared with non-MMX code because the MMX instruction set lacks a rotate instruction but this is more than compensated by the ability to process two blocks at a time. The use of S boxes that do not require table lookups is also a considerable benefit since the whole algorithm can be run in the MMX registers with only the key schedule requiring memory accesses.