Implementations of AES (Rijndael) in C/C++ and Assembler
I have recently updated my AES code and the new version is available here. Since the interface has changed, the previous version remains available here but will no longer be updated. The main aim of the new code is to concentrate on AES only and to offer a simpler interface with more compile-time parameter checking. The new code also supports x86 assembler versions for both integer and MMX operations and can achieve about 17 cycles per byte on a P4 processor (the best code achieves about 14 cycles per byte but is not free).
I have taken an interest in helping to ensure that the Rijndael and AES specifications are effective from an implementation perspective and I wrote an early input paper to the US NIST AES FIPS development with this in mind. I have since further developed this original input document into a full description of Rijndael which I make available here as an Adobe Acrobat PDF file.
Algorithm Code in C/C++
This code implements both AES and Rijndael. The standard code implements block sizes of 16, 24 and 32 bytes, fixed during compilation, and a variable block size option covering these block sizes chosen at time of use. Each of these options operates with key sizes of 16, 24 and 32 bytes chosen at time of use. An alternative implementation offers block and key sizes of 16, 20, 24, 28 and 32 bytes.
The standard implementation provides AES when implemented with a block size of 16 bytes. This is heavily optimised, especially for the 16 byte key size. The variable block size option is much slower and is not recommended unless this is really needed. The alternative implementation is also less optimised.
The code is arranged so that encryption and decryption operations are entirely separate, so encryption-only and decryption-only applications can be produced without the overhead of the other mode. This has been possible because speed optimisations have allowed the decryption schedule to be compiled directly, without relying on the encryption key schedule and without compromising speed.
The C interface is:
The code has also been optimised so that only the tables required are compiled into the code. It has also been divided into components so that the compilation options can be more easily set (these are also more fully explained). Lastly there is fully integrated x86 assembler code for the standard AES encryption and decryption operations.
This code supports all Rijndael block sizes. It can be compiled with a fixed or a variable block size (although the latter involves a very significant performance penalty). The block and key sizes are specified in units of bytes to match the associated input arrays (legal values 16, 24 and 32). The input parameters are checked for correctness and the encrypt and decrypt routines check that an appropriate key has been set up. The functions hence return a success or failure value. For the variable block size version there is a call to set the block size, a value of 16 being assumed if this is not set (i.e. the AES standard block size).
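The checking described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: every name here (sketch_ctx, sketch_set_key and so on) is hypothetical and is not the interface of the code on this page, the cipher rounds are omitted, and the decision to require a fresh key after a block size change is a choice made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names: a sketch of the checking described above,
   not the interface of the code offered on this page. */
enum { SKETCH_BAD = 0, SKETCH_GOOD = 1 };

typedef struct {
    unsigned int blk_len;    /* block length in bytes; 0 means "not set" */
    unsigned int key_len;    /* key length in bytes; 0 means "no key set" */
    unsigned char key[32];   /* raw key (a real context holds a key schedule) */
} sketch_ctx;

static int legal_len(unsigned int n)
{
    return n == 16 || n == 24 || n == 32;
}

/* Block length in use: 16 (the AES block size) unless one has been set. */
unsigned int sketch_blk_len(const sketch_ctx *cx)
{
    return cx->blk_len ? cx->blk_len : 16;
}

int sketch_set_blk(sketch_ctx *cx, unsigned int n)
{
    if (cx == NULL || !legal_len(n))
        return SKETCH_BAD;
    cx->blk_len = n;
    cx->key_len = 0;   /* sketch choice: a new block size needs a new key set-up */
    return SKETCH_GOOD;
}

int sketch_set_key(sketch_ctx *cx, const unsigned char *key, unsigned int n)
{
    if (cx == NULL || key == NULL || !legal_len(n))
        return SKETCH_BAD;
    memcpy(cx->key, key, n);
    cx->key_len = n;
    return SKETCH_GOOD;
}

int sketch_encrypt_blk(const sketch_ctx *cx, const unsigned char *in,
                       unsigned char *out)
{
    if (cx == NULL || in == NULL || out == NULL || cx->key_len == 0)
        return SKETCH_BAD;   /* refuse to run without a key */
    /* the cipher rounds are omitted; this sketch shows the checks only */
    return SKETCH_GOOD;
}
```

A caller starting from a zero-initialised context would see an encrypt call fail until a key of 16, 24 or 32 bytes has been accepted, which is the behaviour the paragraph above describes.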
The source code for the algorithm is in C but there is a C++ interface as well. The main files are:
Algorithm Code in Pentium Family Assembler
There is an assembler code source file for the encryption and decryption subroutines for the Pentium family (Pentium II/III/IV). This version implements only the standard block size of 16 bytes (128 bits) but is 20% faster than the C/C++ code. It achieves a maximum speed with a fully primed processor cache of about 280 cycles/block, which is around 58 Mbytes/second on a 1 GHz processor. Note also that it uses the Microsoft VC++ register saving conventions and may need to be modified to work with other C/C++ compilers. The code uses the NASM assembler available here and integrates with the C/C++ code for key scheduling and table generation.
If you need still more speed Helger Lipmaa has a commercial version here that achieves around 229 cycles/block.
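The cycles/block figures above can be converted to throughput or to cycles per byte with simple arithmetic. The helpers below are a sketch with hypothetical names; they assume that one block is processed at a time at a constant clock rate and ignore everything except the cipher itself.

```c
/* Bytes processed per second, given the clock rate in Hz, the cost of one
   block in cycles and the block length in bytes. */
double throughput_bytes_per_sec(double clock_hz, double cycles_per_block,
                                unsigned int block_bytes)
{
    return clock_hz / cycles_per_block * (double)block_bytes;
}

/* The same cost expressed as cycles per byte. */
double cycles_per_byte(double cycles_per_block, unsigned int block_bytes)
{
    return cycles_per_block / (double)block_bytes;
}
```

With a 1 GHz clock, 280 cycles per 16 byte block works out at roughly 57 million bytes per second, close to the figure quoted above, and 229 cycles/block corresponds to a little over 14 cycles per byte.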
This zip file contains a full set of round values for each of the 25 block and key length combinations from 128, 160, 192, 224 and 256 bits for one input block and one key value.
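The number of rounds for each of these 25 combinations follows the Rijndael rule Nr = max(Nb, Nk) + 6, where Nb and Nk are the block and key lengths in 32-bit words. The short sketch below (function and array names are my own, chosen for illustration) shows the rule and the set of lengths involved.

```c
/* The five Rijndael block/key lengths, in bits. */
static const unsigned int rijndael_bits[5] = { 128, 160, 192, 224, 256 };

/* Number of rounds for a given block and key length (both in bits):
   Nr = max(Nb, Nk) + 6, with Nb and Nk counted in 32-bit words. */
unsigned int rijndael_rounds(unsigned int block_bits, unsigned int key_bits)
{
    unsigned int nb = block_bits / 32;
    unsigned int nk = key_bits / 32;
    return (nb > nk ? nb : nk) + 6;
}

/* Fill a 5 x 5 table of round counts, indexed [block][key], and return
   the number of combinations covered. */
unsigned int rijndael_round_table(unsigned int rounds[5][5])
{
    unsigned int b, k, n = 0;
    for (b = 0; b < 5; ++b)
        for (k = 0; k < 5; ++k, ++n)
            rounds[b][k] = rijndael_rounds(rijndael_bits[b], rijndael_bits[k]);
    return n;
}
```

For the AES block size of 128 bits this gives 10, 12 and 14 rounds for 128, 192 and 256 bit keys respectively.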
These zip files contain new variable key and variable text test vectors proposed by Paulo Barreto and myself as replacements for the current versions. These are designed to assist in finding errors in byte order within input, output and key blocks.
Dynamic Link Libraries
The code can be compiled into a Dynamic Link Library and the file aesdll.zip contains a DLL for the AES standard (fixed) block size of 16 bytes. Other versions of this DLL can be compiled from the source code if needed.
This DLL can be called from any language that supports DLL use, including both Microsoft Visual Basic and Visual Basic for Applications (VBA). This zip file provides an example of use in VBA hosted in a Microsoft Word document that contains example VBA source code as text and as a macro that can be run to show its operation.
I am often asked for an example of how to use the algorithm code on this page so I have now produced a simple console mode file encryption application. The file aesxam.c encrypts a file with a user provided key using a command line as follows:
aesxam input_file_name output_file_name [d|e] hexadecimal_key_digits
For example, the command line
aesxam aes.c aes.enc e 0123456789abcdeffedcba9876543210
will encrypt aes.c to aes.enc with the key given. The file is then decrypted by using 'd' instead of 'e'. Here is a zip file containing the example source code and a binary executable for Windows in console mode. This is an example application that is not intended for real use. I believe it works correctly but I cannot guarantee this, so if you do use it, you do so at your own risk.
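The hexadecimal key argument in the command line above has to be turned into key bytes before it can be used. The sketch below shows one way this might be done; it is not taken from aesxam.c, the function names are hypothetical, and it assumes that each pair of digits gives one byte in left to right order and that only 16, 24 and 32 byte keys (32, 48 or 64 digits) are accepted.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if the character is not one. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Convert a string of hexadecimal digits into key bytes.
   Returns the key length in bytes (16, 24 or 32), or 0 if the string has
   the wrong length or contains a character that is not a hex digit. */
size_t parse_hex_key(const char *hex, unsigned char key[32])
{
    size_t n, i;

    if (hex == NULL)
        return 0;
    n = strlen(hex);
    if (n != 32 && n != 48 && n != 64)
        return 0;
    for (i = 0; i < n; i += 2) {
        int hi = hex_val(hex[i]);
        int lo = hex_val(hex[i + 1]);
        if (hi < 0 || lo < 0)
            return 0;
        key[i / 2] = (unsigned char)(hi * 16 + lo);
    }
    return n / 2;
}
```

Under these assumptions the 32 digit key in the example yields a 16 byte key whose first byte is 0x01 and whose last byte is 0x10.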
A Note on Algorithm Speed
The timing values quoted here are the best available (i.e. the fastest) from this code and are not representative of the sort of speed that will be achieved in practice, especially when small numbers of blocks are being processed. This is because the way I do the timing is by running the code on a particular set of input values before I time it using the same values. I am hence giving the speed achieved with the processor's data cache fully primed.
In encryption (or decryption) operations involving a small number of blocks the processor cache will not necessarily be primed and this will mean that the performance will be severely degraded compared with the figures given here. However as the number of blocks being encrypted increases, the processor cache progressively accumulates values from the tables and the performance for large numbers of blocks becomes close to the best figures given here.
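The timing approach described above, one untimed run on the input followed by timed runs on the same input, can be sketched as follows. This is an illustration only: the names are hypothetical, the standard clock() function stands in for the processor cycle counter that cycle-accurate figures need, and demo_pass simply reads a small table where a real measurement would call the cipher.

```c
#include <time.h>

typedef void (*work_fn)(void *arg);

/* Run fn once untimed to prime the caches, then time it reps times on the
   same input and return the best (smallest) time in seconds.
   Returns -1.0 if reps is 0. clock() is coarse, so work that is short
   should loop internally to give a measurable duration. */
double best_time_primed(work_fn fn, void *arg, unsigned int reps)
{
    double best = -1.0;
    unsigned int i;

    fn(arg);                          /* untimed warm-up run */
    for (i = 0; i < reps; ++i) {
        clock_t t0 = clock();
        fn(arg);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}

/* Stand-in workload: reads every entry of a 256 word table. */
typedef struct {
    unsigned int table[256];
    unsigned long calls;              /* number of times demo_pass has run */
    volatile unsigned int sink;       /* keeps the reads from being removed */
} demo_work;

void demo_pass(void *arg)
{
    demo_work *w = (demo_work *)arg;
    unsigned int i, sum = 0;

    for (i = 0; i < 256; ++i)
        sum += w->table[i];
    w->sink = sum;
    w->calls++;
}
```

Skipping the warm-up call gives a figure closer to what the first few blocks of a short message would see, which is the degraded case discussed above.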
For the highest speed, this code uses a maximum of 20 tables of 256 32-bit words, a total of 20Kbytes of data.
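The table storage figure is straightforward to check. The helper below is a small sketch with a hypothetical name; it assumes that every table has 256 entries of exactly 32 bits each.

```c
#include <stddef.h>
#include <stdint.h>

enum { TABLE_ENTRIES = 256 };

/* Storage in bytes for n_tables lookup tables of 256 32-bit words each. */
size_t table_bytes(size_t n_tables)
{
    return n_tables * TABLE_ENTRIES * sizeof(uint32_t);
}
```

Each table takes 1 Kbyte, so 20 tables take 20480 bytes, matching the 20 Kbytes quoted above.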
I don't offer the other test vectors because the files are too large.
All the source code used to create the application is available on this page so you can check it out yourself. The sources are also quite simple, since neither the applications nor the AES algorithm is difficult to understand, and this makes them useful for those who believe that security is best achieved by keeping things simple.
I am happy for this code to be used without payment provided that I don't carry any risks as a result. I would appreciate an appropriate acknowledgement of the source of the code if you do use it in a product or activity provided to third parties. I would also be grateful for feedback on how the code is being used, any problems you encounter, any changes or additions that are desirable for particular processors and any more general improvements you would like to see (no promises mind!).
I would like to thank the following individuals who have made contributions in the development of this code: