This paper discusses the state-of-the-art fast software implementation of block ciphers on Intel’s new microprocessor Core2,
particularly concentrating on “bitslice implementation”. The bitslice parallel encryption technique, initially proposed by
Biham for speeding up DES, has been successful on RISC processors with many long registers. On PC platforms, however, bitsliced
ciphers are not widely used in real applications, because in many cases they were not actually very fast on
previous PC processors. Moreover, the bitslice mode requires a non-standard data format, so an additional format conversion
is needed for compatibility with an existing parallel mode of operation, and this conversion was considered expensive.
This paper demonstrates that some bitsliced ciphers have a remarkable performance gain on Intel’s Core2 processor due to its
enhanced SIMD architecture. We show that KASUMI, a UMTS/GSM mobile standard block cipher, can be four times faster when implemented
using a bitslice technique on this processor. Our bitsliced AES code also runs at 9.2 cycles/byte, which is the
fastest AES performance ever reported on a PC processor. We then focus, for the first time, on how to optimize a conversion algorithm
between a bitslice format and a standard format on a specific processor. As a result, the bitsliced AES code can be faster
than a highly optimized “standard AES” code on Core2, even when the overhead of the conversion is taken into account. This means
that in the CTR mode, bitsliced AES is not only fast but also fully compatible with an existing implementation and, moreover,
secure against cache timing attacks, since a bitsliced cipher does not use any lookup tables with key/data-dependent addresses.
Keywords Fast Software Encryption - Bitslice - AES - KASUMI - Core2
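
To make the format conversion mentioned above concrete, the following is a minimal Python sketch of converting between a standard format and a bitslice format, assuming toy 4-bit blocks. It is not the optimized Core2 conversion described in the paper; the names to_bitslice, from_bitslice and toy_round are hypothetical, and the round function is an illustrative stand-in rather than part of AES or KASUMI.

```python
def to_bitslice(blocks, width):
    """Standard -> bitslice: slice i collects bit i of every block.

    Block number j ends up in bit position j of each slice, so one
    logical instruction on a slice processes all blocks in parallel.
    """
    slices = [0] * width
    for j, block in enumerate(blocks):
        for i in range(width):
            slices[i] |= ((block >> i) & 1) << j
    return slices


def from_bitslice(slices, count):
    """Bitslice -> standard: the inverse bit-matrix transposition."""
    blocks = [0] * count
    for i, s in enumerate(slices):
        for j in range(count):
            blocks[j] |= ((s >> j) & 1) << i
    return blocks


def toy_round(slices):
    """Hypothetical nonlinear step y0 = x0 ^ (x1 & x2), other bits unchanged.

    Only AND and XOR on whole slices are used: there is no table lookup,
    so no memory address depends on key or data.
    """
    out = list(slices)
    out[0] = slices[0] ^ (slices[1] & slices[2])
    return out


def toy_round_reference(block):
    """The same toy step computed on a single block in standard format."""
    return block ^ (((block >> 1) & (block >> 2)) & 1)


blocks = [0b1011, 0b0110, 0b1111, 0b0000]
sliced = to_bitslice(blocks, 4)
result = from_bitslice(toy_round(sliced), len(blocks))
```

Here result matches applying toy_round_reference to each block separately, while the sliced version handles all blocks with one AND and one XOR. The two nested loops in the conversion routines are the overhead that the abstract refers to; an optimized implementation would replace them with word-level or SIMD transposition steps.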