A cache-oblivious matrix transposition algorithm is implemented and analyzed using simulation and hardware performance counters.
Contrary to its name, the cache-oblivious matrix transposition algorithm is found to exhibit complex cache behavior, with
a cache miss ratio that depends strongly on the associativity of the cache. In some circumstances the cache behavior
is found to be worse than that of a naïve transposition algorithm. While total cache size is an important factor in determining
cache usage efficiency, the sub-block size, associativity, and cache line replacement policy are also shown to have a significant effect.
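
For orientation, the two algorithms being compared can be sketched as follows. This is a minimal illustration of the general divide-and-conquer technique, not necessarily the implementation analyzed here: it assumes a row-major matrix stored in a flat list, and the function names and the `leaf` cutoff (the sub-block size at which recursion stops) are hypothetical choices made for the example.

```python
def transpose_naive(a, n_rows, n_cols):
    """Row-by-row transposition of a row-major n_rows x n_cols matrix."""
    b = [0] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            # Reads of `a` are sequential; writes to `b` stride by n_rows.
            b[j * n_rows + i] = a[i * n_cols + j]
    return b


def transpose_recursive(a, b, n_rows, n_cols,
                        r0=0, r1=None, c0=0, c1=None, leaf=8):
    """Divide-and-conquer transposition of rows [r0, r1), columns [c0, c1).

    The longer dimension of the sub-block is halved until both dimensions
    are at most `leaf` (assumed >= 1); the sub-block is then copied directly.
    No cache parameters appear anywhere in the code.
    """
    if r1 is None:
        r1 = n_rows
    if c1 is None:
        c1 = n_cols
    dr, dc = r1 - r0, c1 - c0
    if dr <= leaf and dc <= leaf:
        # Base case: transpose the small sub-block element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j * n_rows + i] = a[i * n_cols + j]
    elif dr >= dc:
        # Split the row range in half.
        mid = r0 + dr // 2
        transpose_recursive(a, b, n_rows, n_cols, r0, mid, c0, c1, leaf)
        transpose_recursive(a, b, n_rows, n_cols, mid, r1, c0, c1, leaf)
    else:
        # Split the column range in half.
        mid = c0 + dc // 2
        transpose_recursive(a, b, n_rows, n_cols, r0, r1, c0, mid, leaf)
        transpose_recursive(a, b, n_rows, n_cols, r0, r1, mid, c1, leaf)
    return b
```

Both functions produce the same result; they differ only in the order in which elements are visited, and that order is what determines the memory access pattern seen by the cache.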