In a previous post I described iteration space partitioning as one way of improving cache residency of data. How much of a speedup does it deliver, really?

Matrix multiply is a good vehicle to illustrate the memory wall effect – the plots below show performance with increasing matrix dimension / storage layout combinations, for both the familiar multiplication loop, and a locality optimizing block multiply algorithm.

Being both computation and bandwidth intensive, matrix multiply has performance characteristics similar to many large problems – VaR and large portfolio simulations in particular. Of course, if it is matrix algebra that you need, an auto (ATLAS) or vendor tuned (MKL et al.) library is best.

Locality improvement (both clever arrangement of data in memory and loop twiddling) can and does yield significant speedup.

The measurements use four storage layouts, for matrix dimensions ranging from 32×32 to 2048×2048. The code and stats used are here.

A-matrix B-matrix
cc column major column major
cr column major row major
rc row major column major
rr row major row major

Two algorithms are compared; the familiar, simple three deep loop:


for( int i = 0; i < N; i++ ) {
for( int j = 0; j < M; j++ ) {
sum = 0;
for( int k = 0; k < K; k++ ) {
sum += A[ i, k ] * B[ k, j ]
}
C[ i, j ] = sum;
}
}

and the block multiply algorithm below:


for( int k = 0; k < A.nCols(); k += K_BK ) {
for( int i = 0; i < C.nRows(); i += I_BK ) {
for( int j = 0; j < C.nCols(); j += J_BK ) {
for( int i1 = i; i1 < min( i + I_BK, C.nRows() ); i1++ ) {
for( int j1 = j; j1 < min( j + J_BK, C.nCols() ); j1++ ) {
T sum = C[ i1, j1 ];
for( int k1 = k; k1 < min( k + K_BK, A.nCols() ); k1++ ) {
sum += A[ i, k ] * B[ k, j ];
}
C[ i1, j1 ] = sum;
}
}
}
}
}

On Xeon 5460 cores (both single and 8-way parallel), simple multiply hits a brick wall once matrix dimension reaches 2048×2048.

Removing just the extreme (2048×2048) measurement, the behavior is still far from the cubic curve you might expect; speed is very sensitive to memory layout (when one or both matrices are accessed with a large stride, the memory controller’s ability to deliver multiple words per bus cycle is wasted).

With the exception of the case when A is row-major and B is column-major, performance with large matrices is unstable. By contrast, the block algorithm is predictable. The next plot compares the best performing case (rc) of the simple algorithm to block matrix performance – runtimes for the block algorithm get increasingly better as matrix size grows. At 2K x 2K, better locality delivers a 2.25x speedup.

The block algorithm is relatively insensitive to storage layout. On the flip side, block sizes need tuning for best performance on a given machine.

Finally, the parallel comparisons (the examples use OpenMP), plotting MFLOP rates against matrix size confirm the normal intuitions. Knowing the inflection points is useful to make the best choices – a few loop tweaks and a rearrangements of memory can do wonders for speedup!

  • - Below a certain size, it is best to use the simplest algorithm as the parallel overheads are overwhelming.
  • - If it is possible, arranging the storage in a cache and memory controller friendly layout pushes the performance envelope of every algorithm upto a point.
  • - Simply parallelizing the simple algorithm is best for large data sets.
  • - As simple parallel performance begins to decay, a locality enhancing algorithm will outperform.