because shift happens
explore parallel worlds
In a previous post I described iteration space partitioning as one way of improving cache residency of data. How much of a speedup does it deliver, really?
Matrix multiply is a good vehicle to illustrate the memory wall effect – the plots below show performance with increasing matrix dimension / storage layout combinations, for both the familiar multiplication loop, and a locality optimizing block multiply algorithm.
Being both computation and bandwidth intensive, matrix multiply has performance characteristics similar to many large problems – VaR and large portfolio simulations in particular. Of course, if it is matrix algebra that you need, an auto (ATLAS) or vendor tuned (MKL et al.) library is best.
Locality improvement (both clever arrangement of data in memory and loop twiddling) can and does yield significant speedup.
The measurements use four storage layouts, for matrix dimensions ranging from 32×32 to 2048×2048. The code and stats used are here.
| A-matrix | B-matrix | |
|---|---|---|
cc |
column major | column major |
cr |
column major | row major |
rc |
row major | column major |
rr |
row major | row major |
Two algorithms are compared; the familiar, simple three deep loop:
for( int i = 0; i < N; i++ ) {
for( int j = 0; j < M; j++ ) {
sum = 0;
for( int k = 0; k < K; k++ ) {
sum += A[ i, k ] * B[ k, j ]
}
C[ i, j ] = sum;
}
}
and the block multiply algorithm below:
for( int k = 0; k < A.nCols(); k += K_BK ) {
for( int i = 0; i < C.nRows(); i += I_BK ) {
for( int j = 0; j < C.nCols(); j += J_BK ) {
for( int i1 = i; i1 < min( i + I_BK, C.nRows() ); i1++ ) {
for( int j1 = j; j1 < min( j + J_BK, C.nCols() ); j1++ ) {
T sum = C[ i1, j1 ];
for( int k1 = k; k1 < min( k + K_BK, A.nCols() ); k1++ ) {
sum += A[ i, k ] * B[ k, j ];
}
C[ i1, j1 ] = sum;
}
}
}
}
}
On Xeon 5460 cores (both single and 8-way parallel), simple multiply hits a brick wall once matrix dimension reaches 2048×2048.
Removing just the extreme (2048×2048) measurement, the behavior is still far from the cubic curve you might expect; speed is very sensitive to memory layout (when one or both matrices are accessed with a large stride, the memory controller’s ability to deliver multiple words per bus cycle is wasted).
![]() |
![]() |
With the exception of the case when A is row-major and B is column-major, performance with large matrices is unstable. By contrast, the block algorithm is predictable. The next plot compares the best performing case (rc) of the simple algorithm to block matrix performance – runtimes for the block algorithm get increasingly better as matrix size grows. At 2K x 2K, better locality delivers a 2.25x speedup.
The block algorithm is relatively insensitive to storage layout. On the flip side, block sizes need tuning for best performance on a given machine.
![]() |
Finally, the parallel comparisons (the examples use OpenMP), plotting MFLOP rates against matrix size confirm the normal intuitions. Knowing the inflection points is useful to make the best choices – a few loop tweaks and a rearrangements of memory can do wonders for speedup!
![]() |
| T | F | S | S | M | T | W |
|---|---|---|---|---|---|---|
| « Aug | ||||||
| 1 | ||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 | ||||||
© 2010 eigen.systems
Wordpress Themes by (DT)
Leave a reply