because shift happens
explore parallel worlds
The first step in migrating a sequential program to parallel code almost always involves identifying opportunities for concurrency. “Almost”, because many useful applications are clearly data-parallel – which is why SPMD (single program, multiple data) grids are so popular.
By adapting a grid-friendly application to many-core parallelism, significant speedup can be gained. We’ll use a common grid application – portfolio VaR – as our first example. For a large number of simulated scenarios, a population of transactions is valued. The values computed for each simulation are sampled in a histogram for analysis. We will pretend that there is a histogram per trade, though, in practice some level of aggregation is common.
The sequential logic (stylized) looks something like:
histogram.init( N_SIM );
for ( i = 0; i < N_SIM; ++i ) {
for ( k = 0; k < N_TRADE; ++k ) {
histogram.sample( k, trade[ k ].value( scenario[ i ] ) );
}
}
Applying map-reduce, it is possible either to partition the set of sample scenarios, or the set of trades to be valued; the typical grid application will execute a sequential process on many cores, with one partitioned subset of data assigned per core to each. When all partitions are processed, a reduction step consolidates the results.
In such implementation, each core acts in isolation. Recent improvements in processor architecture offer further, substantial performance gains resulting from significant size increases in shared, on-chip cache. To realize this gain, logic must be restructured so that the “memory wall” cost is minimized; as the indications below show, main memory access is expensive.
the memory wall
As the table below shows, main memory is plentiful, but many times more expensive to access. The cache subsystem is much faster, but also small. Logic flow must be explicitly designed to take advantage of it. Latencies vary with processor technology, but the relative cost of a main memory access is high.
|
[1] Inside Nehalem [2] Memory Stall |
parallelize this!
A direct parallel implementation using the parallel_for template in Intel TBB is shown below. There is some performance gain relative to a distributed grid, since scenarios need be generated once and can be shared by all threads.
Ordering trades by type will help performance by improving cache residency of pricing algorithm code.
class Simulate {
const ScenarioList & _scenarios;
const TradeList & _trades;
Histogram & _histogram;
public:
Simulate(
const ScenarioList & scenarios,
const TradeList & trades,
Histogram & histogram
) : _scenarios( scenarios ), _trades( trades ), _histogram( histogram ) {}
void operator()( const blocked_range & range ) const {
for( size_t i = range.begin(); i != range.end(); ++i ) {
for( size_t k = 0; k < N_TRADE; k++ )
_histogram.sample( k, _trade[ k ].value( _scenarios[ i ] ) );
}
}
}
};
...
parallel_for( blocked_range( 0, N_SIM ), Simulate( scenarios, histogram ) );
the kernel partitioner
A closer look at the memory access behaviour of the direct parallel_for reveals a high cost in main memory cycles. Since neither the trade not scenario sets are small enough to fit into cache, each trade must be reloaded once for every scenario.
Borrowing a pattern from GPU programming, the EigenSpace::KernelPartition pattern divides the data set into “stripes”. A stripe contains a number of kernels, each of which is bound to a set of trades and a set of scenarios.
Kernels contained in a stripe are executed in parallel, iteration over stripes is sequential. Each stripe references a disjoint set of trades, ensuring that each trade is loaded only once. Scenarios are re-loaded once per stripe.
|
Given:
| with striping | O[ ( T + S * Ns / C ) * C ] |
| without striping (Ns = T) | O[ T * S ] |
| limiting case (Ns = 1, C = 1) | O[ T + S ] |
class Simulate {
Histogram & _histogram;
public:
Simulate( Histogram & histogram )
: _histogram( histogram ) {}
void operator()( const Scenario & aScenario, const Trade & aTrade ) const {
_histogram.sample( k, aTrade.value( aScenario ) );
}
};
...
typedef EigenSpace::KernelPartition<TradeList, ScenarioList, Simulate> Stripes;
...
Stripes stripes( trades, scenarios );
stripes.execute();
Graphics cards already approach the limiting case and can really light the afterburners, with 128-512 cores and 256M-1G of high-speed VRAM. However, many pricing analytics libraries are large and have deep, intertwined inheritance hierarchies that are not easily expressed as GPU compilable kernels.
(Model.Bricks 2.0 will offer GPU kernels for common computations in finance. Watch this space for announcements!)
| T | F | S | S | M | T | W |
|---|---|---|---|---|---|---|
| « Aug | ||||||
| 1 | ||||||
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 | ||||||
© 2010 eigen.systems
Wordpress Themes by (DT)
Leave a reply