eigen.systems

because shift happens


welcome to our blog!

explore parallel worlds

8
August

the only planet with chocolate

Posted by Sarah Richards | In: green computing

In Helsinki, surplus heat from hundreds of computer servers located in a data center below Uspenski Cathedral will be used to heat hundreds of homes. Full Article

Near Berlin’s Tegel airport, rising heat from a data center is reported to cause turbulence for passing aircraft.
Full Article

The IT industry already has a carbon footprint matching that of the airlines, and it is not going to get any better anytime soon. The same goes for the energy cost of operating and cooling these giant information factories.

In a world where ‘green’ has become king and recycling is more than just the ‘right thing to do’, people are looking under every rock, tree and CPU for ways to ‘save the earth’.

Who would have guessed that IT + Data Centers = high carbon emissions?

Data centers are a necessary evil driving business costs into the stratosphere. The process of real-time information gathering demands energy, causing the current server construct to create an environmental and financial hazard, with seemingly no relief in sight.

Datacenters – Emissions in Mt (Mt = millions of metric tons)

US datacenters   170 Mt
Argentina        142 Mt
Netherlands      146 Mt
Malaysia         178 Mt

A focus on green technology is becoming a main objective for the modern business.

Better, more effective use of many-core and GPU computing is one path to reducing operating cost and environmental impact.

Adoption of GPU acceleration is slowly but surely on the increase, with better operating system support beginning with Snow Leopard and Windows 7. There is some technology flux, and some debate about which technology is faster, or easier to program – but the migration to parallel programming is inevitable.

At eigen.systems we are doing our part to deliver efficiency and performance. The eigen.spaces library makes it easier to modify existing applications to effectively use many-core processors. Model.Bricks provides a suite of framework components that allow selection of either many-core CPU or GPU algorithms for most functions, allowing fine control, best performance and the most efficient use of available hardware.

By providing tools that make it easier for developers to focus on their requirements while leveraging close-to-the-metal parallel programming, eigen.systems is helping to promote green computing, generating cost savings as well as delivering better processing performance.

Save the earth, it’s the only planet with chocolate!

  • Tags: carbon footprint, Eigen.Spaces, green computing, Model.Bricks
19
May

The need for speed: the world of High Frequency Trading

Posted by Sarah Richards | In: HFT

Since the unexpected and significant drop in the Dow on May 6th, articles speculating about the cause of the event have permeated the net.

Working at eigen.systems, my eyes are always peeled for articles that talk about High Frequency Trading (HFT), especially when the significance of computing in this industry is under discussion.

Overall, regardless of whether people blame or praise the role of computers in today’s age of HFT, one thing is for sure: computers are here to stay. The key question seems to be: how do we make them more accurate, faster and more efficient?

There is a great article in today’s New Zealand Herald, which I think both novices and industry professionals would do well to read. It explains quite simply what HFT is and how the world of trading has changed over the years with the introduction of faster computation, and offers predictions for the future. It’s a great read, so check it out!

AP photo

  • Tags: high frequency trading, parallel computing
26
April

mind the cache

Posted by Rakesh | In: patterns

In a previous post I described iteration space partitioning as one way of improving cache residency of data. How much of a speedup does it deliver, really?

Matrix multiply is a good vehicle to illustrate the memory wall effect – the plots below show performance with increasing matrix dimension / storage layout combinations, for both the familiar multiplication loop, and a locality optimizing block multiply algorithm.

Being both computation and bandwidth intensive, matrix multiply has performance characteristics similar to many large problems – VaR and large portfolio simulations in particular. Of course, if it is matrix algebra that you need, an auto (ATLAS) or vendor tuned (MKL et al.) library is best.

Locality improvement (both clever arrangement of data in memory and loop twiddling) can and does yield significant speedup.

The measurements use four storage layouts, for matrix dimensions ranging from 32×32 to 2048×2048. The code and stats used are here.

layout   A-matrix       B-matrix
cc       column major   column major
cr       column major   row major
rc       row major      column major
rr       row major      row major

Two algorithms are compared; the familiar, simple three deep loop:


// A, B and C are assumed to be matrix objects indexed as X( row, col ).
for( int i = 0; i < N; i++ ) {
    for( int j = 0; j < M; j++ ) {
        T sum = 0;
        for( int k = 0; k < K; k++ ) {
            sum += A( i, k ) * B( k, j );
        }
        C( i, j ) = sum;
    }
}

and the block multiply algorithm below:


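// C must be zero-initialized: each k-block adds its partial product into C.
// A, B and C are assumed to be matrix objects indexed as X( row, col ).
for( int k = 0; k < A.nCols(); k += K_BK ) {
    for( int i = 0; i < C.nRows(); i += I_BK ) {
        for( int j = 0; j < C.nCols(); j += J_BK ) {
            // multiply one I_BK x K_BK block of A by one K_BK x J_BK block of B
            for( int i1 = i; i1 < min( i + I_BK, C.nRows() ); i1++ ) {
                for( int j1 = j; j1 < min( j + J_BK, C.nCols() ); j1++ ) {
                    T sum = C( i1, j1 );
                    for( int k1 = k; k1 < min( k + K_BK, A.nCols() ); k1++ ) {
                        sum += A( i1, k1 ) * B( k1, j1 );
                    }
                    C( i1, j1 ) = sum;
                }
            }
        }
    }
}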

On Xeon 5460 cores (both single and 8-way parallel), simple multiply hits a brick wall once matrix dimension reaches 2048×2048.

Removing just the extreme (2048×2048) measurement, the behavior is still far from the cubic curve you might expect; speed is very sensitive to memory layout (when one or both matrices are accessed with a large stride, the memory controller’s ability to deliver multiple words per bus cycle is wasted).

With the exception of the case when A is row-major and B is column-major, performance with large matrices is unstable. By contrast, the block algorithm is predictable. The next plot compares the best performing case (rc) of the simple algorithm to block matrix performance – runtimes for the block algorithm get increasingly better as matrix size grows. At 2K x 2K, better locality delivers a 2.25x speedup.

The block algorithm is relatively insensitive to storage layout. On the flip side, block sizes need tuning for best performance on a given machine.

Finally, the parallel comparisons (the examples use OpenMP), which plot MFLOP rates against matrix size, confirm the normal intuitions. Knowing the inflection points is useful for making the best choices – a few loop tweaks and a rearrangement of memory can do wonders for speedup!

  • Below a certain size, it is best to use the simplest algorithm, as the parallel overheads are overwhelming.
  • If it is possible, arranging the storage in a cache and memory controller friendly layout pushes the performance envelope of every algorithm, up to a point.
  • Simply parallelizing the simple algorithm is best for large data sets.
  • As simple parallel performance begins to decay, a locality enhancing algorithm will outperform.
  • Tags: block algorithms, locality, OpenMP
9
April

ParaPLoP ‘10

Posted by Rakesh | In: announcements

It’s been a week since I got back from the ParaPLoP ‘10 workshop – it was great to meet with a group of people doing so much to bring parallel programming into the mainstream, and it was great learning.

Ade Miller presented the Task Graph pattern from Microsoft TPL. As a TBB user, I find strong parallels between them; TPL is elegant, and I am keen to learn more. Arguably, some of it is syntactic sugar, but sugar is sweet! I have resisted Microsoft for many years, but it is time to concede and assimilate.

Just as interesting as the workshops were the side conversations – it was also good to get a sense of the future (and direction of TBB) from its architect, Arch Robison. TBB 3.0 will be out soon, and I look forward to it.

Dick Gabriel’s talk on the works of Christopher Alexander was fascinating. The architect’s work has been an inspiration for the design patterns community.

And Ralph Johnson’s view of programs as transformations struck a chord – it is indeed true that most of us revisit the same applications / algorithms over and over again, so making programs parallel is indeed a transformation. Documenting patterns is about sharing best practices, to help the rest of us with that transformation process.

Very interesting for me personally was hearing from Tim Mattson about the consequences for software developers of what’s cooking in the silicon furnaces at Intel. I have a feeling things are going to get interesting.

Processors cannot be clocked much faster any more – power consumption and heat dissipation have seen to that. But the consequences go beyond clock speed limits. Current generations of processors have deep pipelines and out-of-order instruction scheduling, to hide memory access and other internal latency.

Instructions are not executed in the order they are laid out – if the execution of an instruction is going to be stalled because its operands are not yet available, other instructions are executed instead. Optimized instruction scheduling is done by the hardware at run time.  Quite likely, newer processors will ditch some of this complexity to accommodate more cores.

That leaves a hard optimization job for compilers – some of these optimizations cannot be statically done.

I take two lessons from this:

1. Where possible, use vendor libraries: MKL or NAG for math, for example. Let the vendors deliver processor-optimized versions. I’ve come across a lot of code built on variants of Numerical Recipes routines, which will keep application developers on an optimization treadmill.

2. Parallel programming is no longer just for speed junkies – to maintain current performance, parallelism is going to be needed.

What do you think?

9
April

“Clothes make the man,” so the saying goes. Developing UI

Posted by admin | In: design

A good user interface makes (or breaks) the software, application, or web site. It’s the difference between the user’s happiness and how far they’ll end up tossing the product out their figurative (or, in some cases, actual) window.

The challenge is to create a user interface that meets both the needs of the business as well as the user.  At times this is easier said than done.  Developers and designers have the unenviable task of trying to understand one another enough so that the end result is a usable, workable UI.

A developer may have experience constructing the front-end and back-end of an application but is not versed in design. The reverse is true for designers. Two different languages and thought processes trying to culminate in a cohesive product everyone can love.

To this end, information and communication are key.  Understanding your user’s mindset is another.

For instance: In today’s plug-n-play mind set, users are rarely inclined to search the ‘help’ section of their application for answers to a specific question (much to the developer’s dismay). Therefore an easy to understand interface, one that teaches your user how to use the application, must be constructed.  Integrating helpful hints, tips and instruction into the application as the user works with the product can be a good way to overcome possible obstacles. Helpful hints with the user option to disable them later is also a good way to go.

Pitfalls can occur when the development team has become caught up in the latest trends, colors, and bells and whistles. This is where the motto – ‘Less is More’ – is a mantra many would do well to tattoo across their computer screen.

Identifying your users will be invaluable. Methods of observation, as well as interviews, can help determine the user’s knowledge of systems and computers in general. This also helps to factor in the user’s background and how this will affect the way they use your product.  What are their jobs?

What tasks does the user frequently conduct and how can your product enhance their workflow?  Analyzing these questions can have a profound effect on your application but it is well worth the effort.

16
March

Eigensystems to present at ParaPLoP 2010!

Posted by Sarah Richards | In: announcements

From March 30-31, our very own Rakesh Joshi, co-founder of eigen.systems, will be presenting at the 2010 ParaPLoP workshop in Carefree, AZ.

ParaPLoP is an interactive conference where pattern authors present their case studies and share expertise in the field of parallel programming patterns. It is a great arena for pattern professionals and enthusiasts alike to analyze previously published patterns, learn about using patterns to develop parallel software and discover how to mine patterns from significant parallel code implementations.

Rakesh’s paper, CONCURRENT EVALUATION OF A DIRECTED ACYCLIC GRAPH, was chosen to be presented (exact day and time TBD). The paper discusses latent parallelism and the benefit and application of concurrency across a variety of platforms.

So if you are in the AZ area, sign up to attend the ParaPLoP 2010 conference! For more information about ParaPLoP 2010, please visit the official conference site here.

  • Tags: Eigen.Spaces, parallel computing, paraplop
24
February

Eigen.Systems launches the eigen.spaces: blackbird application server

Posted by Sameer Tipnis | In: press room

PRINCETON, NJ, February 23, 2010 – Eigen.Systems LLC announced today the launch of eigen.spaces:blackbird, an application server for multi-core parallel applications based on the eigen.spaces framework. Eigen.Systems is a software development company specializing in high-performance parallel applications for risk management and high frequency trading. Blackbird is Eigen.Systems’ fastest, most powerful application server yet, packed with features such as many-core grid application management, fast messaging, flexible protocol switching, adaptive load balancing and advanced performance instrumentation.

Eigen.Systems has been providing high speed parallel application development tools since the launch of eigen.spaces in 2008. Since then they have continued to enhance eigen.spaces to support an increasingly rich variety of risk management and trading applications. The release of blackbird builds on the eigen.spaces framework to create an environment where performance is maximized through increased concurrency and cache .

Rakesh Joshi of Eigen.Systems said: “The key to getting the most of many-core lies in understanding application behavior under varying loads. Synchronization costs and latencies can best be understood by sampling instrumentation; by combining low overhead instrumentation with the ability to reallocate resources at run-time, blackbird is a tool to determine the best resource assignment and scheduling strategies to meet performance requirements.”

An Intel Software Partner since 2008, Eigen.Systems have developed high performance parallel frameworks based on Intel Threading Building Blocks (TBB) and the Intel Math Kernel Library (MKL).
Two years ago Eigen.Systems entered the Intel Software program, utilizing high performance processor optimized threading libraries from Intel (TBB / MKL ) in their products, including blackbird.

Blackbird will be available as a limited trial version as well as a full version for purchase.

About Eigen.Systems
Eigen.Systems was founded in 2006 by a group of investment banking technologists, with over 40 years of combined industry experience in risk management and capital markets. With insight born of long experience in all facets of fixed income, FX, equity and credit markets, they strive to build strong customer relationships based on a track record of effective delivery and a keen focus on doing what is best for the client’s business.

Press contact: Sarah Richards – sarah@eigensystems.com

  • Tags: blackbird, Eigen.Spaces
23
February

Eigen.Systems launches eigen.spaces v2

Posted by Sarah Richards | In: press room

PRINCETON, NJ, February 23, 2010 – Eigen.Systems LLC announced today the launch of eigen.spaces v 2, a C++ concurrency toolkit for multi-core development. Eigen.spaces v 2 enhances the functionality and ease of use of eigen.spaces, allowing the user to define object types, specify dependencies and connect data sources with speed and simplicity.

Application patterns such as dependency graphs, data parallel applications and request servers allow existing sequential applications to be easily refactored to take advantage of current and upcoming many-core processors.

In conjunction with blackbird, an application server created by Eigen.Systems, multi-core eigen.spaces based applications can be monitored and tuned for best performance under varying workloads.

High performance I/O is delivered using flexible, asynchronous connectors that can be configured to operate with files, databases or various network protocols.

Concurrent patterns adapted to the needs of computational finance allow for existing sequential analytics to be bound to a parallel execution framework, preserving decades of investment in analytics models while tapping the power of many-core computing.

Metadata controlled serialization mechanisms provide compact, binary serialization to maximize network performance while preserving the ability to communicate with external systems using XML, CSV, JSON and other representations.

Rakesh Joshi of Eigen.Systems says “Processor technology has decisively shifted to multi-core, with the result that hardware upgrades can indeed slow down existing sequential applications. With that, concurrent programming utilizing such tools as eigen.spaces v 2 has become the new imperative if the benefits of new technology are to be realized.”

Evaluation versions of Eigen.spaces v 2 are available upon request. EigenSystems also offer training programs in parallel application development, and consulting services to accelerate migration to multi-core.

An Intel Software Partner since 2008, Eigen.Systems have developed high performance parallel frameworks based on Intel Threading Building Blocks (TBB) and the Intel Math Kernel Library (MKL).

About Eigen.Systems
Eigen.Systems was founded in 2006 by a group of investment banking technologists, with over 40 years of combined industry experience in risk management and capital markets. With insight born of long experience in all facets of fixed income, FX, equity and credit markets, they strive to build strong customer relationships based on a track record of effective delivery and a keen focus on doing what is best for the client’s business.

Press contact: Sarah Richards – sarah@eigensystems.com

  • Tags: Eigen.Spaces
17
February

the force is with us!

Posted by Rakesh | In: announcements

EigenSystems joins NVIDIA Partnerforce! We are readying OpenCL components for Model.Bricks v.2, our suite of parallel analytics components.

Stay tuned for beta announcements!

  • Tags: GPU, Model.Bricks, NVIDIA, OpenCL
27
January

racing stripes

Posted by Rakesh | In: patterns

The first step in migrating a sequential program to parallel code almost always involves identifying opportunities for concurrency. “Almost”, because many useful applications are clearly data-parallel – which is why SPMD (single program, multiple data) grids are so popular.

By adapting a grid-friendly application to many-core parallelism, significant speedup can be gained. We’ll use a common grid application – portfolio VaR – as our first example. For a large number of simulated scenarios, a population of transactions is valued. The values computed for each simulation are sampled in a histogram for analysis. We will pretend that there is a histogram per trade, though, in practice some level of aggregation is common.

The sequential logic (stylized) looks something like:


histogram.init( N_SIM );

for ( size_t i = 0; i < N_SIM; ++i ) {          // every simulated scenario
    for ( size_t k = 0; k < N_TRADE; ++k ) {    // every trade in the portfolio
        histogram.sample( k, trade[ k ].value( scenario[ i ] ) );
    }
}

Applying map-reduce, it is possible either to partition the set of sample scenarios, or the set of trades to be valued; the typical grid application will execute a sequential process on many cores, with one partitioned subset of data assigned per core to each. When all partitions are processed, a reduction step consolidates the results.

In such an implementation, each core acts in isolation. Recent improvements in processor architecture offer further, substantial performance gains resulting from significant size increases in shared, on-chip cache. To realize this gain, logic must be restructured so that the “memory wall” cost is minimized; as the indications below show, main memory access is expensive.

the memory wall

As the table below shows, main memory is plentiful, but many times more expensive to access. The cache subsystem is much faster, but also small. Logic flow must be explicitly designed to take advantage of it. Latencies vary with processor technology, but the relative cost of a main memory access is high.

[Figure: memory hierarchy, stall penalty vs. size. Sources: [1] Inside Nehalem, [2] Memory Stall]

parallelize this!

A direct parallel implementation using the parallel_for template in Intel TBB is shown below. There is some performance gain relative to a distributed grid, since scenarios need to be generated only once and can be shared by all threads.

Ordering trades by type will help performance by improving cache residency of pricing algorithm code.


#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
using namespace tbb;

class Simulate {
    const ScenarioList & _scenarios;
    const TradeList & _trades;
    Histogram & _histogram;

public:
    Simulate(
        const ScenarioList & scenarios,
        const TradeList & trades,
        Histogram & histogram
    ) : _scenarios( scenarios ), _trades( trades ), _histogram( histogram ) {}

    // Values every trade under each scenario in the sub-range given to this task.
    // sample() is called from several threads at once, so it must be thread-safe.
    void operator()( const blocked_range<size_t> & range ) const {
        for( size_t i = range.begin(); i != range.end(); ++i ) {
            for( size_t k = 0; k < N_TRADE; k++ ) {
                _histogram.sample( k, _trades[ k ].value( _scenarios[ i ] ) );
            }
        }
    }
};

...
parallel_for( blocked_range<size_t>( 0, N_SIM ), Simulate( scenarios, trades, histogram ) );

the kernel partitioner

A closer look at the memory access behaviour of the direct parallel_for reveals a high cost in main memory cycles. Since neither the trade nor scenario sets are small enough to fit into cache, each trade must be reloaded once for every scenario.

Borrowing a pattern from GPU programming, the EigenSpace::KernelPartition pattern divides the data set into “stripes”. A stripe contains a number of kernels, each of which is bound to a set of trades and a set of scenarios.

Kernels contained in a stripe are executed in parallel, iteration over stripes is sequential. Each stripe references a disjoint set of trades, ensuring that each trade is loaded only once. Scenarios are re-loaded once per stripe.

kernel stripes

Given:

T = total number of trades

S = total number of scenarios

C = number of sockets (cores on a socket can share L3 cache)

Ws (stripe width ) = number of trades per stripe

Ns = number of stripes = ( T / Ws )

memory access performance

case                              cost
with striping                     O[ ( T + S * Ns / C ) * C ]
without striping (Ns = T)         O[ T * S ]
limiting case (Ns = 1, C = 1)     O[ T + S ]


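class Simulate {
    Histogram & _histogram;

public:
    Simulate( Histogram & histogram )
        : _histogram( histogram ) {}

    // Kernel body: invoked for each ( scenario, trade ) pair bound to a kernel.
    // The original listing passed an undeclared k as the histogram slot;
    // index() is a hypothetical accessor standing in for the trade's position.
    void operator()( const Scenario & aScenario, const Trade & aTrade ) const {
        _histogram.sample( aTrade.index(), aTrade.value( aScenario ) );
    }
};

...
typedef EigenSpace::KernelPartition<TradeList, ScenarioList, Simulate> Stripes;

...
Stripes stripes( trades, scenarios );
stripes.execute();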

Graphics cards already approach the limiting case and can really light the afterburners, with 128-512 cores and 256M-1G of high-speed VRAM. However, many pricing analytics libraries are large and have deep, intertwined inheritance hierarchies that are not easily expressed as GPU compilable kernels.

(Model.Bricks 2.0 will offer GPU kernels for common computations in finance. Watch this space for announcements!)

  • Tags: concurrency, Eigen.Spaces, GPU, grid, kernel, map-reduce, SPMD

 


references

  • [1] the free lunch is over
  • [2] optimizing cache use
  • [3] Computers heat homes
  • [4] data center causes turbulence


© 2010 eigen.systems