<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eigen.systems</title>
	<atom:link href="http://www.eigensystems.com/eigenBlog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.eigensystems.com/eigenBlog</link>
	<description>because shift happens</description>
	<lastBuildDate>Sat, 07 Aug 2010 22:35:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>the only planet with chocolate</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=478</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=478#comments</comments>
		<pubDate>Sat, 07 Aug 2010 22:30:00 +0000</pubDate>
		<dc:creator>Sarah Richards</dc:creator>
				<category><![CDATA[green computing]]></category>
		<category><![CDATA[carbon footprint]]></category>
		<category><![CDATA[Eigen.Spaces]]></category>
		<category><![CDATA[Model.Bricks]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=478</guid>
		<description><![CDATA[In Helsinki, surplus heat from hundreds of computer servers located in a data center below Uspenski Cathedral will be used to heat hundreds of homes. Full Article
Near Berlin’s Tegel airport, rising heat from a data center is reported to cause turbulence for passing aircraft.
Full Article
The IT industry already has a carbon footprint matching airlines, and [...]]]></description>
			<content:encoded><![CDATA[<p>In Helsinki, surplus heat from hundreds of computer servers located in a data center below Uspenski Cathedral will be used to heat hundreds of homes. <a href="http://business.timesonline.co.uk/tol/business/industry_sectors/natural_resources/article7022488.ece" target="_blank">Full Article</a></p>
<p>Near Berlin’s Tegel airport, rising heat from a data center is reported to cause turbulence for passing aircraft.<a href="http://h41131.www4.hp.com/media2.php/Switzerland/PDF/Unternehmen/Geschaeftliche_Entwicklung/Schaffhauser_Nachrichten.pdf" target="_blank"><br />
Full Article</a></p>
<p>The IT industry already has a carbon footprint matching airlines, and it is not going to get any better anytime soon. The same goes for energy cost for operating and cooling these giant information factories.</p>
<p>In a world where ‘green’ has become king and recycling is more than just the ‘right thing to do’, people are looking under every rock, tree and CPU for ways to ‘save the earth’.</p>
<p>Who would have guessed that IT + Data Centers = high carbon emissions?</p>
<p>Data centers are a necessary evil driving business costs into the stratosphere. The process of real-time information gathering demands energy, causing the current server construct to create an environmental and financial hazard, with seemingly no relief in sight.</p>
<p><strong>Datacenters &#8211; Emissions in Mt <em>(Mt = millions of metric tons)</em></strong></p>
<table>
<tbody>
<tr>
<td style="text-align: justify;"><strong>US datacenters<br />
</strong></td>
<td style="text-align: left;"><em>170 Mt</em></td>
</tr>
<tr>
<td><strong>Argentina</strong></td>
<td><em>142 Mt</em></td>
</tr>
<tr>
<td><strong>Netherlands</strong></td>
<td><em>146 Mt</em></td>
</tr>
<tr>
<td><strong>Malaysia</strong></td>
<td><em>178 Mt</em></td>
</tr>
</tbody>
</table>
<p>A focus on green technology is becoming a main objective for the modern business.</p>
<p>Better, more effective use of many-core and GPU computing is one path to reducing operating cost and environmental impact.</p>
<p>Adoption of GPU acceleration is slowly but surely on the increase, with better operating system support beginning with Snow Leopard and Windows 7.  There is some technology flux, and some debate about which technology is faster, or easier to program – but the migration to parallel programming is inevitable.</p>
<p>At <strong>eigen.systems</strong> we are doing our part to deliver efficiency and performance. The <strong>eigen.spaces</strong> library makes it easier to modify existing applications to effectively use many-core processors. Model.Bricks provides a suite of framework components that allow selection of either many-core CPU or GPU algorithms for most functions, allowing fine control, best performance and the most efficient use of available hardware.</p>
<p>By providing tools that let developers focus on their requirements while leveraging close-to-the-metal parallel programming, eigen.systems is helping to promote green computing, generating cost savings as well as delivering better processing performance.</p>
<p>Save the earth, it’s the only planet with chocolate!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=478</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The need for speed: the world of High Frequency Trading</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=470</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=470#comments</comments>
		<pubDate>Tue, 18 May 2010 20:52:43 +0000</pubDate>
		<dc:creator>Sarah Richards</dc:creator>
				<category><![CDATA[HFT]]></category>
		<category><![CDATA[high frequency trading]]></category>
		<category><![CDATA[parallel computing]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=470</guid>
		<description><![CDATA[ Since the unexpected and significant drop in the DOW on May 6th, articles have permeated the net, speculating the cause of the event. 
Working at eigen.systems, my eyes are always peeled for articles that talk about High Frequency Trading (HFT), especially when the significance of computing in this industry is under discussion.
Overall, regardless of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://media.nzherald.co.nz/webcontent/image/jpg/NYSE_220x14748202.jpg" alt="Traders - AP Photo" /> Since the unexpected and significant drop in the DOW on May 6th, articles have permeated the net, speculating the cause of the event. </p>
<p>Working at eigen.systems, my eyes are always peeled for articles that talk about High Frequency Trading (HFT), especially when the significance of computing in this industry is under discussion.</p>
<p>Overall, regardless of whether people blame or praise the role of computers in today&#8217;s age of HFT, one thing is for sure: computers are here to stay. The key question seems to be: how do we make them more accurate, faster and more efficient? </p>
<p>There is a great article in today&#8217;s <a href="http://www.nzherald.co.nz/markets/news/article.cfm?c_id=62&#038;objectid=10645854">New Zealand Herald</a>, which I think both novices and industry professionals would do well to read. It explains quite simply what HFT is and how the world of trading has changed over the years with the introduction of faster computation, and offers predictions for the future. It&#8217;s a great read, so <a href="http://www.nzherald.co.nz/markets/news/article.cfm?c_id=62&#038;objectid=10645854">check it out</a>!</p>
<p><i>AP photo</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=470</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>mind the cache</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=360</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=360#comments</comments>
		<pubDate>Sun, 25 Apr 2010 21:33:05 +0000</pubDate>
		<dc:creator>Rakesh</dc:creator>
				<category><![CDATA[patterns]]></category>
		<category><![CDATA[block algorithms]]></category>
		<category><![CDATA[locality]]></category>
		<category><![CDATA[OpenMP]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=360</guid>
		<description><![CDATA[In a previous post I described iteration space partitioning as one way of improving cache residency of data. How much of a speedup does it deliver, really?
Matrix multiply is a good vehicle to illustrate the memory wall effect – the plots below show performance with increasing matrix dimension / storage layout combinations, for both the [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://www.eigensystems.com/eigenBlog/?p=84">previous post</a> I described iteration space partitioning as one way of improving cache residency of data. How much of a speedup does it deliver, really?</p>
<p>Matrix multiply is a good vehicle to illustrate the memory wall effect – the plots below show performance with increasing matrix dimension / storage layout combinations, for both the familiar multiplication loop, and a locality optimizing block multiply algorithm.</p>
<p>Being both computation and bandwidth intensive, matrix multiply has performance characteristics similar to many large problems &#8211; VaR and large portfolio simulations in particular. Of course, if it is matrix algebra that you need, an auto-tuned (ATLAS) or vendor-tuned (MKL et al.) library is best.</p>
<p>Locality improvement (both clever arrangement of data in memory and loop twiddling) can and does yield significant speedup.</p>
<p>The measurements use four storage layouts, for matrix dimensions ranging from 32&#215;32 to 2048&#215;2048. The code and stats used are <a href="http://www.eigensystems.com/eigenBlog/wp-content/uploads/2010/01/latency.tgz">here</a>.</p>
<table border="1" cellspacing="0" cellpadding="2" style="margin-left: 50px;">
<tbody>
<tr>
<th></th>
<th>A-matrix</th>
<th>B-matrix</th>
</tr>
<tr>
<td><code>cc</code></td>
<td>column major</td>
<td>column major</td>
</tr>
<tr>
<td><code>cr</code></td>
<td>column major</td>
<td>row major</td>
</tr>
<tr>
<td><code>rc</code></td>
<td>row major</td>
<td>column major</td>
</tr>
<tr>
<td><code>rr</code></td>
<td>row major</td>
<td>row major</td>
</tr>
</tbody>
</table>
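<p>To make the four layout codes concrete, here is a minimal C++ sketch of how element (i, j) maps to a flat buffer under each layout. The <code>Matrix</code> struct, the <code>Layout</code> enum and the two stride helpers are hypothetical names chosen for illustration; they are not taken from the code archive linked above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration only.
enum class Layout { RowMajor, ColMajor };

struct Matrix {
    std::size_t rows, cols;
    Layout layout;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c, Layout l)
        : rows(r), cols(c), layout(l), data(r * c, 0.0) {}

    // Position of element (i, j) in the flat buffer.
    std::size_t offset(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return layout == Layout::RowMajor ? i * cols + j   // rows are contiguous
                                          : j * rows + i;  // columns are contiguous
    }

    double& at(std::size_t i, std::size_t j) { return data[offset(i, j)]; }
    double at(std::size_t i, std::size_t j) const { return data[offset(i, j)]; }
};

// Distance (in elements) between neighbours when walking along a row.
// Requires at least two columns.
std::size_t strideAlongRow(const Matrix& m) {
    return m.offset(0, 1) - m.offset(0, 0);
}

// Distance (in elements) between neighbours when walking down a column.
// Requires at least two rows.
std::size_t strideAlongColumn(const Matrix& m) {
    return m.offset(1, 0) - m.offset(0, 0);
}
```

<p>In a product A &#215; B, the innermost index walks along a row of A and down a column of B. Under the assumptions of this sketch, both walks are unit stride only when A is row major and B is column major, which is the <code>rc</code> combination.</p>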
<p>Two algorithms are compared; the familiar, simple three deep loop:</p>
<blockquote><p><code><br />
for( int i = 0; i &lt; N; i++ ) {<br />
<indent_1>for( int j = 0; j &lt; M; j++ ) {</indent_1><br />
<indent_2>sum = 0;</indent_2><br />
<indent_2>for( int k = 0; k &lt; K; k++ )  {</indent_2><br />
<indent_3>sum += A[ i, k ] * B[ k, j ];</indent_3><br />
<indent_2>}</indent_2><br />
<indent_2>C[ i, j ] = sum;</indent_2><br />
<indent_1>}</indent_1><br />
}<br />
</code></p></blockquote>
<p>and the block multiply algorithm below:</p>
<blockquote><p><code><br />
for( int k = 0; k &lt; A.nCols(); k += K_BK ) {<br />
<indent_1>for( int i = 0; i &lt; C.nRows(); i += I_BK ) {</indent_1><br />
<indent_2>for( int j = 0; j &lt; C.nCols(); j += J_BK ) {</indent_2><br />
<indent_3>for( int i1 = i; i1 &lt; min( i + I_BK, C.nRows() ); i1++ ) {</indent_3><br />
<indent_4>for( int j1 = j; j1 &lt; min( j + J_BK, C.nCols() ); j1++ ) {</indent_4><br />
<indent_5>T sum = C[ i1, j1 ];</indent_5><br />
<indent_5>for( int k1 = k; k1 &lt; min( k + K_BK, A.nCols() ); k1++ ) {</indent_5><br />
<indent_6>sum += A[ i1, k1 ] * B[ k1, j1 ];</indent_6><br />
<indent_5>}</indent_5><br />
<indent_5>C[ i1, j1 ] = sum;</indent_5><br />
<indent_4>}</indent_4><br />
<indent_3>}</indent_3><br />
<indent_2>}</indent_2><br />
<indent_1>}</indent_1><br />
}<br />
</code></p></blockquote>
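<p>For readers who want something to compile, the following is a self-contained sketch of both loops. It makes several assumptions that are not in the measured code: matrices are row-major <code>std::vector&lt;double&gt;</code> buffers, a single block size <code>bs</code> stands in for <code>I_BK</code>, <code>J_BK</code> and <code>K_BK</code>, and the names <code>Mat</code>, <code>multiplySimple</code> and <code>multiplyBlocked</code> are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: row-major flat storage; element (r, c) of an R x C matrix is at r * C + c.
using Mat = std::vector<double>;

// C (n x m) = A (n x p) * B (p x m), plain three-deep loop.
Mat multiplySimple(const Mat& A, const Mat& B,
                   std::size_t n, std::size_t p, std::size_t m) {
    Mat C(n * m, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < p; ++k) {
                sum += A[i * p + k] * B[k * m + j];
            }
            C[i * m + j] = sum;
        }
    }
    return C;
}

// Same product, computed one bs x bs tile at a time so that the tiles of
// A, B and C being worked on have a chance to stay resident in cache.
Mat multiplyBlocked(const Mat& A, const Mat& B,
                    std::size_t n, std::size_t p, std::size_t m,
                    std::size_t bs = 32) {
    assert(bs > 0);
    Mat C(n * m, 0.0);
    for (std::size_t k = 0; k < p; k += bs) {
        for (std::size_t i = 0; i < n; i += bs) {
            for (std::size_t j = 0; j < m; j += bs) {
                // Inner loops use the tile-local indices i1, j1, k1.
                for (std::size_t i1 = i; i1 < std::min(i + bs, n); ++i1) {
                    for (std::size_t j1 = j; j1 < std::min(j + bs, m); ++j1) {
                        double sum = C[i1 * m + j1];  // accumulate across k tiles
                        for (std::size_t k1 = k; k1 < std::min(k + bs, p); ++k1) {
                            sum += A[i1 * p + k1] * B[k1 * m + j1];
                        }
                        C[i1 * m + j1] = sum;
                    }
                }
            }
        }
    }
    return C;
}
```

<p>The default block size of 32 is a placeholder, not a recommendation; it needs tuning for the cache sizes of the target machine.</p>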
<p>On Xeon 5460 cores (both single and 8-way parallel), simple multiply hits a brick wall once matrix dimension reaches 2048&#215;2048.</p>
<p>Removing just the extreme (2048&#215;2048) measurement, the behavior is still far from the cubic curve you might expect; speed is very sensitive to memory layout (when one or both matrices are accessed with a large stride, the memory controller&#8217;s ability to deliver multiple words per bus cycle is wasted).</p>
<table width="100%">
<tr>
<td><img src="wp-content/uploads/2010/01/nx1.png"  width="85%"/></td>
</tr>
<tr>
<td><img src="wp-content/uploads/2010/01/nx1a.png"  width="85%"/></td>
</tr>
</table>
<p>With the exception of the case when A is row-major and B is column-major, performance with large matrices is unstable. By contrast, the block algorithm is predictable. The next plot compares the best performing case (<code>rc</code>) of the simple algorithm to block matrix performance  &#8211; runtimes for the block algorithm get increasingly better as matrix size grows. At 2K x 2K, better locality delivers a 2.25x speedup.</p>
<p>The block algorithm is relatively insensitive to storage layout. On the flip side, block sizes need tuning for best performance on a given machine.</p>
<table width="100%">
<tr>
<td><img src="wp-content/uploads/2010/01/bx1.png"  width="85%"/></td>
</tr>
</table>
<p>Finally, the parallel comparisons (the examples use OpenMP), plotting MFLOP rates against matrix size, confirm the normal intuitions. Knowing the inflection points is useful to make the best choices &#8211; a few loop tweaks and a rearrangement of memory can do wonders for speedup!</p>
<ul>
<li>Below a certain size, it is best to use the simplest algorithm, as the parallel overheads are overwhelming.</li>
<li>Where possible, arranging the storage in a cache and memory controller friendly layout pushes the performance envelope of every algorithm, up to a point.</li>
<li>Simply parallelizing the simple algorithm is best for large data sets.</li>
<li>As simple parallel performance begins to decay, a locality enhancing algorithm will outperform it.</li>
</ul>
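<p>The first three points can be combined in one routine. The sketch below is an illustration under stated assumptions, not the benchmark code: square matrices, A stored row major and B stored column major (the <code>rc</code> layout), and a hypothetical cutoff <code>kParallelThreshold</code> whose value of 128 is a placeholder that would have to be measured on the target machine. It relies on the standard OpenMP <code>if</code> clause, which runs the loop serially when the condition is false.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: below this dimension the loop runs serially.
constexpr std::size_t kParallelThreshold = 128;

// C = A * B for n x n matrices.
// A is row major:     A(i, k) lives at A[i * n + k]
// Bcol is col major:  B(k, j) lives at Bcol[j * n + k]
// C is row major:     C(i, j) lives at C[i * n + j]
std::vector<double> multiplyRC(const std::vector<double>& A,
                               const std::vector<double>& Bcol,
                               std::size_t n) {
    assert(A.size() == n * n && Bcol.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    const long long rows = static_cast<long long>(n);

    // Each iteration of the outer loop writes a distinct row of C, so the
    // iterations are independent. Without OpenMP enabled at compile time the
    // pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for schedule(static) if (n >= kParallelThreshold)
    for (long long i = 0; i < rows; ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            // Both operands are read with unit stride in the inner loop.
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[row * n + k] * Bcol[j * n + k];
            }
            C[row * n + j] = sum;
        }
    }
    return C;
}
```

<p>With GCC or Clang, building with <code>-fopenmp</code> enables the parallel path. For the largest sizes, where the fourth point applies, the loop body would be swapped for a blocked variant.</p>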
<table width="100%">
<tr>
<td><img src="wp-content/uploads/2010/01/cpar.png"  width="85%"/></td>
</tr>
</table>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=360</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ParaPLoP &#8216;10</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=354</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=354#comments</comments>
		<pubDate>Fri, 09 Apr 2010 13:42:44 +0000</pubDate>
		<dc:creator>Rakesh</dc:creator>
				<category><![CDATA[announcements]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=354</guid>
		<description><![CDATA[It&#8217;s been a week since I got back from the ParaPLoP &#8216;10 workshop &#8211; it was great to meet with a group of people doing so much to bring parallel programming into the mainstream, and it was a great learning experience.
Ade Miller presented the Task Graph pattern from Microsoft TPL. As a TBB user, I find strong parallels [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a week since I got back from the ParaPLoP &#8216;10 workshop &#8211; it was great to meet with a group of people doing so much to bring parallel programming into the mainstream, and it was a great learning experience.</p>
<p>Ade Miller presented the Task Graph pattern from Microsoft TPL. As a TBB user, I find strong parallels between them; TPL is elegant, and I am keen to learn more. Arguably, some of it is syntactic sugar, but sugar is sweet! I have resisted Microsoft for many years, but it is time to concede and assimilate.</p>
<p>Just as interesting as the workshops were the side conversations &#8211; It was good also to get a sense of the future (and direction of TBB) from its architect, Arch Robison &#8211; TBB 3.0 will be out soon, and I look forward to it.</p>
<p>Dick Gabriel&#8217;s talk on the works of Christopher Alexander was fascinating. The architect&#8217;s work has been an inspiration for the design patterns community.</p>
<p>And Ralph Johnson&#8217;s view of programs as transformations struck a chord &#8211; it is indeed true that most of us revisit the same applications / algorithms over and over again, so making programs parallel is indeed a transformation. Documenting patterns is about sharing best practices, to help the rest of us with that transformation process.</p>
<p>Very interesting for me personally was hearing from Tim Mattson about the consequences for software developers of what&#8217;s cooking in the silicon furnaces at Intel. I have a feeling things are going to get interesting.</p>
<p>Processors cannot be clocked much faster any more &#8211; power consumption and heat dissipation have seen to that. But the consequences go beyond clock speed limits. Current generations of processors have deep pipelines and out-of-order instruction scheduling, to hide memory access and other internal latency.</p>
<p>Instructions are not executed in the order they are laid out &#8211; if the execution of an instruction is going to be stalled because its operands are not yet available, other instructions are executed instead. Optimized instruction scheduling is done by the hardware at run time.  Quite likely, newer processors will ditch some of this complexity to accommodate more cores.</p>
<p>That leaves a hard optimization job for compilers &#8211; some of these optimizations cannot be statically done.</p>
<p>I take two lessons from this:</p>
<p>1. Where possible, use vendor libraries: MKL or NAG for math, for example. Let the vendors deliver processor-optimized versions. I&#8217;ve come across a lot of variants of Numerical Recipes code, which will keep application developers on an optimization treadmill.</p>
<p>2. Parallel programming is no longer just for speed junkies &#8211; to maintain current performance, parallelism is going to be needed.</p>
<p>What do you think?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=354</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Clothes make the man,&#8221; so the saying goes. Developing UI</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=346</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=346#comments</comments>
		<pubDate>Thu, 08 Apr 2010 22:04:14 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[design]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=346</guid>
		<description><![CDATA[A good user interface makes (or breaks) the software, application, or web site. It’s the difference between the user’s happiness and how far they’ll end up tossing the product out their figurative (or in some cases, actual,) window.
The challenge is to create a user interface that meets both the needs of the business as well [...]]]></description>
			<content:encoded><![CDATA[<p><strong>A good user interface</strong> makes (or breaks) the software, application, or web site. It’s the difference between the user’s happiness and how far they’ll end up tossing the product out their figurative (or in some cases, actual,) window.</p>
<p>The challenge is to create a user interface that meets both the needs of the business as well as the user.  At times this is easier said than done.  Developers and designers have the unenviable task of trying to understand one another enough so that the end result is a usable, workable UI.</p>
<p>A developer may have experience constructing the front-end and back-end of an application but is not versed in design. The reverse is true for designers. Two different languages and thought processes trying to culminate in a cohesive product everyone can love.</p>
<p>To this end, information and communication are key.  Understanding your user’s mindset is just as important.</p>
<p>For instance: In today’s plug-n-play mindset, users are rarely inclined to search the ‘help’ section of their application for answers to a specific question (much to the developer’s dismay). Therefore an easy to understand interface, one that teaches your user how to use the application, must be constructed.  Integrating helpful hints, tips and instruction into the application as the user works with the product can be a good way to overcome possible obstacles. Giving the user the option to disable those hints later is also a good way to go.</p>
<p>Pitfalls can occur when the development team has become caught up in the latest trends, colors, and bells and whistles. This is where the motto – ‘Less is More’ – is a mantra many would do well to tattoo across their computer screen.</p>
<p>Identifying your users will be invaluable. Methods of observation, as well as interviews, can help determine the user’s knowledge of systems and computers in general. This also helps to factor in the user’s background and how this will affect the way they use your product.  What are their jobs?</p>
<p>What tasks does the user frequently conduct and how can your product enhance their workflow?  Answering these questions takes effort, but it can have a profound effect on your application and is well worth it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=346</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eigensystems to present at ParaPLoP 2010!</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=333</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=333#comments</comments>
		<pubDate>Tue, 16 Mar 2010 18:24:48 +0000</pubDate>
		<dc:creator>Sarah Richards</dc:creator>
				<category><![CDATA[announcements]]></category>
		<category><![CDATA[Eigen.Spaces]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[paraplop]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=333</guid>
		<description><![CDATA[From March 30-31, our very own Rakesh Joshi, co-founder of eigen.systems, will be presenting at the 2010 ParaPLoP workshop in Carefree, AZ.
ParaPLoP is an interactive conference where pattern authors present their case studies and share expertise in the field of parallel programming patterns. It is a great arena for pattern professionals and enthusiasts alike to [...]]]></description>
			<content:encoded><![CDATA[<p>From March 30-31, our very own Rakesh Joshi, co-founder of eigen.systems, will be presenting at the <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/index.html">2010 ParaPLoP workshop</a> in Carefree, AZ.</p>
<p>ParaPLoP is an interactive conference where pattern authors present their case studies and share expertise in the field of parallel programming patterns. It is a great arena for pattern professionals and enthusiasts alike to analyze previously published patterns, learn about using patterns to develop parallel software and discover how to mine patterns from significant parallel code implementations.</p>
<p>Rakesh&#8217;s paper, <strong><em>CONCURRENT EVALUATION OF A DIRECTED ACYCLIC GRAPH</em></strong>, was chosen to be presented (exact day and time TBD). The paper discusses latent parallelism and the benefit and application of concurrency across a variety of platforms. </p>
<p>So if you are in the AZ area, sign up to attend the ParaPLoP 2010 conference! For more information about ParaPLoP 2010, please visit the official conference site <a href="http://www.upcrc.illinois.edu/workshops/paraplop10/index.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=333</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eigen.Systems launches the eigen.spaces: blackbird application server</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=325</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=325#comments</comments>
		<pubDate>Tue, 23 Feb 2010 19:25:38 +0000</pubDate>
		<dc:creator>Sameer Tipnis</dc:creator>
				<category><![CDATA[press room]]></category>
		<category><![CDATA[blackbird]]></category>
		<category><![CDATA[Eigen.Spaces]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=325</guid>
		<description><![CDATA[PRINCETON, NJ, February 23, 2010 – Eigen.Systems LLC announced today the launch of eigen.spaces:blackbird, an application server for multi-core parallel applications based on the eigen.spaces framework. Eigen.Systems is a software development company specializing in high-performance parallel applications for risk management and high frequency trading. Blackbird is Eigen.Systems’ fastest, most powerful application server yet; packed with [...]]]></description>
			<content:encoded><![CDATA[<p>PRINCETON, NJ, February 23, 2010 – <strong>Eigen.Systems LLC announced today the launch of eigen.spaces:blackbird</strong>, an application server for multi-core parallel applications based on the eigen.spaces framework. Eigen.Systems is a software development company specializing in high-performance parallel applications for risk management and high frequency trading. Blackbird is Eigen.Systems’ fastest, most powerful application server yet, packed with features such as many-core grid application management, fast messaging, flexible protocol switching, adaptive load balancing and advanced performance instrumentation.</p>
<p>Eigen.Systems has been providing high speed parallel application development tools since the launch of eigen.spaces in 2008. Since then they have continued to enhance eigen.spaces to support an increasingly rich variety of risk management and trading applications. The release of blackbird builds on the eigen.spaces framework to create an environment where performance is maximized through increased concurrency and cache .</p>
<p>Rakesh Joshi of Eigen.Systems said: “The key to getting the most of many-core lies in understanding application behavior under varying loads. Synchronization costs and latencies can best be understood by sampling instrumentation; by combining low overhead instrumentation with the ability to reallocate resources at run-time, blackbird is a tool to determine the best resource assignment and scheduling strategies to meet performance requirements.”</p>
<p>An Intel Software Partner since 2008, Eigen.Systems have developed high performance parallel frameworks based on Intel Threading Building Blocks (TBB) and the Intel Math Kernel Library (MKL).<br />
Two years ago Eigen.Systems entered the Intel Software program, utilizing high performance, processor-optimized threading libraries from Intel (TBB / MKL) in their products, including blackbird.</p>
<p>Blackbird will be available as a limited trial version as well as a full version for purchase.</p>
<p>About Eigen.Systems<br />
Eigen.Systems was founded in 2006 by a group of investment banking technologists, with over 40 years of combined industry experience in risk management and capital markets. With insight born of long experience in all facets of fixed income, FX, equity and credit markets, they strive to build strong customer relationships based on a track record of effective delivery and a keen focus on doing what is best for the client’s business. </p>
<p>Press contact: Sarah Richards &#8211; <a href="mailto:sarah@eigensystems.com">sarah@eigensystems.com</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=325</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eigen.Systems launches eigen.spaces v2</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=314</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=314#comments</comments>
		<pubDate>Tue, 23 Feb 2010 17:43:04 +0000</pubDate>
		<dc:creator>Sarah Richards</dc:creator>
				<category><![CDATA[press room]]></category>
		<category><![CDATA[Eigen.Spaces]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=314</guid>
		<description><![CDATA[PRINCETON, NJ, February 23, 2010 – Eigen.Systems LLC announced today the launch of eigen.spaces v 2, a C++ concurrency toolkit for multi-core development.  Eigen.spaces v 2 enhances the functionality and ease of use of eigen.spaces, allowing the user to define object types, specify dependencies and connect data sources with speed and simplicity.
Application patterns such [...]]]></description>
			<content:encoded><![CDATA[<p><strong>PRINCETON, NJ, February 23, 2010 – Eigen.Systems LLC announced today the launch of eigen.spaces v 2</strong>, a C++ concurrency toolkit for multi-core development.  Eigen.spaces v 2 enhances the functionality and ease of use of eigen.spaces, allowing the user to define object types, specify dependencies and connect data sources with speed and simplicity.</p>
<p>Application patterns such as dependency graphs, data parallel applications and request servers allow existing sequential applications to be easily refactored to take advantage of current and upcoming many-core processors.</p>
<p>In conjunction with blackbird, an application server created by Eigen.Systems, multi-core eigen.spaces based applications can be monitored and tuned for best performance under varying workloads.</p>
<p>High performance I/O is delivered using flexible, asynchronous connectors that can be configured to operate with files, databases or various network protocols.</p>
<p>Concurrent patterns adapted to the needs of computational finance allow for existing sequential analytics to be bound to a parallel execution framework, preserving decades of investment in analytics models while tapping the power of many-core computing.</p>
<p>Metadata controlled serialization mechanisms provide compact, binary serialization to maximize network performance while preserving the ability to communicate with external systems using XML, CSV, JSON and other representations.</p>
<p>Rakesh Joshi of Eigen.Systems says “Processor technology has decisively shifted to multi-core, with the result that hardware upgrades can indeed slow down existing sequential applications. With that,  concurrent programming utilizing such tools as eigen.spaces v 2 has become the new imperative if the benefits of new technology are to be realized.”</p>
<p>Evaluation versions of Eigen.spaces v 2 are available upon request. EigenSystems also offer training programs in parallel application development, and consulting services  to accelerate migration to multi-core.</p>
<p>An Intel Software Partner since 2008, Eigen.Systems have developed high performance parallel frameworks based on Intel Threading Building Blocks (TBB) and the Intel Math Kernel Library (MKL).</p>
<p>About Eigen.Systems<br />
Eigen.Systems was founded in 2006 by a group of investment banking technologists, with over 40 years of combined industry experience in risk management and capital markets. With insight born of long experience in all facets of fixed income, FX, equity and credit markets, they strive to build strong customer relationships based on a track record of effective delivery and a keen focus on doing what is best for the client’s business. </p>
<p>Press contact: Sarah Richards &#8211; <a href="mailto:sarah@eigensystems.com">sarah@eigensystems.com</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=314</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>the force is with us!</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=298</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=298#comments</comments>
		<pubDate>Wed, 17 Feb 2010 14:39:05 +0000</pubDate>
		<dc:creator>Rakesh</dc:creator>
				<category><![CDATA[announcements]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Model.Bricks]]></category>
		<category><![CDATA[NVIDIA]]></category>
		<category><![CDATA[OpenCL]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=298</guid>
		<description><![CDATA[
EigenSystems joins NVIDIA Partnerforce! We are readying OpenCL components for Model.Bricks v.2, our suite of parallel analytics components.

Stay tuned for beta announcements!
]]></description>
			<content:encoded><![CDATA[<p><img src="wp-content/uploads/2010/01/nvidia_pf.jpg" alt="nvidia partnerforce" width="15%"/><br />
EigenSystems joins NVIDIA Partnerforce! We are readying OpenCL components for Model.Bricks v.2, our suite of parallel analytics components.
</p>
<p>Stay tuned for beta announcements!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=298</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>racing stripes</title>
		<link>http://www.eigensystems.com/eigenBlog/?p=84</link>
		<comments>http://www.eigensystems.com/eigenBlog/?p=84#comments</comments>
		<pubDate>Tue, 26 Jan 2010 23:08:38 +0000</pubDate>
		<dc:creator>Rakesh</dc:creator>
				<category><![CDATA[patterns]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[Eigen.Spaces]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[grid]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[SPMD]]></category>

		<guid isPermaLink="false">http://www.eigensystems.com/eigenBlog/?p=84</guid>
		<description><![CDATA[The first step in migrating a sequential program to parallel code almost always involves identifying opportunities for concurrency. "Almost", because many useful applications are inherently data-parallel. The popularity of SPMD (single program, multiple data) grids is proof.
]]></description>
			<content:encoded><![CDATA[<p>The first step in migrating a sequential program to parallel code almost always involves identifying opportunities for concurrency. &#8220;Almost&#8221;, because many useful applications are clearly data-parallel &#8211; which is why SPMD (single program, multiple data) grids are so popular.<br />
<br />
Adapting a grid-friendly application to many-core parallelism can yield a significant speedup. We’ll use a common grid application &#8211; portfolio VaR &#8211; as our first example. For a large number of simulated scenarios, a population of transactions is valued. The values computed for each simulation are sampled in a histogram for analysis. We will pretend that there is a histogram per trade, though in practice some level of aggregation is common.<br />
<br />
The sequential logic (stylized) looks something like:</p>
<blockquote><p><code><br />
histogram.init( N_SIM );<br />
<br />
for ( i = 0; i &lt; N_SIM; ++i ) {<br />
<indent_1>for ( k = 0; k &lt; N_TRADE; ++k ) {</indent_1><br />
<indent_2>histogram.sample( k, trade[ k ].value( scenario[ i ] ) );</indent_2><br />
<indent_1>}</indent_1><br />
}<br />
</code></p></blockquote>
<p>
Applying map-reduce, it is possible to partition either the set of sample scenarios or the set of trades to be valued. The typical grid application executes a sequential process on many cores, with one partitioned subset of the data assigned to each core. When all partitions are processed, a reduction step consolidates the results.<br />
<br />
In such an implementation, each core acts in isolation. Recent processor architectures offer further, substantial performance gains from much larger shared, on-chip caches. To realize this gain, logic must be restructured so that the “memory wall” cost is minimized; as the figures below show, main memory access is expensive.<br />
<br />
<b><i>the memory wall</i></b><br />
<br />
As the table below shows, main memory is plentiful but many times more expensive to access than cache. The cache subsystem is much faster, but also small, so logic flow must be explicitly designed to take advantage of it. Latencies vary with processor technology, but the relative cost of a main memory access is high.</p>
<table>
<tr>
<td  width="70%">
<img src="wp-content/uploads/2010/01/memory.png" alt="memory hierarchy: stall penalty v/s size" width="90%"/>
</td>
<td align="left"  width="30%">
<a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&#038;p=2">[1] Inside Nehalem</a><br />
<a href="http://software.intel.com/en-us/articles/quantify-memory-stall-penalties-on-64-bit-architecture/">[2] Memory Stall</a>
</td>
</tr>
</table>
<div display="block"></div>
<p><b><i>parallelize this!</i></b><br />
<br />
A direct parallel implementation using the parallel_for template in <a href="http://www.threadingbuildingblocks.org/documentation.php">Intel TBB</a> is shown below. There is some performance gain relative to a distributed grid, since scenarios need only be generated once and can then be shared by all threads.<br />
<br />
Ordering trades by type will help performance by improving cache residency of pricing algorithm code.</p>
<blockquote><p><code><br />
class Simulate {<br />
<indent_2>const ScenarioList &#038; _scenarios;</indent_2><br />
<indent_2>const TradeList &#038; _trades;</indent_2><br />
<indent_2>Histogram &#038; _histogram;</indent_2><br />
<br />
<indent_1>public:</indent_1><br />
<indent_2>Simulate(</indent_2><br />
<indent_3>const ScenarioList &#038; scenarios,</indent_3><br />
<indent_3>const TradeList &#038; trades,</indent_3><br />
<indent_3>Histogram &#038; histogram</indent_3><br />
<indent_2>) : _scenarios( scenarios ), _trades( trades ), _histogram( histogram ) {}</indent_2><br />
<br />
<indent_2>void operator()( const blocked_range&lt;size_t&gt;&#038; range ) const {</indent_2><br />
<indent_3>for( size_t i = range.begin(); i != range.end(); ++i ) {</indent_3><br />
<indent_4>for( size_t k = 0; k &lt; N_TRADE; k++ ) {</indent_4><br />
<indent_5>_histogram.sample( k, _trades[ k ].value( _scenarios[ i ] ) );</indent_5><br />
<indent_4>}</indent_4><br />
<indent_3>}</indent_3><br />
<indent_2>}</indent_2><br />
};<br />
<br />
...<br />
parallel_for( blocked_range&lt;size_t&gt;( 0, N_SIM ), Simulate( scenarios, trades, histogram ) );<br />
</code></p></blockquote>
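<p>One caveat with the direct version: every worker samples into the same histogram, so it is only safe if <code>Histogram::sample</code> is thread-safe. The map-reduce shape described earlier avoids the question by giving each worker a private histogram and merging them at the end. Below is a minimal sketch of that idea using plain <code>std::thread</code> rather than TBB. The <code>Scenario</code>, <code>Trade</code> and <code>Histogram</code> types and the <code>simulateParallel</code> function are hypothetical stand-ins invented for illustration, not the classes used above.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; the post's Scenario, Trade and Histogram are not shown.
struct Scenario { double shock; };
struct Trade {
    double notional;
    double value( const Scenario & s ) const { return notional * s.shock; }
};

// One row of unit-width buckets per trade; values are clamped to the bucket range.
struct Histogram {
    std::size_t nBuckets;
    std::vector<std::size_t> counts;  // row-major: one row of nBuckets per trade
    Histogram( std::size_t nTrade, std::size_t buckets )
        : nBuckets( buckets ), counts( nTrade * buckets, 0 ) {}
    void sample( std::size_t k, double v ) {
        const double clamped = std::min( std::max( v, 0.0 ), double( nBuckets - 1 ) );
        ++counts[ k * nBuckets + static_cast<std::size_t>( clamped ) ];
    }
    void merge( const Histogram & other ) {
        for ( std::size_t j = 0; j < counts.size(); ++j ) counts[ j ] += other.counts[ j ];
    }
};

// Map: each thread values every trade over its own block of scenarios,
// sampling into a private histogram. Reduce: the private histograms are merged.
Histogram simulateParallel( const std::vector<Scenario> & scenarios,
                            const std::vector<Trade> & trades,
                            std::size_t nBuckets, std::size_t nThreads ) {
    assert( nBuckets > 0 && nThreads > 0 );
    std::vector<Histogram> partial( nThreads, Histogram( trades.size(), nBuckets ) );
    std::vector<std::thread> workers;
    const std::size_t chunk = ( scenarios.size() + nThreads - 1 ) / nThreads;
    for ( std::size_t t = 0; t < nThreads; ++t ) {
        const std::size_t begin = std::min( t * chunk, scenarios.size() );
        const std::size_t end = std::min( begin + chunk, scenarios.size() );
        workers.emplace_back( [&, t, begin, end] {
            for ( std::size_t i = begin; i < end; ++i )
                for ( std::size_t k = 0; k < trades.size(); ++k )
                    partial[ t ].sample( k, trades[ k ].value( scenarios[ i ] ) );
        } );
    }
    for ( auto & w : workers ) w.join();

    Histogram total( trades.size(), nBuckets );
    for ( const auto & h : partial ) total.merge( h );
    return total;
}
```

<p>Because bucket counts are integers, the merged result does not depend on the number of threads. The price is one extra histogram per worker, which may or may not be acceptable for large trade populations.</p>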
<p>
<b><i>the kernel partitioner</i></b><br />
<br />
A closer look at the memory access behaviour of the direct parallel_for reveals a high cost in main memory cycles. Since neither the trade set nor the scenario set is small enough to fit into cache, each trade must be reloaded once for every scenario.<br />
<br />
Borrowing a pattern from GPU programming, the <code>EigenSpace::KernelPartition</code> pattern divides the data set into &#8220;stripes&#8221;. A stripe contains a number of kernels, each of which is bound to a set of trades <i>and</i> a set of scenarios.<br />
<br />
Kernels contained in a stripe are executed in parallel; iteration over stripes is sequential. Each stripe references a disjoint set of trades, ensuring that each trade is loaded only once. Scenarios are re-loaded once per stripe.<br />
</p>
<table>
<tr>
<td  width="70%">
<img src="wp-content/uploads/2010/01/stripes.png" alt="kernel stripes" width="90%"/>
</td>
</tr>
</table>
<p>Given:</p>
<p><indent_1>T = total number of trades</indent_1></p>
<p><indent_1>S = total number of scenarios</indent_1></p>
<p><indent_1>C = number of sockets (cores on a socket can share L3 cache)</indent_1></p>
<p><indent_1>Ws (stripe width) = number of trades per stripe</indent_1></p>
<p><indent_1>Ns = number of stripes = ( T / Ws )</indent_1></p>
<p></p>
<table>
<caption><b><i>memory access performance</i></b></caption>
<tr>
<td>with striping</td>
<td><b>O[ </b>( T + S * Ns / C ) * C<b> ]</b></td>
</tr>
<tr>
<td>without striping (Ns = T)</td>
<td><b>O[ </b>T * S<b> ]</b></td>
</tr>
<tr>
<td>limiting case (Ns = 1, C = 1)</td>
<td><b>O[ </b>T + S<b> ]</b></td>
</tr>
</table>
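<p>To get a feel for the numbers, the expressions in the table can be evaluated directly. The sketch below takes them at face value as counts of main-memory loads; the <code>AccessCost</code> struct and <code>accessCost</code> function are hypothetical names introduced only for this illustration, and the inputs in the note that follows are made up.</p>

```cpp
#include <cassert>
#include <cstddef>

// Load counts from the table above, treated as unit-cost estimates.
struct AccessCost {
    std::size_t striped;    // ( T + S * Ns / C ) * C
    std::size_t unstriped;  // T * S
};

// T: trades, S: scenarios, Ws: trades per stripe, C: sockets.
AccessCost accessCost( std::size_t T, std::size_t S, std::size_t Ws, std::size_t C ) {
    assert( Ws > 0 && C > 0 );
    const std::size_t Ns = ( T + Ws - 1 ) / Ws;  // number of stripes, rounded up
    return AccessCost{ ( T + S * Ns / C ) * C, T * S };
}
```

<p>With T = 10,000 trades, S = 5,000 scenarios, Ws = 100 and C = 2, this gives Ns = 100 and a striped estimate of 520,000 loads against 50,000,000 without striping. Setting Ws = T and C = 1 reproduces the limiting case, T + S = 15,000.</p>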
<blockquote><p><code><br />
class Simulate {<br />
<indent_2>Histogram &#038; _histogram;</indent_2><br />
<br />
<indent_1>public:</indent_1><br />
<indent_2>Simulate( Histogram &#038; histogram )</indent_2><br />
<indent_3>: _histogram( histogram ) {}</indent_3><br />
<br />
<indent_2>void operator()( const Scenario &#038; aScenario, const Trade &#038; aTrade ) const {</indent_2><br />
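<indent_3>// k: index of aTrade in the trade list; how the kernel obtains it is not shown</indent_3><br />
<indent_3>_histogram.sample( k, aTrade.value( aScenario ) );</indent_3><br />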
<indent_2>}</indent_2><br />
};<br />
<br />
...<br />
typedef EigenSpace::KernelPartition&lt;TradeList, ScenarioList, Simulate&gt; Stripes;<br />
<br />
...<br />
Stripes stripes( trades, scenarios );<br />
stripes.execute();<br />
</code></p></blockquote>
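<p>For readers curious about what a striped loop might look like underneath, here is a rough sketch of the iteration order only. It is not the <code>EigenSpace::KernelPartition</code> implementation. It assumes that each kernel in a stripe takes a disjoint slice of that stripe's trades and runs them against every scenario, which is one of several ways to bind kernels to trades and scenarios. The <code>Scenario</code>, <code>Trade</code> and <code>Totals</code> types and the <code>runStriped</code> function are hypothetical, and a running total per trade stands in for the histogram.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; the post's Trade, Scenario and Histogram are not shown.
struct Scenario { double shock; };
struct Trade {
    double notional;
    double value( const Scenario & s ) const { return notional * s.shock; }
};

// Accumulates one running total per trade in place of a full histogram.
struct Totals {
    std::vector<double> sum;
    explicit Totals( std::size_t nTrade ) : sum( nTrade, 0.0 ) {}
    void sample( std::size_t k, double v ) { sum[ k ] += v; }
};

// Stripes run one after another; the kernels inside a stripe run concurrently.
// Each kernel owns a disjoint slice of the stripe's trades, so no two threads
// ever write the same entry of `out`.
void runStriped( const std::vector<Trade> & trades,
                 const std::vector<Scenario> & scenarios,
                 std::size_t stripeWidth,       // Ws: trades per stripe
                 std::size_t kernelsPerStripe,  // assumed tuning parameter
                 Totals & out ) {
    assert( stripeWidth > 0 && kernelsPerStripe > 0 );
    const std::size_t slice = ( stripeWidth + kernelsPerStripe - 1 ) / kernelsPerStripe;
    for ( std::size_t first = 0; first < trades.size(); first += stripeWidth ) {
        const std::size_t last = std::min( first + stripeWidth, trades.size() );
        std::vector<std::thread> kernels;
        for ( std::size_t lo = first; lo < last; lo += slice ) {
            const std::size_t hi = std::min( lo + slice, last );
            kernels.emplace_back( [&trades, &scenarios, &out, lo, hi] {
                // Scenario-major order: the slice of trades is reused for
                // every scenario while the scenarios stream past once.
                for ( std::size_t i = 0; i < scenarios.size(); ++i )
                    for ( std::size_t k = lo; k < hi; ++k )
                        out.sample( k, trades[ k ].value( scenarios[ i ] ) );
            } );
        }
        for ( auto & kernel : kernels ) kernel.join();  // finish the stripe before the next
    }
}
```

<p>The stripe width is the knob: it should be small enough that a stripe's trades stay resident in cache, and large enough that the scenarios are not re-streamed more often than necessary. Spawning threads per stripe keeps the sketch short; a real implementation would more likely reuse a thread pool or a task scheduler.</p>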
<p><i>Graphics cards already approach the limiting case and can really light the afterburners, with 128 to 512 cores and 256 MB to 1 GB of high-speed VRAM. However, many pricing analytics libraries are large and have deep, intertwined inheritance hierarchies that are not easily expressed as GPU-compilable kernels.</i><br />
<br />
(Model.Bricks 2.0 will offer GPU kernels for common computations in finance. Watch this space for announcements!)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eigensystems.com/eigenBlog/?feed=rss2&amp;p=84</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
