<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>Matthieu Brucher&#039;s blog &#187; Distributed Computing</title> <atom:link href="http://matt.eifelle.com/category/general/distributed-computing/feed/" rel="self" type="application/rss+xml" /><link>http://matt.eifelle.com</link> <description></description> <lastBuildDate>Tue, 27 Jul 2010 07:04:23 +0000</lastBuildDate> <generator>http://wordpress.org/?v=2.9.1</generator> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <item><title>Optimally use massively parallel clusters resources</title><link>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/</link> <comments>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/#comments</comments> <pubDate>Tue, 15 Jun 2010 07:53:10 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[Batch scheduling]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=1208</guid> <description><![CDATA[We now have several petaflop-class clusters available in the Top500. Of course, we are trying to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.
I&#8217;ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively [...]]]></description> <content:encoded><![CDATA[<p>We now have <a
href="http://www.top500.org/">several petaflop-class clusters available in the Top500</a>. Of course, we are trying to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.</p><p>I&#8217;ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively data parallel. Traditionally, one launches a job through one&#8217;s favorite batch scheduler (favorite or mandatory&#8230;) with fixed resources for an estimated amount of time. This may work well in research, but in the industrial world, there is often a new job that arises and needs part of your scarce resources. You may have to stop your work, lose your current progress and/or restart the job with fewer resources. And then the cycle goes on.</p><p><span
id="more-1208"></span></p><h4>Static resource allocation</h4><p>How can resource allocation work? Let&#8217;s start with a simple case where you have 2 applications with different priorities. One of them has a priority of 70 (it&#8217;s supposed to finish in three days) whereas the other one has a priority of 50 (four days left). They share the cluster so that 66% is allocated to the first application and 33% to the second one.<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-2.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-2-300x165.png" alt="" title="Dispatch and allocation of two applications" width="300" height="165" class="aligncenter size-medium wp-image-1241" /></a></center></p><p>What happens if a third application must be launched with a higher priority, because it has to be finished by tomorrow? You may stop the other two programs, in which case you may lose a lot of work if you didn&#8217;t implement checkpoints (besides, one of them may be an off-the-shelf program you bought yesterday), or you may suspend them. Either way, this is what you will get:<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-3.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-3-300x165.png" alt="" title="Dispatch and allocation for three applications" width="300" height="165" class="aligncenter size-medium wp-image-1242" /></a></center></p><p>In fact, even if you use dynamic resource allocation, this is what you must get to have your results by the time you need them, but obviously, you have lost your two other applications. Some batch schedulers allow applications to be suspended, but this is a double-edged sword:</p><ul><li>your cluster must support job suspension, and thus have access to drives to save the job state (which is not possible for medium to large-scale clusters)</li><li>if your application does not scale to your entire cluster (it happens), one of the other two applications could in principle go on, but it cannot: all its processes are put to sleep</li></ul><p>So all things considered, you have to implement dynamic resource allocation.</p><h4>Dynamic resource allocation</h4><p>How does this work? Each application must be aware that it can be allocated more resources, or have some taken away, at any time. To be portable across all clusters, you cannot suspend part of your program; it must really go away. The batch scheduler must also notice that your application has freed some of its resources. You thus have to allocate small jobs that will communicate together (this can be done with MPI-2).</p><p>This means that you will have hundreds or thousands of small jobs. Not all of them have to be connected to the scheduler; only one master must be. Of course, this can easily be done by using a specific queue. Each application on this queue will thus receive orders from the batch scheduler and act upon them. Another advantage is that even if the application gets no resources at some point, it still has a saved state that enables the run to continue later.<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Dynamic-workflow.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Dynamic-workflow-300x165.png" alt="" title="Dynamic resource allocation workflow" width="300" height="165" class="aligncenter size-medium wp-image-1244" /></a></center></p><p>Of course, this is not easy to do. How can this be applied to an off-the-shelf application? Well, in this case, you may create a bogus application on the master queue that will at least allow other applications to be allocated resources beside it.</p><p>You do not have to implement this on top of MPI. It can be really hard to do (handling data moves between processors, changing the decomposition, &#8230;), and you may implement another solution. In my case, I have thousands of different tasks that can each run on very few cores, so this is my elementary unit. I don&#8217;t need all tasks to communicate with each other, so each time I create brand new independent jobs, and I can also tell the scheduler that it can kill jobs that are not responding before the next allocation phase.</p><h4>Conclusion</h4><p>To finish, I&#8217;ll say that I know that <a
href="http://www.platform.com/">LSF</a> allows plugins that help dispatch jobs on specific hosts of your cluster (to get the best communication locality). There seems to be a way of implementing the needs gathering and the resource assignment, but the documentation is not clear (at all). A specific daemon may be needed. I don&#8217;t know if other batch schedulers allow plugins to modify their behavior; if you know of any and their API, please do tell <img
src='http://matt.eifelle.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Book review: Programming Massively Parallel Processors: A Hands-on Approach</title><link>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/</link> <comments>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/#comments</comments> <pubDate>Wed, 31 Mar 2010 07:42:28 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Book review]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Morgan Kaufmann]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[CUDA]]></category> <category><![CDATA[Parallel and Distributed Computing]]></category> <category><![CDATA[Parallel computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=1161</guid> <description><![CDATA[Massively parallel processors are in fashion today. We had small parallel processors with a few cores and the ability to launch several threads on one core, we now have many cores on one processor and at the other end of the spectrum, we have GPUs. CPU vendors are now going in this direction with [...]]]></description> <content:encoded><![CDATA[<p>Massively parallel processors are in fashion today. We had small parallel processors with a few cores and the ability to launch several threads on one core, we now have many cores on one processor and, at the other end of the spectrum, we have GPUs. CPU vendors are now going in this direction with Larrabee and Fusion, and GPUs will still have more cores/threads/&#8230; It&#8217;s thus mandatory to understand this shift now.<br
/> <span
id="more-1161"></span></p><h4>Content and opinions</h4><p>First of all, it&#8217;s not a book on programming massively parallel processors, it&#8217;s a book about CUDA. One of the authors is an nVidia fellow, so it&#8217;s no wonder. I think there are three parts in the book: an introduction to CUDA, two examples and then general considerations and the future.</p><p>The first 6 chapters (I don&#8217;t count the first chapter as a real chapter, it&#8217;s more of an introduction to massively parallel processors and their use in a few pages) are the main CUDA tutorial. I say tutorial because it feels like all the beginner courses I&#8217;ve taken on CUDA. The content can be found in all Internet classes, so the only advantage is that you have everything in a book. Nothing less, nothing more.</p><p>I had a feeling of &#8220;deja vu&#8221; for the MRI example; the second was unknown to me. There is not much code, only for the relevant parts, so you won&#8217;t be able to test the different implementations with what is provided in the book. Besides, several times in the course of the text, new techniques are introduced, but one can&#8217;t know what speed-up they provide. Perhaps this is because this speed-up cannot be generalized, but still, with proper warnings, the different timings through the GPU port of both examples would have been great.</p><p>The last part is, as I&#8217;ve said, more general. It starts with a workflow to help parallelize with GPUs, then an introduction (too short IMHO) to OpenCL and the future of CUDA with Fermi and the SDK 3.0. The workflow chapter is too short. Of course, the goal isn&#8217;t to be like <a
href="http://matt.eifelle.com/2009/12/08/book-review-the-art-of-concurrency-a-thread-monkeys-guide-to-writing-parallel-applications/">The Art of Concurrency</a>, and at least there is a chapter about the process of selecting the algorithm, &#8230; but it is too short. The OpenCL introduction is really an introduction. I&#8217;ve seen one small complete OpenCL call, but that&#8217;s it. I couldn&#8217;t program a single kernel right now. Of course it&#8217;s a CUDA book, not an OpenCL one, but the chapter is useless. Perhaps it would be better to merge it with the &#8220;future&#8221; chapter, as OpenCL is not widely available. Finally, the last chapter states what can be expected of Fermi (really interesting) and of the SDK 3.0.</p><p>What I miss in this book is some explanation of texture memory. The obvious matrix example uses constant memory for caching the memory accesses. Why isn&#8217;t texture memory used in this example? It&#8217;s far bigger than constant memory and also has a cache, so why not use it? It&#8217;s a CUDA book, but a lot of its content is freely available in several tutorials that are sometimes better shaped than the book, so why isn&#8217;t there some special content, like how the cache works? How can you manage grid sizes that are not a power of two? (it&#8217;s explained in one of the examples, with zero padding, but there is no protection in the first chapters, which is dangerous) What is memory coalescing and how can I optimize the memory bandwidth with coalescing in mind? (the actual explanation and the appropriate picture are in the last appendix!)</p><h4>Conclusion</h4><p>I don&#8217;t say that the book is not useful; it&#8217;s really interesting as a companion book for a CUDA course or for a beginner. If you&#8217;re used to electronic papers, you will not be interested. If you buy this book, don&#8217;t expect to know everything about CUDA, let alone about massively parallel processors. 
You will have to dig deeper for specific topics, but at least you will have a good basis.</p><p>Programming Massively Parallel Processors: A Hands-on Approach (Paperback), by <strong>David B. Kirk, Wen-mei W. Hwu</strong>. ISBN: 0123814723</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Thinking of good practices when developing with accelerators</title><link>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/</link> <comments>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/#comments</comments> <pubDate>Tue, 05 Jan 2010 08:48:57 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Design Patterns]]></category> <category><![CDATA[Development process]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[CUDA]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[Grid computing]]></category> <category><![CDATA[HMPP]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[Scientific computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=997</guid> <description><![CDATA[Due to the end of the free lunch, manufacturers started to provide different processing units and developers started to go parallel. It&#8217;s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). Those accelerators were eventually integrated into the CPU, and so were their instruction sets.
Today&#8217;s [...]]]></description> <content:encoded><![CDATA[<p>Due to the end of the <a
href="http://www.gotw.ca/publications/concurrency-ddj.htm">free lunch</a>, manufacturers started to provide different processing units and developers started to go parallel. It&#8217;s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). Those accelerators were eventually integrated into the CPU, and so were their instruction sets.</p><p>Today&#8217;s accelerators are not there yet. The tools (code translators) are not ready yet and usual programming practices may not be adequate. The whole ecosystem will evolve and accelerators will change (GPUs are the main trend, but they will be different in a few years), so what you do today needs to be shaped with these changes in mind. How is it possible to do so? Is it even possible?<br
/> <span
id="more-997"></span></p><h4>Available code translators</h4><p>Code translators are the easiest path to a solution. I know of two.</p><p>The first is the <a
href="http://www.pgroup.com/resources/accel.htm">PGI compiler</a>. It only supports CUDA and the Fortran and C99 languages. I haven&#8217;t used it yet, although I plan to test it in the near future. It is based on pragmas, and the compiler generates the CUDA microcode.</p><p>The second solution is <a
href="http://www.caps-entreprise.com/fr/page/index.php?id=49&amp;p_p=36">HMPP</a>. It supports more than just CUDA (also CAL/IL or OpenCL) and Fortran/C (also Java now). Like the PGI compiler, it is based on pragmas, and an excellent thing is that it detects the available accelerators and launches the correct kernel (if you authorized it) or the original code. You can also replace the generated code with your own (you can tune the code for instance, which may give you an additional x2 factor). Unfortunately, it is not possible to call functions inside the parallelized kernels, which means that only simple or badly-written (too many lines or duplicated code) kernels can be called. I think this is the same for the PGI compiler.</p><p>It seems that code translators still need work:</p><ul><li>only a few accelerators are supported (CUDA, and sometimes CAL/IL or OpenCL),</li><li>very few languages are supported (Fortran/C/Java; a lot of virtual machines should be able to use accelerators natively, without developers using specific tools),</li><li>only one function can be parallelized at a time.</li></ul><p>The last point is currently the biggest issue. You need to cut your function into pieces to have clean code and good portability and maintainability for the future.</p><p>This is why one still needs to program a lot for those accelerators, and so we need to adapt our programming practices and develop in the accelerators&#8217; native languages (even if we know that they may disappear in a few years).</p><h4>Developing your own &#8220;tool chain&#8221; for accelerators</h4><p>For accelerators, there are a lot of things that need to be done each time: copying some data, computing and getting some data back. These are the steps that code translators automate; in fact, it is common practice to use tools to automate such things. The issue is that complex kernels are not supported by those translators. 
So what?</p><p>Creating automatic functions that will copy the data you need is in fact very common in metaprogramming. Coding the kernel on an accelerator is in fact not that difficult: the manufacturers provide the needed compilers (that&#8217;s what nVidia does and the success of the tool chain cannot be denied), and this is really the cornerstone. One has to write more code, and some parts are less portable (because they are written in one of the accelerator&#8217;s languages), but in the end, with metaprogramming, the code can be better tuned, enhanced and read. This is the leverage of the accelerators.</p><h4>Conclusion</h4><p>Why do we care about developing for accelerators? We know that they will go away. Before they do, they are the only way of speeding up our software. Code translators are the best tools to develop in a portable way, but they need time to support more accelerators, languages and methods of programming. When CPUs are on a par with accelerators, their progress will help compilers to target them correctly. It&#8217;s just a matter of time.<br
/> Meanwhile, metaprogramming is the next best solution to automate processes that code translators cannot support yet.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Parallel Studio: Using Advisor Lite</title><link>http://matt.eifelle.com/2009/09/22/parallel-studio-using-advisor-lite/</link> <comments>http://matt.eifelle.com/2009/09/22/parallel-studio-using-advisor-lite/#comments</comments> <pubDate>Tue, 22 Sep 2009 08:02:28 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[Interactive RayTracer]]></category> <category><![CDATA[Profiler]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[Advisor]]></category> <category><![CDATA[Intel]]></category> <category><![CDATA[Parallel Studio]]></category> <category><![CDATA[Raytracing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=647</guid> <description><![CDATA[After reviewing Parallel Studio, I&#8217;ve decided to look at Advisor Lite. Intel offers it for free, before the actual Advisor is released with a future Parallel Studio version. It aims at steering multithreaded development with Parallel Studio. I&#8217;ve started with the Starting Guide, and in fact, it is the best way to know how to use [...]]]></description> <content:encoded><![CDATA[<p>After reviewing <a
href="http://matt.eifelle.com/2009/07/07/review-of-intel-parallel-studio/">Parallel Studio</a>, I&#8217;ve decided to look at Advisor Lite. Intel offers it for free, before the actual Advisor is released with a future Parallel Studio version. It aims at steering multithreaded development with Parallel Studio.<br
/> <span
id="more-647"></span><br
/> I&#8217;ve started with the Starting Guide, and in fact, it is the best way to know how to use this plugin. Advisor offers four steps, two of them being short-cuts to the online help, and the two others link to some Parallel Studio actions (namely hotspot in Amplifier and the threaded memory check with Inspector).<br
/> The online help is interesting, but once you know how you can parallelize an application and what to look for, the two Parallel Studio actions with the help of some macros presented in the Starting Guide are the only thing you need.</p><h4>Test on parallelizing a custom library</h4><p>I&#8217;ve decided to test Advisor Lite on my <a
href="http://matt.eifelle.com/category/cpp/interactive-raytracer/">Interactive Raytracer</a>. This is a test to verify if Advisor Lite finds the adequate parallelization and the memory sharing issues. It is a simple raytracer, so it can be parallelized for each pixel in the image. The only memory sharing issue that I know of is in the kd-tree ray traversal.</p><h4>Profiling the library</h4><p>First, I will profile the library. For the complete Advisor Lite workflow, I have to use Intel Compiler, and as it is faster than Microsoft&#8217;s compiler, I will use the <strong>timeit_image.py</strong> script instead of the <strong>measure_image.py</strong> I&#8217;ve used when profiling with <a
href="http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/">Valgrind</a> or <a
href="http://matt.eifelle.com/2009/08/18/profiling-with-visual-studio-performance-tool/">Visual Studio</a>.</p><p>Amplifier can show the results in a bottom-up or in a top-down manner. Unfortunately, only the exclusive timing is displayed. In my case, when displaying bottom-up results, the method <strong>getEntryExitDistances()</strong> is the most costly one. In the top-down view, unfortunately, I can&#8217;t get a simple tree, as can be seen in the following view:</p><p><a
href="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-profile-advisor.png"><img
class="aligncenter size-medium wp-image-727" title="IRT: Amplifier profile (call-tree view)" src="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-profile-advisor-300x187.png" alt="IRT: Amplifier profile (call-tree view)" width="300" height="187" /></a></p><p>In Visual Studio, I have the same results &#8211; more or less -, but with a correct top-down call-tree:</p><p><a
href="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-profile-msvc.png"><img
class="aligncenter size-medium wp-image-695" title="Profile returned by Visual Studio Performance Tool (call-tree)" src="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-profile-msvc-300x187.png" alt="Profile returned by Visual Studio Performance Tool (call-tree)" width="300" height="187" /></a></p><p>The method <strong>getEntryExitDistances()</strong> cannot be parallelized: it is called recursively, several times per pixel, which would lead to a lot of memory contention. The simplest task is thus to parallelize the pixel rendering, a perfect data-parallel problem.</p><h4>Annotation of the code</h4><p>OK, now I can annotate my code. I had to dig inside the help for this, as it was not mentioned in the Starting Guide that Intel provides a header, <strong>annotate.h</strong>, which mimics the issues you may encounter in a multithreaded application.</p><p>So you need to read the online help at least once so that you know the available annotation macros, how you can get them and how they will retrieve what you need. Once the code is annotated, it must be recompiled and then the sharing issues can be detected.</p><h4>Detection of sharing issues</h4><p>As expected, Inspector detected errors in the kd-tree traversal:</p><p><a
href="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-advisor-annotate-correctness.png"><img
class="aligncenter size-medium wp-image-696" title="Memory sharing issues detected by Inspector" src="http://matt.eifelle.com/wp-content/uploads/2009/08/irt-advisor-annotate-correctness-300x187.png" alt="Memory sharing issues detected by Inspector" width="300" height="187" /></a><br
/> The solution in this case is to have a ray-traversal stack per thread, which will have to be implemented in whichever parallel library is chosen, or simply to put the stack in the actual traversal algorithm and not in the instance.</p><h4>Using TBB</h4><p>I&#8217;ve decided to go for Threading Building Blocks, as it was already used for game development. This seemed to me a good idea, as it is an open source solution. So now, I will split the screen into 2D pieces, and add thread-specific storage in the kd-tree class. Of course, I will have to add a flag to disable this parallelization if TBB is not available.</p><p>The actual parallelization will be in a future post in the Interactive Raytracer category. It was pretty straightforward once I had the different elements Parallel Studio gave me.</p><h4>Conclusion</h4><p>In fact Advisor is mainly the <strong>annotate.h</strong> header, as you have to know your program to put the macros at the correct locations. The parallelization must be done by hand, as well as correcting the memory sharing issues.</p><p>The only problem I had is that <strong>annotate.h</strong> includes <strong>windows.h</strong>. This header is not C++ compliant and declares some macros such as <strong>max()</strong> (in fact I got the same issue with TBB headers!). 
As I use the <strong>max()</strong> function declared in <strong>std::numeric_limits</strong>, I had to explicitly undefine this macro.</p><p>Apart from this, Advisor Lite is a good plugin, and I&#8217;m looking forward to seeing Advisor in a future Parallel Studio release.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/09/22/parallel-studio-using-advisor-lite/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Review of Intel Parallel Studio</title><link>http://matt.eifelle.com/2009/07/07/review-of-intel-parallel-studio/</link> <comments>http://matt.eifelle.com/2009/07/07/review-of-intel-parallel-studio/#comments</comments> <pubDate>Tue, 07 Jul 2009 08:18:16 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Debugger]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[General]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[Intel]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[Parallel computing]]></category> <category><![CDATA[Parallel Studio]]></category> <category><![CDATA[Visual Studio]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=606</guid> <description><![CDATA[I&#8217;ve played a little bit with Intel Parallel Studio. Let&#8217;s say it has been a pleasant trip out in the wilderness of multithreaded applications.
Intel Parallel Studio is a set of tools geared toward multithreaded applications. It consists of three Visual Studio plugins (so you need a fully-fledged Visual Studio, not an Express edition): Parallel Inspector for [...]]]></description> <content:encoded><![CDATA[<p>I&#8217;ve played a little bit with Intel Parallel Studio. Let&#8217;s say it has been a pleasant trip out in the wilderness of multithreaded applications.</p><p>Intel Parallel Studio is a set of tools geared toward multithreaded applications. It consists of three Visual Studio plugins (so you need a fully-fledged Visual Studio, not an Express edition):</p><ul><li>Parallel Inspector for memory analysis</li><li>Parallel Amplifier for thread behavior and concurrency</li><li>Parallel Composer for parallel debugging</li></ul><p>This is an update of the review I did for the beta version. Since this first review, I&#8217;ve tried the official first version.</p><p><span
id="more-606"></span><br
/> Since the beta phase, Intel added a lot of documentation, online help, as well as additional samples. This was my main complaint at that time, and now, I can say that Intel provides a complete tool with appropriate help. There is still room for improvement, but not much. For instance, here are <a
href="http://software.intel.com/en-us/articles/intel-parallel-studio-features/">videos presenting Parallel Studio</a>.</p><p>There is a simple sample to show how all the plugins can be used together, the NQueens solution, which is also the main Composer example. For Composer though, different parallelization solutions are proposed. According to the Starting Guide and other documents, Intel&#8217;s workflow consists of using Advisor (I&#8217;ll try to use it in another post), then Composer to debug the parallelization, Inspector to check for contentions, &#8230; and then Amplifier to profile the application. One of the videos is dedicated to showing how to use the plugins with the NQueens sample.</p><p>As a final point, each plugin has a specific, configurable toolbar, with a distinct icon.</p><h4>Parallel Composer</h4><p><a
href="http://software.intel.com/en-us/articles/intel-parallel-composer-features/">Parallel Composer</a> is mainly a parallel extension to Visual Studio&#8217;s debugger. It is based on an Intel runtime, which means you have to use the Intel C++ Compiler, which is provided, as well as IPP (a primitives library) and TBB (a parallel library), but not MKL, the scientific library. The 11.1 version of the compiler provides OpenMP 3.0 (Visual Studio compiler only provides 2.5) and thus task parallelism. Intel&#8217;s goal is to provide this to C developers (C++ programmers can use TBB, for instance).</p><p>The goal of the extension is to detect shared data and its implications for reentrancy (can this function be called simultaneously by different threads?), and to show the task and thread tree with OpenMP.</p><p><center><div
id="attachment_619" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/composer-debugger.png"><img
class="size-medium wp-image-619" title="Parallel Composer: the additional options during debugging" src="http://matt.eifelle.com/wp-content/uploads/2009/06/composer-debugger-300x187.png" alt="Parallel Composer: The additional options during debugging" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Composer: The additional options during debugging</p></div></center></p><p>The OpenMP panels are not only for OpenMP. They work with every extension that needs <strong>/Qopenmp</strong> (for instance the parallel extensions like <strong>__par</strong>), in which case they display useful information about the state of existing threads. It is also possible to disable multithreading and run the program in a single thread.</p><p><center><div
id="attachment_620" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/composer-debugger-views.png"><img
class="size-medium wp-image-620" title="Parallel Composer: task and thread views" src="http://matt.eifelle.com/wp-content/uploads/2009/06/composer-debugger-views-300x187.png" alt="Parallel Composer: Task and thread views" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Composer: Task and thread views</p></div></center></p><p>It seems that it is possible to debug several processes simultaneously, as TotalView does, but there is no example or tutorial explaining how to do this.</p><p>Parallel Composer is a powerful debugger extension that exposes a lot of information. On the one hand, Intel also did a good job of providing tutorials and online help. On the other hand, the documentation for the most important plugin is perhaps the shortest of the three.</p><h4>Parallel Inspector</h4><p><a
href="http://software.intel.com/en-us/articles/intel-parallel-inspector-features/">Parallel Inspector</a> is in charge of detecting general memory issues as well as thread memory issues. Depending on the inspection level, the execution time can be several times longer. Each time a problem is detected, it is assigned a severity level and recorded in a list from which you can access its location and the source code.</p><p>The first analysis is the general memory one. It detects, for instance, memory leaks. Here is a result that it can give:</p><table><tbody><tr><td
align="center"><div
id="attachment_621" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-memory.png"><img
class="size-medium wp-image-621" title="Parallel Inspector: Memory report" src="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-memory-300x187.png" alt="Parallel Inspector: memory report" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Inspector: Memory report</p></div></td><td
align="center"><p><div
id="attachment_622" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-memory-source.png"><img
class="size-medium wp-image-622" title="Parallel Inspector: location of a memory leak" src="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-memory-source-300x187.png" alt="Parallel Inspector: Location of a memory leak" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Inspector: Location of a memory leak</p></div></td></tr></tbody></table><p>Usually, this kind of detection requires modifying your code, or on Linux, preloading a library that will detect memory leaks (or using valgrind). Here, the really great point is that no modification of the code is needed and you can use the compiler of your choice.</p><p>The real addition of Inspector is of course not checking for memory leaks. Parallel Inspector is not titled &#8220;Parallel&#8221; for nothing. It can check concurrent memory accesses, and thus warn developers that some threads may read or write the same memory concurrently. Of course, once you&#8217;ve checked that an access is not dangerous, you can tell Inspector to skip it (so the inspection is faster next time).</p><table><tbody><tr><td
align="center"><p><div
id="attachment_623" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-threads.png"><img
class="size-medium wp-image-623" title="Parallel Inspector: concurrent memory access" src="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-threads-300x187.png" alt="Parallel Inspector: Concurrent memory access" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Inspector: Concurrent memory access</p></div></td><td
align="center"><p><div
id="attachment_624" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-threads-source.png"><img
class="size-medium wp-image-624" title="Parallel Inspector: Source code of a memory access" src="http://matt.eifelle.com/wp-content/uploads/2009/06/inspector-threads-source-300x187.png" alt="Parallel Inspector: Source code of a memory access" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Inspector: source code of a memory access</p></div></td></tr></tbody></table><p>Inspector is, in my opinion, the easiest-to-use plugin of Parallel Studio. I find it easy to use because memory checks are something developers always care about, so we know what to expect from it.</p><h4>Parallel Amplifier</h4><p><a
href="http://software.intel.com/en-us/articles/intel-parallel-amplifier-features/">Parallel Amplifier</a> is a profiler (I don&#8217;t know if it is instrumentation- or sampling-based) like the one you can find in Visual Studio Team edition, or like VTune, the fully-fledged profiler Intel sells as a stand-alone product. Here, you can only get the execution time, but it is still valuable information (if you need more, go and get VTune or Visual Studio Team). Then, for the Parallel profile, you can get the concurrency quality as well as waiting time.</p><p>Hotspot is the first profile you can get. The goal is to find where the application spends most of its time, which is what is called the &#8220;hotspot&#8221;. In the next example, it is <strong>algorithm2</strong>, and double-clicking on it displays the annotated source code.</p><table><tbody><tr><td
align="center"><p><div
id="attachment_625" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot.png"><img
class="size-medium wp-image-625" title="Parallel Amplifier: Hotspot profile" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot-300x187.png" alt="Parallel Amplifier: hotspot profile" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Hotspot profile</p></div></td><td
align="center"><p><div
id="attachment_627" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot-source.png"><img
class="size-medium wp-image-627" title="Parallel Studio: Hotspot annotated source code" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot-source-300x187.png" alt="Parallel Studio: Hotspot annotated source code" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Studio: Hotspot annotated source code</p></div></td></tr></tbody></table><p>How scalable is my program? This is the question the second profile tries to answer. In this case, the scalability is given in the panel at the lower right of the screen (here, for two processors, I get 1.57, which means 78.4% utilization, or efficiency). The source code can then be displayed with annotations. Here, the lack of concurrency comes from the display routines. On the other hand, <strong>algorithm2</strong> scales well. To optimize your concurrency, you need to reduce the red/&#8221;poor&#8221; part of the bar, and maximize the other ones.</p><table><tbody><tr><td
align="center"><p><div
id="attachment_628" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-concurrency.png"><img
class="size-medium wp-image-628" title="Parallel Amplifier: Concurrency" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-concurrency-300x187.png" alt="Parallel Amplifier: concurrency" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Concurrency</p></div></td><td
align="center"><p><div
id="attachment_629" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-concurrency-source.png"><img
class="size-medium wp-image-629" title="Parallel Amplifier: Concurrency annotated source code" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-concurrency-source-300x187.png" alt="Parallel Amplifier: Concurrency annotated source code" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Concurrency annotated source code</p></div></td></tr></tbody></table><p>Finally, a crucial issue is waiting and locks. Here again, Amplifier has a specific profile. In this example, the main thread only waits for the subthreads to return.</p><table><tbody><tr><td
align="center"><p><div
id="attachment_630" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-locks-waits.png"><img
class="size-medium wp-image-630" title="Parallel Amplifier: Waits and locks" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-locks-waits-300x187.png" alt="Parallel Amplifier: Waits and locks" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Waits and locks</p></div></td><td
align="center"><p><div
id="attachment_631" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-locks-waits-source.png"><img
class="size-medium wp-image-631" title="Parallel Amplifier: Annotated source code for waits and locks" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-locks-waits-source-300x187.png" alt="Parallel Amplifier: Annotated source code for waits and locks" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Annotated source code for waits and locks</p></div></td></tr></tbody></table><p>Profiling should be done regularly, and it is useful to check whether a given optimization actually improves the program. Amplifier can help you do this by comparing profiles.</p><p><div
id="attachment_626" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot-comparison.png"><img
class="size-medium wp-image-626" title="Parallel Amplifier: Profile comparison" src="http://matt.eifelle.com/wp-content/uploads/2009/06/amplifier-hotspot-comparison-300x187.png" alt="Parallel Amplifier: Profile comparison" width="300" height="187" /></a><p
class="wp-caption-text">Parallel Amplifier: Profile comparison</p></div><p>Amplifier comes with several examples and good online help. It is not meant to be a full guide to optimization (there are complete books dedicated to this topic), but it gives you access to the tools you need and some leads on how to use them correctly.</p><h4>Conclusion</h4><p>While Amplifier and Inspector are intuitive and simple to use, the same cannot quite be said of Composer. Intel provides several videos as tutorials to help you use all the plugins, as well as complete guides and samples. Parallel Composer is perhaps less documented, but above all it is more complicated to use, at least from my point of view.</p><p>This product is very helpful, in my opinion, not code intrusive (I&#8217;m thinking about Amplifier and Inspector, which detect issues without additional libraries) and efficient. The issues it tackles are not easy ones to solve, and it handles them brilliantly. Since the beta phase, Intel has done a tremendous job of providing better documentation for its tool, and in my opinion it is now the best tool for multithreaded development.</p><p>Dr. Dobb&#8217;s published a few days ago <a
href="http://www.ddj.com/hpc-high-performance-computing/218101819">a small post</a> on what is needed for multithreaded application development, and it said Parallel Studio is the perfect tool to help with this.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/07/07/review-of-intel-parallel-studio/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Review of Intel Parallel Studio (beta)</title><link>http://matt.eifelle.com/2009/06/09/review-of-intel-parallel-studio-beta/</link> <comments>http://matt.eifelle.com/2009/06/09/review-of-intel-parallel-studio-beta/#comments</comments> <pubDate>Tue, 09 Jun 2009 08:33:01 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Debugger]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[Intel]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[Parallel computing]]></category> <category><![CDATA[Parallel Studio]]></category> <category><![CDATA[Visual Studio]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=502</guid> <description><![CDATA[Since this post, Intel has officially released Parallel Studio. This is why I&#8217;ve published a new, up-to-date review here.]]></description> <content:encoded><![CDATA[<p>Since this post, Intel has officially released Parallel Studio. This is why I&#8217;ve published <a
href="http://matt.eifelle.com/2009/07/07/review-of-intel-parallel-studio/">a new, up-to-date review here</a>.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/06/09/review-of-intel-parallel-studio-beta/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Overview of TotalView, a parallel debugger</title><link>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/</link> <comments>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/#comments</comments> <pubDate>Tue, 31 Mar 2009 08:15:12 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Debugger]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[OpenMP]]></category> <category><![CDATA[Parallel computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=265</guid> <description><![CDATA[Some months ago, I had a TotalView tutorial, thanks to my job. Now, I&#8217;ve actually used it to debug one of my parallel applications and I would like to share my experience with this fantastic tool.
First, TotalView is not only a parallel debugger available on several Linux and Unix platforms. It is also a memory checker [...]]]></description> <content:encoded><![CDATA[<p>Some months ago, I had a TotalView tutorial, thanks to my job. Now, I&#8217;ve actually used it to debug one of my parallel applications and I would like to share my experience with this fantastic tool.<br
/> First, TotalView is not only a parallel debugger available on several Linux and Unix platforms. It is also a memory checker (MemoryScape and the TotalView plugin) as well as a reverse debugger, that is, you can roll back the execution of a program, even after it has crashed (which is not possible with a standard debugger like GDB).<br
/> <span
id="more-265"></span></p><h4>TotalView</h4><p>Inside the main TotalView window, each program with its threads and processes can be accessed and reopened, even if you closed the application window. The only drawback is that it is not possible to remove an application from this window&#8230;</p><div
id="attachment_343" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_main.png"><img
class="size-medium wp-image-343" title="totalview_main" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_main-300x158.png" alt="TotalView main window" width="300" height="158" /></a><p
class="wp-caption-text">TotalView main window</p></div><p>Launching TotalView opens a window that lets you launch a new program, attach to a running one, or analyze a core dump. If the application uses MPI, this must be indicated (several implementations are available).</p><div
id="attachment_344" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_open.png"><img
class="size-medium wp-image-344" title="totalview_open" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_open-300x130.png" alt="TotalView window for opening a program" width="300" height="130" /></a><p
class="wp-caption-text">TotalView window for opening a program</p></div><p>Once the application is launched, it is possible to actually debug it. The interface shows which process and thread is currently selected (the list of processes and threads is available in the lowest tab window). Unfortunately, there is no way to browse the code, so you have to navigate through your code (you can &#8220;dive&#8221; into a function by double-clicking on a call) to put a breakpoint somewhere.</p><p>For TotalView, breakpoints are a special case of action points. On action points, you can stop the program, or execute a simple piece of code. You can also tell TotalView to stop the program once it has gone through an instruction a specific number of times (useful when the error shows up at the hundredth-or-so iteration of a loop).</p><p>There are several ways of stopping when arriving at an action point: stopping as soon as one thread/process arrives, when all of them have arrived, when a group has arrived, &#8230; There are also a lot of other functions that are quite useful.</p><div
id="attachment_345" class="wp-caption aligncenter" style="width: 245px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_start.png"><img
class="size-medium wp-image-345" title="totalview_start" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_start-235x300.png" alt="TotalView debugging window" width="235" height="300" /></a><p
class="wp-caption-text">TotalView debugging window</p></div><p>Exploring variables is one of the obvious uses of a debugger. Without it, debugging is often useless. TotalView lets you &#8220;dive&#8221; into a variable, and then explore it. A multi-dimensional variable can be sliced, and then compared between processes. When a variable is modified, it appears in yellow. It is then possible to compare the result of an MPI communication (for instance).</p><p>Compared to other parallel debuggers (like DDT), the array display is not as beautiful. TotalView has other advantages, such as having its own C/C++/Fortran debugger, without relying on gdb.</p><div
id="attachment_346" class="wp-caption aligncenter" style="width: 260px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_variables.png"><img
class="size-medium wp-image-346" title="totalview_variables" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_variables-250x300.png" alt="TotalView variable display window" width="250" height="300" /></a><p
class="wp-caption-text">TotalView variable display window</p></div><h4>MemoryScape</h4><p>MemoryScape is TotalView&#8217;s memory tool. It captures OS memory calls and watches what the application does.</p><p>The first option is to guard memory blocks. It&#8217;s less effective than Fortran&#8217;s bounds checking, but it is less costly (as the memory guards are only checked when the program stops). Other options include painted blocks (a pattern is &#8220;painted&#8221; inside the block, and if it shows up somewhere else in the code, it means that the block wasn&#8217;t correctly initialized, for instance), hoarded memory (deallocated memory is not immediately freed, which can help detect memory corruption) and of course leak detection.</p><p>Several graphs can be drawn, but some are misleading (such as the memory pie chart, which does not give an accurate picture).</p><h4>ReplayEngine</h4><p>ReplayEngine is a reverse debugger. When the program has crashed, it is possible to rewind the execution to find where the problem first showed up.</p><p>Of course, the rewind option is based on snapshots. This means that you cannot replay a really big program (one that uses several GB), that ReplayEngine chooses when to take a snapshot, and that the instant you want may not have been captured. I never used ReplayEngine because of these pitfalls (no reverse debugger can escape them).</p><h4>Conclusion</h4><p>Although it is quite expensive, TotalView is very helpful. When I had to parallelize a scientific code with MPI, it was simple to use with the MPI library I had chosen, and the variable display helped me fix the communications in no time.</p><p>I never had a real use for MemoryScape. The leak detection is efficient, but as with Valgrind, some detected leaks are not real leaks. The guarded memory could have been useful, but since my issues were with reads, it couldn&#8217;t help me.</p><p>In the end, I would recommend TotalView as a parallel debugger. 
Along with an efficient parallel profiler, it is one of the must-have tools in one&#8217;s toolbox.</p><p>Link to the official TotalView website: <a
href="http://totalviewtech.com/">http://totalviewtech.com/</a></p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Book review: Patterns for Parallel Programming</title><link>http://matt.eifelle.com/2009/03/10/book-review-patterns-for-parallel-programming/</link> <comments>http://matt.eifelle.com/2009/03/10/book-review-patterns-for-parallel-programming/#comments</comments> <pubDate>Tue, 10 Mar 2009 08:32:38 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Book review]]></category> <category><![CDATA[Design Patterns]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[Design pattern]]></category> <category><![CDATA[Parallel computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=202</guid> <description><![CDATA[As with other programming models, there are patterns for selecting the right parallel solution when it comes to designing a parallel application. This is what this book is about. The solutions may be obvious, but patterns often are. Content and opinions
The global content of the book is nothing new. What is really great is the progression [...]]]></description> <content:encoded><![CDATA[<p>As with other programming models, there are patterns for selecting the right parallel solution when it comes to designing a parallel application. This is what this book is about. The solutions may be obvious, but patterns often are.<br
/> <span
id="more-202"></span></p><h4>Content and opinions</h4><p>The global content of the book is nothing new. What is really great is the progression throughout the book (the table of contents is perhaps obvious, but it is easy to get it wrong). The decisions one has to make to write a parallel application really follow the book&#8217;s flow.</p><p>Before the actual patterns, two chapters are spent explaining what is meant by parallel programming and the associated patterns. The implications for the OS and the environment are explained clearly and simply. There isn&#8217;t much to say about these chapters, as they cover pretty basic knowledge.</p><p>The first interesting chapter is the third one. It starts the pattern show with how to achieve concurrency in one&#8217;s application. Data or task parallelism? Then, how do data and tasks interact? Once this is done, the fourth chapter comes in. How can one use the preceding findings inside one&#8217;s algorithm? Pipeline, divide and conquer, &#8230; the usual solutions are thoroughly explained.</p><p>Once the interactions are set, they can be used to choose the best tool: one program or several (SPMD or MPMD)? Master/Worker, fork/join, &#8230; There are several ways of setting up the application to use the patterns of chapter four (what the book calls the structure space). Then, the final chapter is dedicated to the implementation of those structures. Threads or processes? It also depends on the parallel tool that one can use (MPI or OpenMP, for instance), and on how threads or processes can interact through synchronization and communication.</p><p>At this point, the last step is achieved and the program may be written without additional thought about how parallelism can be achieved.</p><h4>Conclusion</h4><p>Of course, since the book was first published, additional parallel patterns were &#8220;found&#8221; (more exactly, described), but the ones from the book are the most used ones. 
Nothing prevents you from using additional ones inside your own parallel workflow.</p><p>Ralph Johnson (from the Gang of Four) gave a talk about parallel patterns and this book. It can be found here: http://media.cs.uiuc.edu/Apresos/seminars/UPCRC/2008-09-19/UPCRC__2008-09-19_02-58-PM_files/flash_index.htm</p><p>Now that parallelism is the only way to go to really speed up programs, I hope the book will become more popular.</p><p>If you don&#8217;t like reading and have a sense of fun, you can watch this <a
href="http://software.intel.com/en-us/videos/a-visual-guide-to-key-concepts-in-threaded-programming-Common-problems-and-how-to-solve-them">Intel video</a>. Enjoy <img
src='http://matt.eifelle.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p><div
class="subcolumns"><div
style="border: 1px solid #000; padding: 5px; margin-bottom: 15px; background: url(http://matt.eifelle.com/wp-content/plugins/amazonsimpleadmin/img/amazon_US_small.gif) right bottom no-repeat #ffffff;"><div
style="width: 57px; float: left; margin-right: 5px;"> <a
href="http://www.amazon.com/exec/obidos/ASIN/0321228111/masbl03-20" target="_blank"><img
src="http://ecx.images-amazon.com/images/I/519g-mobkzL._SL75_.jpg" width="57" height="75" border="0" /></a></div><div><p><a
href="http://www.amazon.com/exec/obidos/ASIN/0321228111/masbl03-20" target="_blank">Patterns for Parallel Programming</a> (Hardcover)<br
/> <span
style="font-size: 0.8em;">by <strong>Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill</strong></span><br
/> ISBN: 0321228111</p><p><strong>Price:</strong> <span
style="color: #990000; font-weight: bold;">USD 52.29</span><br
/> <strong>42 used &#038; new</strong> available from <span
style="color: #990000; font-weight: bold;">USD 30.00</span></p><p> <img
src="http://matt.eifelle.com/wp-content/plugins/amazonsimpleadmin/img/stars-3.5.gif" class="asa_rating_stars" /> | 3.5 | 6</div><div
style="clear: both;"></div></div></div>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/03/10/book-review-patterns-for-parallel-programming/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Book review: Parallel Finite-Difference Time-Domain Method</title><link>http://matt.eifelle.com/2009/02/03/book-review-parallel-finite-difference-time-domain-method/</link> <comments>http://matt.eifelle.com/2009/02/03/book-review-parallel-finite-difference-time-domain-method/#comments</comments> <pubDate>Tue, 03 Feb 2009 08:08:55 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Artech House Publishers]]></category> <category><![CDATA[Book review]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Parallel computing]]></category> <category><![CDATA[Scientific computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=326</guid> <description><![CDATA[I came across the issue of how to teach a trainee how to write a parallel finite-difference time-domain (FDTD) method. There are a lot of books on the FDTD, but only a few on parallel ones. So I&#8217;ve decided to go for this book, knowing that some chapters won&#8217;t apply to our job (wave equations). [...]]]></description> <content:encoded><![CDATA[<p>I came across the issue of how to teach a trainee how to write a parallel finite-difference time-domain (FDTD) method. There are a lot of books on the FDTD, but only a few on parallel ones. So I&#8217;ve decided to go for this book, knowing that some chapters won&#8217;t apply to our job (wave equations). My goal was to seek a book that would explain the basics of my issues.<br
/> <span
id="more-326"></span></p><h4>Content and opinions</h4><p>The book can be split in two parts: the first is about the electromagnetic equations, the second about their parallel implementation.</p><p>The first chapter deals with the basics of FDTD: stability analysis, dispersion, &#8230; Nothing fancy, but it does its work. Then different kinds of boundary conditions are presented in the second chapter. Many are specific to electromagnetism, but the one I use (CPML) is also covered.</p><p>The next three chapters deal mainly with electromagnetism specifics, so I didn&#8217;t read them closely. The third is about some FDTD optimizations, and the fourth introduces the different source solutions, mainly electromagnetism-specific. The last chapter on FDTD is about data collection and what can be computed from the collected data. Some of the information is worth reading, as we can forget, for instance, that an FDTD computation does not output all its results on the same grid (the electric and magnetic fields are interlaced).</p><p>The last five chapters are dedicated to parallel FDTD. After an introduction to parallel systems (not outstanding, but it presents the architecture, the different techniques, how speedup is computed, &#8230;), the actual FDTD method is dissected through the different exchange techniques (I do not use one of them, but the differences and implications of each are correctly described) and the actual exchange code (independent of the technique used). Drawings explain what must be communicated, and the MPI code is given (in fact, it&#8217;s an extract of the program given at the beginning of the chapter). Then other electromagnetism topics are addressed.</p><p>The eighth chapter presents some results, and finally the last two chapters cover FDTD when using a polar representation (which is not my case, so I&#8217;ve skipped them).</p><h4>Conclusion</h4><p>As usual for such a book, you have to make your firm buy it for you. At $100 for a 260-page book, it&#8217;s not cheap.<br
/> Now, the real question was whether I thought it would help a trainee (or a beginner) understand the issues that arise with FDTD. On that point, I think it does, although a lot can be skipped if electromagnetism is not the application field. It will not teach the specifics of rotated grids or higher orders, but once the basics have been explained to you, you can understand more complex issues.</p><div
style="border: 1px solid #000; padding: 5px; margin-bottom: 15px; background: url(http://matt.eifelle.com/wp-content/uploads/2009/12/BN_Logo_3tier.jpg) right bottom no-repeat #ffffff;"> <a
rel="nofollow" href="http://r.popshops.com/pp/69415/parallel-finite-difference-time-domain-method"><img
style="width: 150px;" src="http://images.barnesandnoble.com/images/17920000/17925109.JPG" border="0" alt="Parallel Finite-Difference Time-Domain Method" /></a><br
/> <a
rel="nofollow" href="http://r.popshops.com/pp/69415/parallel-finite-difference-time-domain-method">Parallel Finite-Difference Time-Domain Method</a><br
/> Price: $101.6</div><div
class="subcolumns"><div
style="border: 1px solid #000; padding: 5px; margin-bottom: 15px; background: url(http://matt.eifelle.com/wp-content/plugins/amazonsimpleadmin/img/amazon_US_small.gif) right bottom no-repeat #ffffff;"><div
style="width: 50px; float: left; margin-right: 5px;"> <a
href="http://www.amazon.com/exec/obidos/ASIN/1596930853/masbl03-20" target="_blank"><img
src="http://ecx.images-amazon.com/images/I/41jD5NvjA1L._SL75_.jpg" width="50" height="75" border="0" /></a></div><div><p><a
href="http://www.amazon.com/exec/obidos/ASIN/1596930853/masbl03-20" target="_blank">Parallel Finite-Difference Time-Domain Method (Artech House Electromagnetic Analysis)</a> (Hardcover)<br
/> <span
style="font-size: 0.8em;">by <strong>Wenhua Yu, Raj Mittra, Tao Su, Yongjun Liu, Xiaoling Yang</strong></span><br
/> ISBN: 1596930853</p><p><strong>Price:</strong> <span
style="color: #990000; font-weight: bold;">USD 127.00</span><br
/> <strong>16 used &#038; new</strong> available from <span
style="color: #990000; font-weight: bold;">USD 94.35</span></p><p> |  | 0</div><div
style="clear: both;"></div></div></div>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/02/03/book-review-parallel-finite-difference-time-domain-method/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>The different faces of HPC</title><link>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/</link> <comments>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/#comments</comments> <pubDate>Tue, 20 Jan 2009 08:16:26 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Parallel and Distributed Computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=242</guid> <description><![CDATA[For each algorithm and program, there are architectures that are better than others. Some computation may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency are as important as computational power, especially since memory speed and CPU speed are decoupled. Raw computational power needs
With the [...]]]></description> <content:encoded><![CDATA[<p>For each algorithm and program, there are architectures that are better than others. Some computation may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency are as important as computational power, especially since memory speed and CPU speed are decoupled.</p><p><span
id="more-242"></span></p><h4>Raw computational power needs</h4><p>With the raw computational power we have, it is possible to simulate more and more complicated models. So the newest processors will never be enough: as soon as more power is available, more precise models will be used (http://www.ddj.com/cpp/205900309).</p><p>Still, more complicated models mean more memory, more communication and more I/O. For some applications, this is not a problem. For instance, Folding@Home sends a small data set to a computer, which processes it and then sends an answer back to the server. In this model, there is not much communication, as each data set can be processed by itself.</p><h4>Memory bandwidth and latency needs</h4><p>For other applications, memory bandwidth and latency are the real bottleneck: the application spends more time waiting for data than actually computing. And there are different levels in the bottleneck.</p><p>Sometimes, all data can fit inside the L2 cache, so the CPU can almost be fully used. Then, perhaps the data fit inside the RAM, and there are a lot of exchanges between RAM and L2 (for instance in finite difference schemes). In this case, the CPU sometimes has to wait for data. There are strategies to mitigate this, but the CPU will always be idle at some point.</p><p>I am not even talking about cases where the model cannot fit inside the memory, and still is needed at every stage of the computation.</p><h4>Communication needs</h4><p>When the computation cannot fit in one computer or one cluster node, applications may face another bottleneck. This is where you have to choose between a grid, a cluster of workstations or a real cluster. A massive grid, like the one used for Folding@Home, solves problems that can&#8217;t be solved by a cluster of workstations (in an acceptable time). 
The latter can solve for instance some medical image processing problems (like detecting differences between 3D brain images of different populations), as the amount of work for one step (like the normalization) can be done on one node but the amount of communication is too large for the Internet (sending hundreds of megabytes for each result). Then, other medical imaging tasks, like tracking the evolution of an artery during a cardiac cycle, need a real cluster with low-latency communications (during each iteration, data must be transmitted to different nodes, and this can only be achieved through fast, low-latency network interfaces).</p><h4>What is really needed?</h4><p>In the end, before you choose the architecture you need, you have to write down the actual problem you want to solve. Buying a cluster if your program won&#8217;t benefit from it will be a waste of money. The same can be said for the processor you will use. A processor with a high number of FLOPS but a low memory bandwidth can be worse than a slower processor with a high memory bandwidth. It all depends on your problem.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>