<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>Matthieu Brucher&#039;s blog &#187; High Performance Computing</title> <atom:link href="http://matt.eifelle.com/category/general/distributed-computing/high-performance-computing/feed/" rel="self" type="application/rss+xml" /><link>http://matt.eifelle.com</link> <description></description> <lastBuildDate>Tue, 27 Jul 2010 07:04:23 +0000</lastBuildDate> <generator>http://wordpress.org/?v=2.9.1</generator> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <item><title>Optimally use massively parallel clusters resources</title><link>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/</link> <comments>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/#comments</comments> <pubDate>Tue, 15 Jun 2010 07:53:10 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[Batch scheduling]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=1208</guid> <description><![CDATA[We now have several petaflop-scale clusters available in the Top500. Of course, we are trying to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.
I&#8217;ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively [...]]]></description> <content:encoded><![CDATA[<p>We now have <a
href="http://www.top500.org/">several petaflop-scale clusters available in the Top500</a>. Of course, we are trying to get the most out of their peak computational power, but I think we should sometimes also look at optimal resource allocation.</p><p>I&#8217;ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively data parallel. Traditionally, one launches a job through one&#8217;s favorite batch scheduler (favorite or mandatory&#8230;) with fixed resources and for an estimated amount of time. This may work well in research, but in the industrial world, there is often a new job that arises and needs part of your scarce resources. You may have to stop your work, lose your current progress and/or restart the job with fewer resources. And then the cycle goes on.</p><p><span
id="more-1208"></span></p><h4>Static resource allocation</h4><p>How can resource allocation work? Let&#8217;s start with a simple case where you have 2 applications with different priorities. One of them has a priority of 70 (it&#8217;s supposed to finish in three days) whereas the other one has a priority of 50 (four days left). They share the cluster so that 66% is allocated to the first application and 33% to the second one.<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-2.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-2-300x165.png" alt="" title="Dispatch and allocation of two applications" width="300" height="165" class="aligncenter size-medium wp-image-1241" /></a></center></p><p>What happens if a third application must be launched with a higher priority, because it has to be finished by tomorrow? You may stop the other two programs, and lose a lot of work if you didn&#8217;t implement checkpoints (besides, one of them may be an off-the-shelf program you bought yesterday), or suspend them. Either way, this is what you will get:<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-3.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Allocation-3-300x165.png" alt="" title="Dispatch and allocation for three applications" width="300" height="165" class="aligncenter size-medium wp-image-1242" /></a></center></p><p>In fact, even if you use dynamic resource allocation, this is what you must get to have your results by the time you need them, but obviously, you have lost your two other applications. Some batch schedulers allow applications to be suspended, but this is a double-edged sword:</p><ul><li>your cluster must support job suspension, and thus have access to drives to save the job state (which is not possible for medium to large-scale clusters)</li><li>if your application does not scale to your entire cluster (it happens), one of the other two applications could in principle go on, but it cannot, because all of its processes are put to sleep</li></ul><p>So all things considered, you have to implement dynamic resource allocation.</p><h4>Dynamic resource allocation</h4><p>How does this work? Each application must be aware that it can be allocated more resources, or have some taken away, at any time. To be portable across all clusters, you cannot suspend part of your program; it must really go away. The batch scheduler must also notice that your application has freed some of its resources. You thus have to allocate small jobs that communicate with each other (this can be done with MPI-2).</p><p>This means that you will have hundreds or thousands of small jobs. Not all of them have to be connected to the scheduler; only one master must be. Of course, this can easily be done by using a specific queue. Each application on this queue will thus receive orders from the batch scheduler and act upon them. Another advantage is that even if the application gets no resources at some point, it still has a saved state that allows the run to continue later.<br
/><center><a
href="http://matt.eifelle.com/wp-content/uploads/2010/06/Dynamic-workflow.png"><img
src="http://matt.eifelle.com/wp-content/uploads/2010/06/Dynamic-workflow-300x165.png" alt="" title="Dynamic resource allocation workflow" width="300" height="165" class="aligncenter size-medium wp-image-1244" /></a></center></p><p>Of course, this is not easy to do. How can this be applied to an off-the-shelf application? Well, in this case, you may create a bogus application on the master queue that will at least allow other applications to be allocated resources beside it.</p><p>You do not have to implement this on top of MPI. It can be really hard to do (handling data moves between processors, changing the decomposition, &#8230;), and you may implement another solution. In my case, I have thousands of different tasks that can each run on very few cores, so this is my elementary unit. I don&#8217;t need all tasks to communicate with each other, so I create brand new independent jobs each time, and I can also tell the scheduler that it may kill jobs that are not responding before the next allocation phase.</p><h4>Conclusion</h4><p>To finish, I&#8217;ll say that I know that <a
href="http://www.platform.com/">LSF</a> allows plugins that help dispatch jobs on specific hosts of your cluster (to get the best communication locality). There seems to be a way of implementing the needs gathering and the resource assignment, but the documentation is not clear (at all). A specific daemon may be needed. I don&#8217;t know if other batch schedulers allow plugins to modify their behavior; if you know of any and their API, please do tell <img
src='http://matt.eifelle.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/06/15/optimally-use-massively-parallel-clusters-resources/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Book review: Programming Massively Parallel Processors: A Hands-on Approach</title><link>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/</link> <comments>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/#comments</comments> <pubDate>Wed, 31 Mar 2010 07:42:28 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Book review]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Morgan Kaufmann]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[CUDA]]></category> <category><![CDATA[Parallel and Distributed Computing]]></category> <category><![CDATA[Parallel computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=1161</guid> <description><![CDATA[Massively parallel processors are in fashion today. We had small parallel processors with a few cores and the ability to launch several threads on one core, we now have many cores on one processor and at the other end of the spectrum, we have GPUs. CPU vendors are now going in this direction with [...]]]></description> <content:encoded><![CDATA[<p>Massively parallel processors are in fashion today. We had small parallel processors with a few cores and the ability to launch several threads on one core, we now have many cores on one processor and at the other end of the spectrum, we have GPUs. CPU vendors are now going in this direction with Larrabee and Fusion, and GPUs will keep getting more cores/threads/&#8230; It&#8217;s thus mandatory to understand this shift now.<br
/> <span
id="more-1161"></span></p><h4>Content and opinions</h4><p>First of all, it&#8217;s not a book on programming massively parallel processors, it&#8217;s a book about CUDA. One of the authors is an nVidia fellow, so it&#8217;s no wonder. I think there are three parts in the book: an introduction to CUDA, two examples and then general considerations and the future.</p><p>The first 6 chapters (I don&#8217;t count the first chapter as a real chapter, it&#8217;s more of an introduction to massively parallel processors and their use in a few pages) are the main CUDA tutorial. I say tutorial because it feels like all the beginner courses I&#8217;ve taken in CUDA. The content can be found in all Internet classes, so the only advantage is that you have everything in a book. Nothing less, nothing more.</p><p>I had a feeling of &#8220;deja vu&#8221; for the MRI example, the second was unknown to me. There is not much code, only for the relevant parts, but you won&#8217;t be able to test the different implementations with what is provided in the book. Besides, several times throughout the text, new techniques are introduced, but one can&#8217;t know what speed-up they provide. Perhaps this is because the speed-up cannot be generalized, but still, with proper warnings, the different timings through the GPU port of both examples would have been great.</p><p>The last part is, as I&#8217;ve said, more general. It starts with a workflow to help parallelize with GPUs, then an introduction (too short IMHO) to OpenCL and the future of CUDA with Fermi and the SDK 3.0. The workflow chapter is too short. Of course, the goal isn&#8217;t to be like <a
href="http://matt.eifelle.com/2009/12/08/book-review-the-art-of-concurrency-a-thread-monkeys-guide-to-writing-parallel-applications/">The Art of Concurrency</a>, and at least there is a chapter about the process of selecting the algorithm, &#8230; but it is too short. The OpenCL introduction is really an introduction. I&#8217;ve seen one small complete OpenCL call, but that&#8217;s it. I couldn&#8217;t program a single kernel right now. Of course it&#8217;s a CUDA book, not an OpenCL one, but the chapter is useless. Perhaps it would be better to merge it with the &#8220;future&#8221; chapter, as OpenCL is not widely available. Finally, the last chapter states what can be expected of Fermi (really interesting) and of the SDK 3.0.</p><p>What I miss in this book is an explanation of texture memory. The obvious matrix example uses constant memory for caching the memory accesses. Why isn&#8217;t texture memory used in this example? It&#8217;s far bigger than constant memory and also has a cache, so why not use it? It&#8217;s a CUDA book, but a lot of content is freely available in several tutorials that are sometimes better organized than the book, so why isn&#8217;t there some special content, like how the cache works? How can you manage grid sizes that are not a power of two? (it&#8217;s explained in one of the examples, with zero padding, but there is no such protection in the first chapters, which is dangerous) What is memory coalescing and how can I optimize the memory bandwidth with coalescing in mind? (the actual explanation and the appropriate picture are in the last appendix!)</p><h4>Conclusion</h4><p>I&#8217;m not saying that the book is not useful; it&#8217;s really interesting as a companion book for a CUDA course or for a beginner. If you&#8217;re used to reading material online, you will not be interested. If you buy this book, don&#8217;t expect to know everything about CUDA, let alone about massively parallel processors.
You will have to dig deeper for specific topics, but at least you will have a good basis.</p><div
style="border: 1px solid #000; padding: 5px; margin-bottom: 15px; background: url(http://matt.eifelle.com/wp-content/uploads/2009/12/BN_Logo_3tier.jpg) right bottom no-repeat #ffffff;"> <a
rel="nofollow" href="http://r.popshops.com/pp/78348/programming-massively-parallel-processors-a-hands-on-approach"><img
style="width: 150px;" src="http://images.barnesandnoble.com/images/47190000/47190706.JPG" border="0" alt="Programming Massively Parallel Processors: A Hands-on Approach" /></a><br
/> <a
rel="nofollow" href="http://r.popshops.com/pp/78348/programming-massively-parallel-processors-a-hands-on-approach">Programming Massively Parallel Processors: A Hands-on Approach</a><br
/> Price: $62.95</div><div
class="subcolumns"><div
style="border: 1px solid #000; padding: 5px; margin-bottom: 15px; background: url(http://matt.eifelle.com/wp-content/plugins/amazonsimpleadmin/img/amazon_US_small.gif) right bottom no-repeat #ffffff;"><div
style="width: 60px; float: left; margin-right: 5px;"> <a
href="http://www.amazon.com/exec/obidos/ASIN/0123814723/masbl03-20" target="_blank"><img
src="http://ecx.images-amazon.com/images/I/51VL9FqF6ML._SL75_.jpg" width="60" height="75" border="0" /></a></div><div><p><a
href="http://www.amazon.com/exec/obidos/ASIN/0123814723/masbl03-20" target="_blank">Programming Massively Parallel Processors: A Hands-on Approach</a> (Paperback)<br
/> <span
style="font-size: 0.8em;">by <strong>David B. Kirk, Wen-mei W. Hwu</strong></span><br
/> ISBN: 0123814723</p><p><strong>Price:</strong> <span
style="color: #990000; font-weight: bold;">USD 51.99</span><br
/> <strong>40 used &#038; new</strong> available from <span
style="color: #990000; font-weight: bold;">USD 46.52</span></p><p> <img
src="http://matt.eifelle.com/wp-content/plugins/amazonsimpleadmin/img/stars-3.5.gif" class="asa_rating_stars" /> | 3.5 | 8</div><div
style="clear: both;"></div></div></div>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/03/31/book-review-programming-massively-parallel-processors-a-hands-on-approach/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Thinking of good practices when developing with accelerators</title><link>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/</link> <comments>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/#comments</comments> <pubDate>Tue, 05 Jan 2010 08:48:57 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Design Patterns]]></category> <category><![CDATA[Development process]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Tools]]></category> <category><![CDATA[CUDA]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[Grid computing]]></category> <category><![CDATA[HMPP]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[Scientific computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=997</guid> <description><![CDATA[Due to the end of the free lunch, manufacturers started to provide different processing units and developers started to go parallel. It&#8217;s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). Those accelerators were eventually integrated into the CPU, and so were their instruction sets.
Today&#8217;s [...]]]></description> <content:encoded><![CDATA[<p>Due to the end of the <a
href="http://www.gotw.ca/publications/concurrency-ddj.htm">free lunch</a>, manufacturers started to provide different processing units and developers started to go parallel. It&#8217;s kind of back to the future, as accelerators existed before today (the x87 FPU started as a coprocessor, for instance). Those accelerators were eventually integrated into the CPU, and so were their instruction sets.</p><p>Today&#8217;s accelerators are not there yet. The tools (code translators) are not ready yet and usual programming practices may not be adequate. The whole ecosystem will evolve, accelerators will change (GPUs are the main trend, but they will be different in a few years), so what you do today needs to be shaped with these changes in mind. How is it possible to do so? Is it even possible?<br
/> <span
id="more-997"></span></p><h4>Available code translators</h4><p>Code translators are the easiest path to a solution. I know of two.</p><p>The first is the <a
href="http://www.pgroup.com/resources/accel.htm">PGI compiler</a>. It only supports CUDA and the Fortran and C99 languages. I haven&#8217;t used it yet, although I plan to test it in the near future. It is based on pragmas, and the compiler generates the CUDA microcode.</p><p>The second solution is <a
href="http://www.caps-entreprise.com/fr/page/index.php?id=49&amp;p_p=36">HMPP</a>. It supports more than just CUDA (also CAL/IL or OpenCL) and Fortran/C (also Java now). Like the PGI compiler, it is based on pragmas, and an excellent feature is that it detects the available accelerators and launches the correct kernel (if you authorized it) or the original code. You can also replace the generated code with your own (you can tune the code for instance, which may give you an additional 2x speed-up). Unfortunately, it is not possible to call functions inside the parallelized kernels, which means that only simple or badly-written (too many lines or duplicated code) kernels can be used. I think this is the same for the PGI compiler.</p><p>It seems that code translators still need work:</p><ul><li>only a few accelerators are supported (CUDA, and sometimes CAL/IL or OpenCL),</li><li>almost no languages are supported (only Fortran/C/Java; a lot of virtual machines should be able to use accelerators natively, without developers using specific tools),</li><li>only one function can be parallelized at a time.</li></ul><p>The last point is currently the biggest issue. You need to cut your function into pieces to have clean code and good portability and maintainability for the future.</p><p>This is why one still needs to program a lot for those accelerators, and so we need to adapt our programming practices and develop in the accelerators&#8217; native languages (even if we know that they may disappear in a few years).</p><h4>Developing your own &#8220;tool chain&#8221; for accelerators</h4><p>For accelerators, there are a lot of things that need to be done each time: copying some data, computing and getting some data back. These are the steps that code translators automate; in fact, it is common practice to use tools to automate such things. The issue is that complex kernels are not supported by those translators.
So what?</p><p>Creating automatic functions that will copy the data you need is in fact very common in metaprogramming. Coding the kernel on an accelerator is in fact not that difficult: the manufacturers provide the needed compilers (that&#8217;s what nVidia does and the success of the tool chain cannot be denied), and this is really the cornerstone. One has to write more code, and some parts are less portable (because they are written in one of the accelerator&#8217;s languages), but in the end, with metaprogramming, the code can be better tuned, enhanced and read. This is the leverage of the accelerators.</p><h4>Conclusion</h4><p>Why do we care about developing for accelerators? We know that they will go away. Until they do, they are the only way of speeding up our software. Code translators are the best tools to develop in a portable way, but they need time to support more accelerators, languages and programming methods. When CPUs are on a par with accelerators, their progress will help compilers to target them correctly. It&#8217;s just a matter of time.<br
/> Meanwhile, metaprogramming is the next best solution to automate processes that code translators cannot support yet.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2010/01/05/thinking-of-good-practices-when-developing-with-accelerators/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Overview of TotalView, a parallel debugger</title><link>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/</link> <comments>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/#comments</comments> <pubDate>Tue, 31 Mar 2009 08:15:12 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Debugger]]></category> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Multithreaded applications]]></category> <category><![CDATA[OpenMP]]></category> <category><![CDATA[Parallel computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=265</guid> <description><![CDATA[Some months ago, I had a TotalView tutorial, thanks to my job. Now, I&#8217;ve actually used it to debug one of my parallel applications and I would like to share my experience with this fantastic tool.
First, TotalView is not only a parallel debugger available on several Linux and Unix platforms. It is also a memory checker [...]]]></description> <content:encoded><![CDATA[<p>Some months ago, I had a TotalView tutorial, thanks to my job. Now, I&#8217;ve actually used it to debug one of my parallel applications and I would like to share my experience with this fantastic tool.<br
/> First, TotalView is not only a parallel debugger available on several Linux and Unix platforms. It is also a memory checker (MemoryScape and the TotalView plugin) as well as a reverse debugger, that is, you can roll back the execution of a program, even after it crashed (at which point a standard debugger like GDB would be useless).<br
/> <span
id="more-265"></span></p><h4>TotalView</h4><p>Inside the main TotalView window, each program with its threads and processes can be accessed and reopened, even if you closed the application window. The only drawback is that it is not possible to remove an application from this window&#8230;</p><div
id="attachment_343" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_main.png"><img
class="size-medium wp-image-343" title="totalview_main" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_main-300x158.png" alt="TotalView main window" width="300" height="158" /></a><p
class="wp-caption-text">TotalView main window</p></div><p>Launching TotalView opens a window that lets you launch a new program, attach to a running one or analyze a core dump. If the application uses MPI, this must be indicated (several implementations are available).</p><div
id="attachment_344" class="wp-caption aligncenter" style="width: 310px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_open.png"><img
class="size-medium wp-image-344" title="totalview_open" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_open-300x130.png" alt="TotalView window for opening a program" width="300" height="130" /></a><p
class="wp-caption-text">TotalView window for opening a program</p></div><p>Once the application is launched, it is possible to actually debug it. The interface shows which process and thread is currently selected (the list of processes and threads is available in the lowest tab window). Unfortunately, there is no way to browse the code, so you have to walk through your code (you can &#8220;dive&#8221; into a function by double-clicking on a call) to put a breakpoint somewhere.</p><p>For TotalView, breakpoints are a special case of action points. On action points, you can stop the program, or execute a short piece of code. You can also tell TotalView to stop the program once it has passed through an instruction a specific number of times (useful when the error shows up at the hundredth-or-so iteration of a loop).</p><p>There are several ways of stopping when arriving at an action point: stopping as soon as one thread/process arrives, when all of them have arrived, when a group has arrived, &#8230; There are also a lot of other functions that are quite useful.</p><div
id="attachment_345" class="wp-caption aligncenter" style="width: 245px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_start.png"><img
class="size-medium wp-image-345" title="totalview_start" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_start-235x300.png" alt="TotalView debugging window" width="235" height="300" /></a><p
class="wp-caption-text">TotalView debugging window</p></div><p>Exploring variables is one of the obvious uses of a debugger. Without it, debugging is often useless. TotalView lets you &#8220;dive&#8221; into a variable, and then explore it. A multi-dimensional variable can be sliced, and then compared between processes. When a variable is modified, it appears in yellow. This makes it possible to check the result of an MPI communication, for instance.</p><p>Compared to other parallel debuggers (like DDT), the array display is not as polished. TotalView has other advantages, such as having its own C/C++/Fortran debugger that does not rely on gdb.</p><div
id="attachment_346" class="wp-caption aligncenter" style="width: 260px"><a
href="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_variables.png"><img
class="size-medium wp-image-346" title="totalview_variables" src="http://matt.eifelle.com/wp-content/uploads/2009/01/totalview_variables-250x300.png" alt="TotalView variable display window" width="250" height="300" /></a><p
class="wp-caption-text">TotalView variable display window</p></div><h4>MemoryScape</h4><p>MemoryScape is TotalView&#8217;s memory tool. It captures OS memory calls and watches what the application does.</p><p>The first option is to guard memory blocks. It&#8217;s less effective than Fortran&#8217;s bounds checks, but it is less costly (as the memory guards are only checked when the program stops). Other options include painted blocks (a pattern is &#8220;painted&#8221; inside the block, and if it shows up somewhere else in the program, it means that the block wasn&#8217;t correctly initialized, for instance), hoarded memory (deallocated memory is not immediately freed, which can help detect memory corruption) and of course leak detection.</p><p>Several graphs can be drawn, but some are misleading (such as the memory pie chart, which does not show the truth).</p><h4>ReplayEngine</h4><p>ReplayEngine is a reverse debugger. When the program has crashed, it is possible to rewind the execution to find where the problem first showed up.</p><p>Of course, the rewind option is based on snapshots, which means that you cannot replay a really big program (one that uses several GB), that ReplayEngine chooses when to take a snapshot, and that the instant you want may not have been captured. I never used ReplayEngine because of these pitfalls (no reverse debugger can escape them).</p><h4>Conclusion</h4><p>Although it is quite expensive, TotalView is very helpful. When I had to parallelize a scientific code with MPI, it was simple to use with the MPI library I had chosen, and the variable display helped me fix the communications in no time.</p><p>I never had a real use for MemoryScape. The leak detection is efficient, but as with Valgrind, some detected leaks are not real leaks. The guarded memory could have been useful, but as my issues were with reads, it couldn&#8217;t help me.</p><p>In the end, I would recommend TotalView as a parallel debugger.
With an efficient parallel profiler, it is one of the need-to-have tools in one&#8217;s toolbox.</p><p>Link to the official TotalView website: <a
href="http://totalviewtech.com/">http://totalviewtech.com/</a></p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/03/31/overview-of-totalview-a-parallel-debugger/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>The different faces of HPC</title><link>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/</link> <comments>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/#comments</comments> <pubDate>Tue, 20 Jan 2009 08:16:26 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[Distributed Computing]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Parallel and Distributed Computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=242</guid> <description><![CDATA[For each algorithm and program, there are architectures that are better than others. Some computation may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency are as important as computational power, especially since memory speed and CPU speed are decoupled. Raw computational power needs
With the [...]]]></description> <content:encoded><![CDATA[<p>For each algorithm and program, there are architectures that are better than others. Some computation may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency are as important as computational power, especially since memory speed and CPU speed are decoupled.</p><p><span
id="more-242"></span></p><h4>Raw computational power needs</h4><p>With the raw computational power we have, it is possible to simulate more and more complicated models. So the newest processors will never be enough: as soon as more power is available, more precise models will be used (http://www.ddj.com/cpp/205900309).</p><p>Still, more complicated models mean more memory, more communication and more I/O. For some applications, this is not a problem. For instance, Folding@Home sends a small data set to a computer, which processes it and sends an answer back to the server. In this model, there is not much communication, as each data set can be processed independently.</p><h4>Memory bandwidth and latency needs</h4><p>For other applications, memory bandwidth and latency are the real bottleneck: the application spends more time waiting for data than actually computing. And the bottleneck can appear at different levels.</p><p>Sometimes, all the data fits inside the L2 cache, so the CPU can be used almost fully. Then, perhaps the data fits inside the RAM but not the cache, and there are a lot of exchanges between RAM and L2 (for instance in finite difference schemes). In this case, the CPU sometimes has to wait for data. There are strategies to mitigate this, but the CPU will always be idle at some point.</p><p>I am not even talking about cases where the model cannot fit inside the memory but is still needed at every stage of the computation.</p><h4>Communication needs</h4><p>When the computation cannot fit in one computer or one cluster node, applications may face another bottleneck. This is where you have to choose between a grid, a cluster of workstations or a real cluster. A massive grid, like the one used for Folding@Home, solves problems that can&#8217;t be solved by a cluster of workstations (in an acceptable time). 
The latter can solve, for instance, some medical image processing problems (like detecting differences between 3D brain images of different populations): the work for one step (like the normalization) can be done on one node, but the amount of data to communicate is too large to be sent over the Internet (hundreds of megabytes for each result). Other medical imaging problems, like tracking the evolution of an artery during a cardiac cycle, need a real cluster with low-latency communications (during each iteration, data must be transmitted to different nodes, and this can only be achieved through fast, low-latency network interfaces).</p><h4>What is really needed?</h4><p>In the end, before you choose the architecture you need, you have to write down the actual problem you want to solve. Buying a cluster if your program won&#8217;t benefit from it will be a waste of money. The same can be said for the processor you will use. A processor with a high number of FLOPS but a low memory bandwidth can be worse than a slower processor with a high memory bandwidth. 
It all depends on your problem.</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2009/01/20/the-different-faces-of-hpc/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>How to promote High Performance Computing?</title><link>http://matt.eifelle.com/2008/11/20/how-to-promote-high-performance-computing/</link> <comments>http://matt.eifelle.com/2008/11/20/how-to-promote-high-performance-computing/#comments</comments> <pubDate>Thu, 20 Nov 2008 08:14:55 +0000</pubDate> <dc:creator>Matt</dc:creator> <category><![CDATA[General]]></category> <category><![CDATA[High Performance Computing]]></category> <category><![CDATA[Computer architecture]]></category> <category><![CDATA[Design pattern]]></category> <category><![CDATA[Education]]></category> <category><![CDATA[Fortran]]></category> <category><![CDATA[MPI]]></category> <category><![CDATA[Parallel computing]]></category> <category><![CDATA[Scientific computing]]></category><guid
isPermaLink="false">http://matt.eifelle.com/?p=150</guid> <description><![CDATA[I had this discussion with one of my Ph.D. advisors some months ago when we talked about correctly using the computers we had then (dual cores), and I had almost the same one in my new job here: applied maths (finite differences, signal processing, &#8230;) graduate students are not taught how to use current computers, [...]]]></description> <content:encoded><![CDATA[<p>I had this discussion with one of my Ph.D. advisors some months ago when we talked about correctly using the computers we had then (dual cores), and I had almost the same one in my new job here: applied maths (finite differences, signal processing, &#8230;) graduate students are not taught how to use current computers, so how could they develop an HPC program correctly?</p><p>I think it goes even further than that, and it will be a part of this post. What I see is that trainees and newly-hired people (to some extent myself included) lack a lot of basic Computer Science knowledge, and even IT knowledge.<br
/> <span
id="more-150"></span></p><h4>General knowledge</h4><p>A crucial issue is knowledge of computer architecture:</p><ul><li>What endianness do I use on my computers? A lot of scientific code was written for big-endian machines, whereas current mainstream computers are little endian (save for the Power architecture, thus Cell, IBM clusters, &#8230;). Although this is taught during the first programming courses, some students tend to forget about it.</li><li>How should I loop over data? If you loop over discontiguous data blocks, the CPU cache is likely to be too small to hold all the data you touch. This will lead to cache misses, and those can cost a lot.</li><li>More generally, how much memory can I use and what is its speed (bandwidth and latency)? On current computers you have access to several GB of RAM and several MB of cache. On GPUs (which are more and more used for HPC), you have hundreds of MB, but access to the graphics card&#8217;s global RAM is costly (the same applies to the Cell).</li></ul><p>At the very least, students must learn that they will have to think about those issues. I don&#8217;t know the memory sizes of every processor (I roughly know how much, and if I need more details, I search the Internet for the answer).</p><h4>Parallel knowledge</h4><p>Then there is the parallel part. Even for scientific research on small datasets, the advent of multicores is a challenge for students (and for a lot of teachers as well). We can&#8217;t expect them to know how to deal with multithreading or multiprocessing if they don&#8217;t even know the difference between an Intel chip and an IBM one.</p><p>In medical imaging, we could simply split our workflow into tasks (we had several registrations to do, so we could simply run one registration on each core), but even that was not easy. Some algorithms were not thread-safe (who said that global variables should be avoided?), so we had to launch several different processes. 
But when it came to parallelizing the registration itself, because it took a long time and because the problem size was becoming too big for a 32-bit application (yes, we were still in 32 bits because of legacy GTK1 applications), nobody could do it: nobody had the time and the knowledge to use all the CPU power.</p><p>In bigger applications like seismic ones, it&#8217;s even more obvious. Copy &#8216;n&#8217; Paste can become a sport once someone has efficiently parallelized a program. In itself, that is not a problem, but if the paste is not done smartly and the developer didn&#8217;t learn from analyzing the first program, it&#8217;s useless, because he didn&#8217;t learn anything AND because the pasted program will not be optimal and may even be prone to bugs.</p><h4>Some additional thoughts</h4><p>I found <a
href="http://software.intel.com/en-us/blogs/2008/11/12/sequential-programming-is-no-more-lets-teach-parallel-programming/">a post</a> on Intel&#8217;s Community page about <a
href="http://scyourway.nacse.org/conference/view/edu_cl101">questions that will be discussed at SuperComputing &#8216;08</a>. It shows that there is no clear answer to this problem (at the time of writing). We should teach parallelism, HPC, &#8230; to students. But are the tools ready? Is every student even able to cope with parallelism?</p>]]></content:encoded> <wfw:commentRss>http://matt.eifelle.com/2008/11/20/how-to-promote-high-performance-computing/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>