Hi Martin, another alternative to useSingleCPUio is the code written by NASA Ames folks for the 1/48 global ocean simulation and checked in here:
http://wwwcvs.mitgcm.org/viewvc/MITgcm/MITgcm_contrib/llc_hires/llc_4320/code-async/

As you suggest, it's an n-on-m strategy, except that you request additional CPUs, typically 2 to 5%, that just do I/O. The code takes care of distributing these CPUs judiciously among the compute nodes. The asyncio code literally makes the cost of I/O disappear, as long as the disk can keep up with the dump frequency.
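
To give the flavor of it, here is a bare MPI sketch of the idea (illustrative only, not the actual code-async source; for simplicity the I/O ranks are taken from the end of the rank list here instead of being spread over the compute nodes):

      PROGRAM ASYNC_IO_SKETCH
C     Bare-bones sketch of the extra-I/O-ranks idea (not the actual
C     code-async source): a small fraction of the MPI ranks is split
C     into a separate communicator and does nothing but receive
C     fields and write them, while the remaining ranks keep computing.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER ierr, worldRank, worldSize
      INTEGER nIoRanks, color, myComm
      LOGICAL isIoRank

      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, worldRank, ierr )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, worldSize, ierr )

C     Reserve roughly 2% of the ranks (at least one) for I/O.
      nIoRanks = MAX( 1, worldSize/50 )

C     Here the last nIoRanks ranks become I/O ranks; the real code
C     distributes them over the compute nodes instead.
      isIoRank = ( worldRank .GE. worldSize-nIoRanks )
      color = 0
      IF ( isIoRank ) color = 1
      CALL MPI_COMM_SPLIT( MPI_COMM_WORLD, color, worldRank,
     &                     myComm, ierr )

      IF ( isIoRank ) THEN
C       Loop here: receive tiles from the compute ranks and write
C       them, so the compute ranks never wait for the file system.
        CONTINUE
      ELSE
C       Time stepping here: post non-blocking sends of the output
C       fields to an I/O rank and go straight back to computing.
        CONTINUE
      ENDIF

      CALL MPI_FINALIZE( ierr )
      END

The point is simply that the compute ranks hand their fields off and go straight back to time stepping, which is why the I/O cost essentially vanishes.
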
Dimitris Menemenlis
<br class=""><div><blockquote type="cite" class=""><div class="">On Aug 10, 2017, at 9:14 AM, Martin Losch <<a href="mailto:Martin.Losch@awi.de" class="">Martin.Losch@awi.de</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">Hi all,<br class=""><br class="">I just heard an interesting talk about optimizing I/O, which we could try also for the MITgcm (mdsio).<br class="">The idea is to use more than one CPU for I/O, but not all, to write individual output streams. E.g., Instead of one CPU as for useSingleCPUio, one could have a list of cores, e.g. one per compute node, that can do the I/O (on top of the usual computation). At the time of writing each new output stream (e.g. from the diagnostics package, but also the “regular” variables like T, S, Eta, etc.) is done by the next CPU in the list in a “round robin” way. This cpu then gathers the field and writes it. One can in second step do the writing asynchronously; apparently the writing does not interfere too much with the ongoing compuations. In the talk this output method was much faster than anything else, also than moving the output to extra dedicated I/O-CPUs (because of the extra network load). This applies to large simulations, where the indiviual fields still fit into the part of the memory of one node that is not used from computations (probably not the llc4320 size).<br class=""><br class="">As far as I can see, to do this one needs to replace MASTER_CPU_IO in routines like MDS_WRITE_FIELD by something like MASTER_CPU_OUT (which can default to MASTER_CPU_IO, if the method is not used), which checks if myProcId equals the “next in the list”, and then “the next in the list" needs to be passed to the gather routines, so that other than the master-process = 0 can do the gather, but I’d have to try it out (I have help for this) ...<br class=""><br class="">I am asking for your opinion, because if there are good reasons for not doing this, I will not spend any further time on this. Does this seem like something worth the effort?<br class=""><br class="">Martin<br class=""><br class="">_______________________________________________<br class="">MITgcm-devel mailing list<br class=""><a href="mailto:MITgcm-devel@mitgcm.org" class="">MITgcm-devel@mitgcm.org</a><br class="">http://mailman.mitgcm.org/mailman/listinfo/mitgcm-devel<br class=""></div></div></blockquote></div><br class=""></div></div></body></html>