[MITgcm-support] Parallel performance

Ed Hill ed at eh3.com
Thu Jun 16 10:11:40 EDT 2005


On Thu, 2005-06-16 at 15:17 +0200, Stefano Querin wrote:
> Hi Kevin,
> I had similar problems when I ported the code from an IBM SP4 to a Linux
> cluster that seems similar to yours (Opteron 64, Rocks OS, Sun Grid Engine,
> Ethernet network). Basically, the poor performance when using multiple procs
> was due to I/O problems. The SP4 has a very fast scratch filesystem, so I
> could make the model read and write files on that disk without problems
> (also using the GlobalFiles=.TRUE. option in PARM01 in the data file). The
> same configuration on the new cluster, using an NFS filesystem, gave very
> slow simulations. I solved the problem simply by splitting the run across
> the CPUs: before the beginning of the simulation I copy the input files and
> the executable to each local disk (via scp), then I start the run with
> GlobalFiles=.FALSE. and, at the end of the simulation, I copy the output
> back to the front-end node.
> Are you using the GlobalFiles=.TRUE. option in PARM01 in the data file?
> Maybe this is not the solution for your problem but a check on the I/O could 
> be useful!
> Good luck!
> 
> Stefano
> 
> P.S.: updating the code to a more recent version is not so difficult and is
> VERY useful. I suggest you try a newer checkpoint (lots of new features and
> bug fixes)!


Hi Kevin & Stefano,

Yes, all of Stefano's advice is good!  IO is a very common bottleneck.
It's a typical problem on clusters running NFS over Ethernet.  So
anything that you can do to:

  + reduce the overall IO volume
  + use local (per-node) disks instead of NFS,  and/or
  + use something like Lustre or PVFS2 instead of NFS

can, potentially, make a big difference.
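
As a rough illustration of the second point, the staging that Stefano
describes could be scripted along these lines.  This is only a sketch
under assumptions: the node names, the paths in RUN_DIR, SCRATCH and
OUT_DIR, and the helper names (run, uses_global_files, stage_in,
stage_out) are all hypothetical, and it assumes password-less ssh/scp
to the compute nodes.  With DRY_RUN=1 the commands are printed rather
than executed.

```shell
#!/bin/sh
# Sketch only: every name and path below is a placeholder (assumption),
# not something taken from MITgcm or from a particular cluster.
NODES="node01 node02"            # hypothetical compute nodes
RUN_DIR="$HOME/mitgcm_run"       # inputs + executable on the front end
SCRATCH="/scratch"               # hypothetical node-local disk
OUT_DIR="$HOME/mitgcm_output"    # where results are gathered

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Succeed if the given "data" file sets globalFiles=.TRUE. on a line
# that is not commented out with '#'.  Only the .TRUE. spelling is matched.
uses_global_files() {
    grep -i -q '^[^#]*globalfiles *= *\.true\.' "$1"
}

# Copy the whole run directory to the local disk of every node.
stage_in() {
    for n in $NODES; do
        run ssh "$n" mkdir -p "$SCRATCH"
        run scp -r -q "$RUN_DIR" "$n:$SCRATCH/"
    done
}

# Copy each node's run directory back to the front end, one
# subdirectory per node.
stage_out() {
    name=$(basename "$RUN_DIR")
    run mkdir -p "$OUT_DIR"
    for n in $NODES; do
        run scp -r -q "$n:$SCRATCH/$name" "$OUT_DIR/$n"
    done
}
```

A typical sequence would be: check the data file with
uses_global_files, call stage_in, launch the model from the scratch
directory with GlobalFiles=.FALSE. (the launch command depends on your
MPI and queueing setup, so it is left out here), and then call
stage_out.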

Also, we recommend that you stay current with your MITgcm version.  The
"head" (or very latest) of our CVS tree has been usable for many months
and we generally recommend it or, to be safer, the most recent tagged
checkpoint, which you can find by looking at the top of:

http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm/doc/tag-index?rev=HEAD

where the latest one is currently called "checkpoint57h_done" and you
can get it from CVS using:

  cvs co -r checkpoint57h_done MITgcm

Ed

-- 
Edward H. Hill III, PhD
office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
             Cambridge, MA 02139-4307
emails:  eh3 at mit.edu                ed at eh3.com
URLs:    http://web.mit.edu/eh3/    http://eh3.com/
phone:   617-253-0098
fax:     617-253-4464



