[MITgcm-support] Fwd: MITgcm with PGI and Ubuntu

Martin Losch Martin.Losch at awi.de
Fri Jul 22 03:33:26 EDT 2011


Stefano,

I am by no means an expert on this (and I did not follow this thread, so I might repeat what others already said), but from your STDOUT.0000 it strikes me that you lose performance basically in 3 (really 2, see below) parts of the code:
> (PID.TID 0000.0001)  Seconds in section "SOLVE_FOR_PRESSURE  [FORWARD_STEP]":
> (PID.TID 0000.0001)          User time:   658.3895404338837
> (PID.TID 0000.0001)        System time:   617.4503868818283
> (PID.TID 0000.0001)    Wall clock time:   11155.34450268745
> 
> (PID.TID 0000.0001)  Seconds in section "CG3D   [SOLVE_FOR_PRESSURE]":
> (PID.TID 0000.0001)          User time:   646.3093758821487
> (PID.TID 0000.0001)        System time:   474.6002185344696
> (PID.TID 0000.0001)    Wall clock time:   9480.753960132599
> 
> (PID.TID 0000.0001)  Seconds in section "BLOCKING_EXCHANGES  [FORWARD_STEP]":
> (PID.TID 0000.0001)          User time:   10.78002238273621
> (PID.TID 0000.0001)        System time:   67.32977628707886
> (PID.TID 0000.0001)    Wall clock time:   2222.101120233536

cg3d is called from solve_for_pressure, so most of the problem is actually in cg3d, but I assume that cg2d has a similar problem that just does not show up because cg2d uses much less time in general. All of these routines use a lot of MPI exchanges (and the cg?d solvers use global sums on top of that), and it appears that the model is waiting a lot (hence the much larger wall clock time compared to the user time). I do not know how to interpret the fairly large "System time" (usually it is 2-3 orders of magnitude smaller than the user time for solve_for_pressure and blocking_exchanges); I assume it also has something to do with the MPI communication.
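To put numbers on this: for SOLVE_FOR_PRESSURE the CPU time is roughly 658 + 617 = 1276 seconds against a wall clock time of 11155 seconds, so the process does useful work only about 11% of the time; for CG3D it is about 1121 of 9481 seconds (~12%), and for BLOCKING_EXCHANGES about 78 of 2222 seconds (~4%). In other words, in these sections the processes sit idle, presumably waiting for MPI, roughly 90% of the time.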

But these numbers suggest that something is going wrong in the MPI part. Can you verify independently of the MITgcm that the node has a tolerable MPI implementation that can in principle work (scale)? Next, you could try to narrow down the problem with a performance trace analysis that will give you the exact routines where the model spends most of its time. I suspect that it will actually be somewhere in the MPI library, but that's just a wild guess. Then there are flags in CPP_EEOPTIONS.h that might speed up the MPI part of the model (not the MPI library), but I hope that someone else can give you advice on those, because they are voodoo to me.
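As an illustration of the first point (this is not MITgcm code, just a generic two-rank ping-pong sketch; the file name, repetition count and message size are arbitrary), something like the following, compiled with mpicc and run with "mpirun -np 2", gives you a rough latency number independent of the model. If the reported round-trip time is orders of magnitude larger than what the interconnect should deliver, the MPI installation itself is suspect:

/* pingpong.c - minimal MPI latency sanity check (not part of MITgcm).
 * Compile:  mpicc pingpong.c -o pingpong
 * Run with exactly 2 ranks:  mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, nrep = 10000;
    double buf = 0.0, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < nrep; i++) {
        if (rank == 0) {
            /* rank 0 sends one double and waits for the echo */
            MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* rank 1 echoes the message back */
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g microseconds\n",
               (t1 - t0) / nrep * 1.0e6);

    MPI_Finalize();
    return 0;
}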

Martin

On Jul 21, 2011, at 5:30 PM, Stefano Querin wrote:

> Hi everybody,
> 
> I sent this e-mail some days ago directly to Jean-Michel and Constantinos to avoid overloading the list with too many (and too long...) mails. But most likely it went into spam... I don't know if somebody else is interested in this kind of issue.
> 
> In a few words, the model runs VERY slowly on an SGI H2106 node and we think something is still wrong in our settings.
> After receiving other colleagues' advice, we think that activating processor affinity and process-core binding could solve our problem (on this kind of architecture).
> Some people using the SGI H2106 experienced performance issues that were solved with process placement tools: they observed that the Numatools package (the numactl command) can bind processes to specific processors, improving runtime performance.
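> For example (just an illustration: the executable name and node numbers are placeholders, and with MPI one would normally use the launcher's own binding options or a small wrapper script around numactl):
> 
>   numactl --cpunodebind=0 --membind=0 ./mitgcmuv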
> 
> I'm carrying out further tests but any advice would be very useful.
> 
> Could these tools also improve performance on other kinds of platforms (maybe someone else might be interested)?
> 
> Thank you!
> 
> Cheers,
> 
> S.
> 
> P.S.: I'm sending a second part with some more statistics...



