[MITgcm-support] (no subject)
Malte Jansen
mfj at uchicago.edu
Thu Jul 14 14:26:36 EDT 2016
Steve,
At the end of the run the model should produce a summary of how much time was spent doing what, which should be written in the STDOUT file. It should look something like what I pasted below. It might be helpful to look at that. (In addition to all the things Jody already pointed out.)
Cheers,
Malte
------------------------------------------------------
Malte F Jansen
Assistant Professor
Department of the Geophysical Sciences
The University of Chicago
5734 South Ellis Avenue
Chicago, IL 60637 USA
(PID.TID 0000.0001) %CHECKPOINT 21900000 0021900000
(PID.TID 0000.0001) Seconds in section "ALL [THE_MODEL_MAIN]":
(PID.TID 0000.0001) User time: 81553.5515705012
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 83194.7516629696
(PID.TID 0000.0001) No. starts: 1
(PID.TID 0000.0001) No. stops: 1
(PID.TID 0000.0001) Seconds in section "INITIALISE_FIXED [THE_MODEL_MAIN]":
(PID.TID 0000.0001) User time: 0.133980002254248
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 1.00463199615479
(PID.TID 0000.0001) No. starts: 1
(PID.TID 0000.0001) No. stops: 1
(PID.TID 0000.0001) Seconds in section "THE_MAIN_LOOP [THE_MODEL_MAIN]":
(PID.TID 0000.0001) User time: 81553.4175904989
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 83193.7469999790
(PID.TID 0000.0001) No. starts: 1
(PID.TID 0000.0001) No. stops: 1
(PID.TID 0000.0001) Seconds in section "INITIALISE_VARIA [THE_MAIN_LOOP]":
(PID.TID 0000.0001) User time: 7.298800349235535E-002
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 0.855633974075317
(PID.TID 0000.0001) No. starts: 1
(PID.TID 0000.0001) No. stops: 1
(PID.TID 0000.0001) Seconds in section "MAIN LOOP [THE_MAIN_LOOP]":
(PID.TID 0000.0001) User time: 81553.3446024954
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 83192.8913462162
(PID.TID 0000.0001) No. starts: 1
(PID.TID 0000.0001) No. stops: 1
(PID.TID 0000.0001) Seconds in section "MAIN_DO_LOOP [THE_MAIN_LOOP]":
(PID.TID 0000.0001) User time: 81545.1413792670
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 83145.3943319321
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "FORWARD_STEP [MAIN_DO_LOOP]":
(PID.TID 0000.0001) User time: 81528.1345684826
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 83048.1237185001
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_STATEVARS_DIAGS [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 1078.10327297449
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 2827.88697195053
(PID.TID 0000.0001) No. starts: 14600000
(PID.TID 0000.0001) No. stops: 14600000
(PID.TID 0000.0001) Seconds in section "LOAD_FIELDS_DRIVER [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 876.897960990667
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 1301.42414379120
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "EXTERNAL_FLDS_LOAD [LOAD_FLDS_DRIVER]":
(PID.TID 0000.0001) User time: 28.5411767959595
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 49.7934198379517
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "RBCS_FIELDS_LOAD [I/O]":
(PID.TID 0000.0001) User time: 754.574070543051
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 1100.32427453995
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_ATMOSPHERIC_PHYS [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 35.8292319774628
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 49.0405611991882
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_OCEANIC_PHYS [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 8746.01205241680
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 9023.84966444969
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "THERMODYNAMICS [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 11385.7259706557
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 11436.7402439117
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DYNAMICS [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 14802.8334793448
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 14765.4700200558
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "UPDATE_SURF_DR [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 261.719356000423
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 286.261898517609
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "SOLVE_FOR_PRESSURE [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 34763.6573372781
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 36091.8884937763
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "MOM_CORRECTION_STEP [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 501.247010856867
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 506.126838922501
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "INTEGR_CONTINUITY [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 765.209014028311
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 747.568499565125
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "TRC_CORRECTION_STEP [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 47.1817142963409
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 49.4994082450867
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "BLOCKING_EXCHANGES [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 7571.84065386653
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 4694.35633111000
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_STATEVARS_TAVE [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 6.13304901123047
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 49.2227129936218
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "MONITOR [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 112.063203334808
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 214.197835445404
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_THE_MODEL_IO [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 9.37701416015625
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 62.9506850242615
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) Seconds in section "DO_WRITE_PICKUP [FORWARD_STEP]":
(PID.TID 0000.0001) User time: 8.42355561256409
(PID.TID 0000.0001) System time: 0.000000000000000E+000
(PID.TID 0000.0001) Wall clock time: 53.2940809726715
(PID.TID 0000.0001) No. starts: 7300000
(PID.TID 0000.0001) No. stops: 7300000
(PID.TID 0000.0001) // ======================================================
(PID.TID 0000.0001) // Tile <-> Tile communication statistics
(PID.TID 0000.0001) // ======================================================
(PID.TID 0000.0001) // o Tile number: 000001
(PID.TID 0000.0001) // No. X exchanges = 0
(PID.TID 0000.0001) // Max. X spins = 0
(PID.TID 0000.0001) // Min. X spins = 1000000000
(PID.TID 0000.0001) // Total. X spins = 0
(PID.TID 0000.0001) // Avg. X spins = 0.00E+00
(PID.TID 0000.0001) // No. Y exchanges = 0
(PID.TID 0000.0001) // Max. Y spins = 0
(PID.TID 0000.0001) // Min. Y spins = 1000000000
(PID.TID 0000.0001) // Total. Y spins = 0
(PID.TID 0000.0001) // Avg. Y spins = 0.00E+00
(PID.TID 0000.0001) // o Thread number: 000001
(PID.TID 0000.0001) // No. barriers = 2103458542
(PID.TID 0000.0001) // Max. barrier spins = 1
(PID.TID 0000.0001) // Min. barrier spins = 1
(PID.TID 0000.0001) // Total barrier spins = 2103458542
(PID.TID 0000.0001) // Avg. barrier spins = 1.00E+00
PROGRAM MAIN: Execution ended Normally
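(For a quick look at where the time goes, the "Wall clock time" entries in a summary like the one above can be tabulated with a short script. This is just a sketch against a few lines copied from the log; the exact STDOUT formatting may differ between builds, so adjust the patterns as needed.)

```python
import re

# A few lines from the MITgcm STDOUT timing summary above (wall clock seconds).
stdout_lines = """
(PID.TID 0000.0001) Seconds in section "THE_MAIN_LOOP [THE_MODEL_MAIN]":
(PID.TID 0000.0001) Wall clock time: 83193.7469999790
(PID.TID 0000.0001) Seconds in section "SOLVE_FOR_PRESSURE [FORWARD_STEP]":
(PID.TID 0000.0001) Wall clock time: 36091.8884937763
(PID.TID 0000.0001) Seconds in section "BLOCKING_EXCHANGES [FORWARD_STEP]":
(PID.TID 0000.0001) Wall clock time: 4694.35633111000
""".strip().splitlines()

def parse_sections(lines):
    """Return {section_name: wall_clock_seconds} from timing-summary lines."""
    times = {}
    name = None
    for line in lines:
        m = re.search(r'Seconds in section "([^"]+)"', line)
        if m:
            name = m.group(1)
            continue
        m = re.search(r"Wall clock time:\s+([\d.Ee+-]+)", line)
        if m and name:
            times[name] = float(m.group(1))
            name = None
    return times

times = parse_sections(stdout_lines)
total = times["THE_MAIN_LOOP [THE_MODEL_MAIN]"]
for name, t in times.items():
    print(f"{name:45s} {t:10.1f} s  {100 * t / total:5.1f}%")
```

In the log above, SOLVE_FOR_PRESSURE alone accounts for roughly 43% of the main-loop wall time, which is often the first place to look when communication costs are suspected.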
On Jul 14, 2016, at 10:45 AM, Stephen Cousins <steve.cousins at maine.edu> wrote:
Hi,
I'm trying to help researchers from the University of Maine run MITgcm. The model runs, but they think it should run much faster.
I have run or helped run many models while working for the Ocean Modeling Group however this is the first time I have encountered MITgcm.
With Rutgers ROMS there is a method of running a number of tiles per subdomain, and it seems that MITgcm can do that too. The reason for doing so with ROMS was (I believe) to get the tiles to fit in cache and thereby increase performance. Is that the reason for doing so with MITgcm? We have tried a number of combinations without much luck.
For testing, the full domain is 600 x 520 x 21; using 64 processes we get only 30 time steps per minute. I wondered if the domain was too small for that many processes, so I reduced the number of processes, but that didn't help. The plan is to triple the resolution in each horizontal direction and double it in the vertical.
Our cluster has nodes with Intel E5-2600v3 processors totaling 24 cores per node with FDR-10 Infiniband. The way the jobs were specified, some compute nodes had many processes (like 20) on them and some had only 1 or 2. I experimented and found that by using only 4 cores per node and only 48 cores total, it ran close to twice as fast as with 64 cores and a mix of core counts per node. To me this indicates that the inter-process communication is high and is saturating the memory bandwidth of the nodes with large process counts. That might point to the subdomains being too small (the halo region being a significant proportion of the subdomain), but in that case, when I decreased the run to 16 cores, I would have expected things to improve quite a bit. I haven't profiled the code yet; I thought it might be quicker to write to you first to get some information.
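(To put a rough number on the halo-overhead idea: assuming an overlap width of 3 cells per side, which is a common MITgcm setting — check OLx/OLy in your SIZE.h — the fraction of each tile's computed points that are overlap cells can be estimated as follows. This is only a back-of-the-envelope sketch for the 600 x 520 domain mentioned above.)

```python
def halo_fraction(nx, ny, px, py, olx=3, oly=3):
    """Fraction of computed points that are halo (overlap) cells for an
    nx-by-ny domain split into px-by-py tiles with overlap olx/oly."""
    snx, sny = nx // px, ny // py            # interior tile size per process
    interior = snx * sny
    padded = (snx + 2 * olx) * (sny + 2 * oly)  # interior plus halo
    return 1.0 - interior / padded

# 600 x 520 horizontal domain from the message, a few decompositions:
for px, py in [(8, 8), (4, 4), (2, 2)]:
    f = halo_fraction(600, 520, px, py)
    print(f"{px * py:3d} procs, tiles {600 // px}x{520 // py}: "
          f"halo overhead {f:.1%}")
```

With 64 processes (75 x 65 tiles) the halo is about 15% of the computed points, versus about 8% at 16 processes, which is consistent with smaller tiles paying a proportionally larger communication and redundant-computation cost.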
Can you please explain what the optimal layout is for performance? Is there an optimal size subdomain that you know of for these processors? Optimal number of tiles per subdomain? Also can you explain at a somewhat high level any other factors to consider when running the model to get better performance? Also, are there Intel Haswell CPU-specific compiler flags (we're using the Intel compilers with MVAPICH2) that you can recommend to us? Finally, is there a benchmark case where we can verify that we are getting the expected performance?
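(On the Haswell compiler-flag question: with the Intel compilers, one place such flags usually go is the FOPTIM line of the genmake2 build-options file. The fragment below is only a hedged illustration — the optfile name and the trade-offs should be verified against your compiler version and MITgcm setup.)

```shell
# In a copy of your genmake2 optfile (e.g. tools/build_options/linux_amd64_ifort),
# possible additions for Haswell (E5-2600v3) -- verify for your ifort version:
FOPTIM='-O2 -xCORE-AVX2 -align -ip'
# -xCORE-AVX2 : generate AVX2 code targeting Haswell
# Adding -fp-model source trades some speed for bit reproducibility.
```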
Thanks very much,
Steve
--
________________________________________________________________
Steve Cousins Supercomputer Engineer/Administrator
Advanced Computing Group University of Maine System
244 Neville Hall (UMS Data Center) (207) 561-3574
Orono ME 04469 steve.cousins at maine.edu
_______________________________________________
MITgcm-support mailing list
MITgcm-support at mitgcm.org
http://mitgcm.org/mailman/listinfo/mitgcm-support