[MITgcm-devel] SIZE.h matters on Columbia for CS510?
Hong Zhang
hong.zhang at caltech.edu
Thu Jul 10 17:38:55 EDT 2008
Another interesting point for the CS510 runs:
The size of each tile has a significant impact on performance.
We have two 450-cpu runs.
One is configured as follows (in SIZE.h):
& sNx = 34,
& sNy = 34,
& OLx = 8,
& OLy = 8,
& nSx = 3,
& nSy = 1,
& nPx = 450,
& nPy = 1
Our estimate of the per-cpu burden, i.e. the total tile area including
overlaps divided by the interior area, is (sNx+2*OLx)*(sNy+2*OLy)/(sNx*sNy),
i.e. 2.16.
The other set is
& sNx = 17,
& sNy = 51,
& OLx = 8,
& OLy = 8,
& nSx = 4,
& nSy = 1,
& nPx = 450,
& nPy = 1,
which gives a per-cpu burden of about 2.55.
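As a quick sanity check of these burden numbers, here is a small Python
sketch (Python rather than Fortran, purely for the arithmetic; the tile
sizes and overlaps are the ones from the two SIZE.h excerpts above):

def burden(sNx, sNy, OLx=8, OLy=8):
    # total tile area including overlap halos / interior (useful) area
    return (sNx + 2 * OLx) * (sNy + 2 * OLy) / (sNx * sNy)

b1 = burden(34, 34)          # first run,  34x34 tiles -> ~2.16
b2 = burden(17, 51)          # second run, 17x51 tiles -> ~2.55
print(b1, b2, 1 - b1 / b2)   # ~2.16, ~2.55, ~0.15 -> first run ~15% cheaper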
The first run is therefore expected to be about 15% faster than the second.
But the real runs show that the first run is faster by much more than 15%.
Look at the time stamps for the first run:
09:59 THETA.0000002232.data
11:16 THETA.0000004320.data
12:49 THETA.0000006552.data
14:05 THETA.0000008712.data
about 82 minutes per monthly output.
While the time stamps for the second run are
15:13 THETA.0000002232.data
17:08 THETA.0000004320.data
19:17 THETA.0000006552.data
21:12 THETA.0000008712.data
about 2 hours (120 minutes) per monthly output.
So the lesson is that we had better set up square (or near-square) tiles,
which minimize the overlap-to-interior ratio, to improve efficiency.
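To put numbers on "much more than 15%", here is a small Python sketch that
averages the intervals between the monthly THETA outputs listed above (the
time stamps are taken straight from the listings; the same day is assumed
within each run):

from datetime import datetime

run1 = ["09:59", "11:16", "12:49", "14:05"]   # 34x34 tiles
run2 = ["15:13", "17:08", "19:17", "21:12"]   # 17x51 tiles

def minutes_per_month(stamps):
    t = [datetime.strptime(s, "%H:%M") for s in stamps]
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(t, t[1:])]
    return sum(gaps) / len(gaps)

m1 = minutes_per_month(run1)   # ~82 minutes per monthly output
m2 = minutes_per_month(run2)   # ~120 minutes per monthly output
print(m1, m2, m2 / m1)         # observed ratio ~1.46, versus ~1.18 predicted
                               # by the burden estimate, so the square tiles
                               # do even better than the estimate suggests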
Hong Zhang wrote:
> Following the thread of "It seems like possible trouble with the
> global_sum or something like that? Would anyone have a suggestion as to
> what individual files we could try compiling with -O0 to proceed?"
> We tried many experiments, linking in different object files generated
> with the -O2 or -O0 options. The major candidates were the global_sum
> and cg2d/cg3d routines, as suggested by Chris.
> Finally we identified global_sum_tile.F as the file that breaks under
> optimization.
> So as a remedy, we compile this file with -O0 while all other files
> are compiled with -O3.
> And now it works for the 450-cpu config.
> The details of the opt-file are at
> http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
>
>
> thanks,
> hong
>
>
>
> Dimitris Menemenlis wrote:
>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been around
>> for a long time and we have successfully used them on many occasions
>> before.
>>
>> We have determined that the problem is most likely a compiler
>> optimization issue. In addition to being able to run successfully on
>> 270 CPUs (but failing on 54, 216, and 450), the code will also run
>> successfully if we use -O0 optimization. We have tried -O0
>> successfully on both 54 and 450 CPUs.
>>
>> The way the model fails, when it fails, is the appearance of randomly
>> distributed spikes in Eta, up to +/- 200 m, during the second time step:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>
>> Initial Eta does not contain these spikes:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>
>> The spikes only appear in ETAN (at first). All the other model
>> prognostic variables (we have looked at THETA and SALT and monitored
>> UVEL/VVEL) seem OK.
>>
>> The spikes are randomly distributed everywhere in the domain, i.e.,
>> they do not appear to be associated with edge effects of any sort.
>>
>> Has anyone ever seen a similar problem? It seems like possible
>> trouble with the global_sum or something like that? Would anyone
>> have a suggestion as to what individual files we could try compiling
>> with -O0 to proceed?
>>
>> Hong and Dimitris
>>
>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>
>>>
>>> Hi Hong,
>>>
>>> afaik,
>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>> are dependent on your domain decomposition on the cubed sphere,
>>> i.e. if you change that decomposition in SIZE.h
>>> (which seems to be what you did), you need to regenerate
>>> these two files so that all tile and face neighbor information
>>> on the cube remains correct.
>>> The matlab script to do that is in
>>> utils/exch2/matlab-topology-generator/driver.m
>>>
>>> At least from your mail it sounds like you didn't do that.
>>> And it means your problem is not a code version problem.
>>>
>>> Hope this helps
>>> -Patrick
>>>
>>>
>>>
>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>
>>>> Dear all,
>>>> last time we reported a problem (attached here:
>>>> ---------
>>>> Something has happened to the code from checkpoint59l to the current
>>>> head branch, which makes it impossible to restart the CS510 code. Any
>>>> clues where we should look and which checkpoints to test?
>>>> Job crashes on third time step with
>>>>
>>>>> WARNING: r*FacC < hFacInf at 3 pts : bi,bj,Thid,Iter= 1 1 1 218
>>>>> e.g. at i,j= 65 85 ; rStarFac,H,eta = -1.237739 4.755480E+03 -1.064152E+04
>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>
>>>> ---------
>>>> We found that this problem is related to the configuration in SIZE.h
>>>> and w2_e2setup.F.
>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>> s216t_85x85/SIZE.h_54.
>>>> They all failed and caused the same error as mentioned above.
>>>> But the config of s1350t_34x34/SIZE_270.h is workable.
>>>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>>>> (by setting FOPTIM = in the Makefile), but it has the same problem.
>>>> We checked the output at the second time step
>>>> but didn't find an obvious overlap problem.
>>>> Does anyone have any clue?
>>>>
>>>> hong
>>>
>>> ---
>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>
>>>
>>
>>
>