[MITgcm-devel] SIZE.h matters on Columbia for CS510?
Jean-Michel Campin
jmc at ocean.mit.edu
Fri Jul 11 10:16:21 EDT 2008
Hello Hong,
I don't know much, but on some platforms the "volume" of
data that is sent/received when doing an exchange
matters in terms of speed. This could be the case here (since
Nr is not so small).
Another possibility is a better fit into the cache memory? Not sure
of that (also difficult to tell since the number of tiles is
different).
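To put rough numbers on the exchange volume for the two decompositions quoted below, here is a minimal Python sketch. It assumes that every exchange transfers the full overlap region of each tile (width OLx/OLy on all four sides, corners included) and counts cells per vertical level. The function name is hypothetical, and the real exchange routines may move less data than this.

```python
def halo_cells_per_process(sNx, sNy, OLx, OLy, nSx, nSy):
    """Overlap (halo) cells per vertical level held by one process.

    Assumption: the whole overlap ring of every tile is exchanged.
    """
    padded = (sNx + 2 * OLx) * (sNy + 2 * OLy)   # tile including overlaps
    interior = sNx * sNy                         # tile without overlaps
    return (padded - interior) * nSx * nSy


run1 = halo_cells_per_process(34, 34, 8, 8, nSx=3, nSy=1)   # 34x34 tiles
run2 = halo_cells_per_process(17, 51, 8, 8, nSx=4, nSy=1)   # 17x51 tiles
print(run1, run2, round(run2 / run1, 2))
```

Under that assumption the halo per tile is the same for both tile shapes (1344 cells per level), so the per-process difference comes from holding 4 tiles instead of 3. Multiply by Nr for a 3-D field.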
Cheers,
Jean-Michel
On Thu, Jul 10, 2008 at 02:38:55PM -0700, Hong Zhang wrote:
> Another interesting point for the CS510 runs:
> The size of each tile has a significant impact on performance.
> We have two 450-CPU runs.
> The first uses (in SIZE.h)
> & sNx = 34,
> & sNy = 34,
> & OLx = 8,
> & OLy = 8,
> & nSx = 3,
> & nSy = 1,
> & nPx = 450,
> & nPy = 1
> Our estimate of the per-CPU burden is
> (sNx+2*OLx)*(sNy+2*OLy)/sNx/sNy, i.e., 2.16
> The other set is
> & sNx = 17,
> & sNy = 51,
> & OLx = 8,
> & OLy = 8,
> & nSx = 4,
> & nSy = 1,
> & nPx = 450,
> & nPy = 1,
> for which the per-CPU burden is about 2.55.
> The first run is therefore expected to be about 15% faster than the second.
> But the real runs show that the first run is faster by much more than 15%.
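The burden estimate quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the message; the function name is hypothetical, and it assumes that run time scales with the padded tile area.

```python
def burden(sNx, sNy, OLx, OLy):
    """Ratio of padded tile area (interior + overlaps) to interior area."""
    return (sNx + 2 * OLx) * (sNy + 2 * OLy) / (sNx * sNy)


b1 = burden(34, 34, 8, 8)        # first run, 34x34 tiles
b2 = burden(17, 51, 8, 8)        # second run, 17x51 tiles
expected_saving = 1 - b1 / b2    # fraction of time saved if cost ~ burden
print(f"{b1:.2f} {b2:.2f} {expected_saving:.0%}")
```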
> Look at the time stamps for the first run:
> 09:59 THETA.0000002232.data
> 11:16 THETA.0000004320.data
> 12:49 THETA.0000006552.data
> 14:05 THETA.0000008712.data
> That is about 82 minutes per monthly output.
> For the second run:
> 15:13 THETA.0000002232.data
> 17:08 THETA.0000004320.data
> 19:17 THETA.0000006552.data
> 21:12 THETA.0000008712.data
> about 2 hours per month.
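The intervals can be checked directly from the time stamps listed above. A minimal sketch, assuming all stamps of a run fall on the same day (the helper name is hypothetical):

```python
from datetime import datetime


def mean_interval_minutes(stamps):
    """Average gap in minutes between consecutive HH:MM time stamps."""
    times = [datetime.strptime(s, "%H:%M") for s in stamps]
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


run1 = mean_interval_minutes(["09:59", "11:16", "12:49", "14:05"])
run2 = mean_interval_minutes(["15:13", "17:08", "19:17", "21:12"])
print(round(run1), round(run2), f"{1 - run1 / run2:.0%}")
```

This gives about 82 and 120 minutes per output interval, so the first run takes roughly 31% less time, about twice the 15% predicted from the burden ratio.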
>
> So the conclusion is that we should use square tiles to improve
> efficiency.
>
> Hong Zhang wrote:
>> Following the thread of "
>> It seems like possible trouble with the global_sum or something like
>> that? Would anyone have a suggestion as to what individual files we
>> could try compiling with -O0 to proceed? "
>> We tried many experiments, linking object files generated with
>> either -O2 or -O0. The major candidates were the global-sum and
>> cg2d/cg3d routines, as suggested by Chris.
>> Finally we identified global_sum_tile.F as the file that breaks
>> under optimization.
>> As a remedy, we compile this file with -O0 and all other files
>> with -O3.
>> It now works for the 450-CPU configuration.
>> The details of the opt-file are at
>> http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
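The search described above (mixing -O0 and optimized object files until the offending file is isolated) can be organized as a bisection. The following Python sketch is not the procedure actually used in this thread; it assumes exactly one culprit file, and `run_ok` is a hypothetical callback standing in for a full build-and-run cycle.

```python
def find_culprit(files, run_ok):
    """Bisect for the single file that must be built at -O0.

    files  : candidate source file names
    run_ok : hypothetical callback; run_ok(o0_files) should build with the
             given files at -O0 and the rest optimized, run the model, and
             return True if the run is clean.
    Assumes exactly one culprit, so run_ok(files) would be True.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(half):
            # De-optimizing this half cures the run: culprit is inside it.
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Illustrative stand-in for a real build-and-run cycle.
def fake_run_ok(o0_files):
    return "global_sum_tile.F" in o0_files


suspects = ["cg2d.F", "cg3d.F", "global_sum_tile.F", "w2_e2setup.F"]
culprit = find_culprit(suspects, fake_run_ok)
print(culprit)
```

With N candidate files this needs about log2(N) build-and-run cycles, which matters when each test is a multi-hundred-CPU job.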
>>
>>
>>
>> thanks,
>> hong
>>
>>
>>
>> Dimitris Menemenlis wrote:
>>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been around
>>> for a long time and we have successfully used them on many occasions
>>> before.
>>>
>>> We have determined that the problem is most likely a compiler
>>> optimization issue. In addition to being able to run successfully on
>>> 270 CPUs (but failing on 54, 216, and 450), the code will also run
>>> successfully if we use -O0 optimization. We have tried -O0
>>> successfully on both 54 and 450 CPUs.
>>>
>>> The way the model fails, when it fails, is the appearance of randomly
>>> distributed spikes in Eta, up to +/- 200 m, during the second time
>>> step:
>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>>
>>> Initial Eta does not contain these spikes:
>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>>
>>> The spikes only appear in ETAN (at first). All the other model
>>> prognostic variables (we have looked at THETA and SALT and monitored
>>> UVEL/VVEL) seem OK.
>>>
>>> The spikes are randomly distributed everywhere in the domain, i.e.,
>>> they do not appear to be associated with edge effects of any sort.
>>>
>>> Has anyone ever seen a similar problem? It seems like possible
>>> trouble with the global_sum or something like that? Would anyone
>>> have a suggestion as to what individual files we could try compiling
>>> with -O0 to proceed?
>>>
>>> Hong and Dimitris
>>>
>>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>>
>>>>
>>>> Hi Hong,
>>>>
>>>> afaik,
>>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>>> are dependent on your domain decomposition on the cubed sphere,
>>>> i.e. if you change that decomposition in SIZE.h
>>>> (which seems to be what you did), you need to regenerate
>>>> these two files so that all tile and face neighbor information
>>>> on the cube remain correct.
>>>> The matlab script to do that is in
>>>> utils/exch2/matlab-topology-generator/driver.m
>>>>
>>>> At least from your mail it sounds like you didn't do that.
>>>> And it means your problem is not a code version problem.
>>>>
>>>> Hope this helps
>>>> -Patrick
>>>>
>>>>
>>>>
>>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>>
>>>>> Dear all,
>>>>> last time we reported a problem (attached here:
>>>>> ---------
>>>>> Something has happened to code from checkpoint59l to current head
>>>>> branch, which makes it impossible to restart CS510 code. Any
>>>>> clues where we should look and what checkpoints to test?
>>>>> Job crashes on third time step with
>>>>>
>>>>>> WARNING: r*FacC < hFacInf at 3 pts : bi,bj,Thid,Iter= 1 1 1 218
>>>>>> e.g. at i,j= 65 85 ; rStarFac,H,eta = -1.237739 4.755480E+03 -1.064152E+04
>>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>>
>>>>> ---------
>>>>> We found that this problem is related to the configuration of
>>>>> SIZE.h and w2_e2setup.F.
>>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>>> s216t_85x85/SIZE.h_54.
>>>>> They all failed with the same error as mentioned above.
>>>>> But the s1350t_34x34/SIZE_270.h configuration works.
>>>>> For s216t_85x85/SIZE.h_54 we also switched off the optimization
>>>>> (setting FOPTIM = in the Makefile) but it has the same problem.
>>>>> We checked the output at the second time step
>>>>> but didn't find any obvious overlap problem.
>>>>> Does anyone have any clue?
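A quick arithmetic sanity check of the four configurations named above is sketched below in Python. It assumes that CS510 means six faces of 510 x 510 cells and that a directory name such as s216t_85x85 means 216 tiles of 85 x 85 cells; both readings are assumptions, as is the helper name. It only checks that the tile counts are consistent and says nothing about whether w2_e2setup.F and W2_EXCH2_TOPOLOGY.h match the decomposition.

```python
FACE = 510                  # assumed: cells along one edge of a cube face
TOTAL = 6 * FACE * FACE     # cells per level on the whole cube


def tiles_per_cpu(n_tiles, sNx, sNy, n_cpus):
    """Return tiles per process, or raise if the numbers are inconsistent."""
    if FACE % sNx or FACE % sNy:
        raise ValueError("tile size does not divide a face evenly")
    if n_tiles * sNx * sNy != TOTAL:
        raise ValueError("tiles do not cover the cube exactly")
    if n_tiles % n_cpus:
        raise ValueError("tiles do not divide evenly among processes")
    return n_tiles // n_cpus


# (number of tiles, sNx, sNy, number of CPUs), read off the names above
configs = {
    "s216t_85x85/SIZE.h_216": (216, 85, 85, 216),
    "s1800t_17x51/SIZE.h_450": (1800, 17, 51, 450),
    "s216t_85x85/SIZE.h_54": (216, 85, 85, 54),
    "s1350t_34x34/SIZE_270.h": (1350, 34, 34, 270),
}
result = {name: tiles_per_cpu(*cfg) for name, cfg in configs.items()}
for name, n in result.items():
    print(name, n)
```

All four pass this check, so under these assumptions the failures are not explained by tile counts alone.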
>>>>>
>>>>> hong
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>
>>>> ---
>>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>>
>>>>
>>>
>>>
>>