[MITgcm-devel] SIZE.h matters on Columbia for CS510?
Dimitris Menemenlis
dmenemenlis at gmail.com
Fri Jul 11 12:12:32 EDT 2008
My guess (without confirmation) is that the speed-up is due to item 2
on your list, i.e., cache memory. With Chris we had a similar experience
running the 1/8th and 1/16th configurations: overall super-linear
speed-up in some cases as the processor count was increased, even
though the exchange, global_sum, and I/O routines slowed down. D.
Dimitris Menemenlis
DMenemenlis at gmail.com
On Jul 11, 2008, at 7:16 AM, Jean-Michel Campin wrote:
> Hello Hong,
>
> I don't know much, but on some platforms the "volume" of
> data that is sent/received when doing an exchange
> matters in terms of speed. This could be the case here (since
> Nr is not so small).
> The other possibility is that one tiling fits better into cache
> memory? Not sure about that (also difficult to tell since the
> number of tiles is different).
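> (Spelling out the overlap arithmetic from the SIZE.h numbers quoted
> below: both tilings carry the same overlap area per tile,
> (sNx+2*OLx)*(sNy+2*OLy) - sNx*sNy = 1344 points per level, but the
> 17x51 setup runs 4 tiles per CPU instead of 3, so roughly a third
> more data would go through the exchanges per CPU.)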
>
> Cheers,
> Jean-Michel
>
> On Thu, Jul 10, 2008 at 02:38:55PM -0700, Hong Zhang wrote:
>> Another interesting point for the CS510 runs:
>> the size of each tile has a significant impact on performance.
>> We have two 450-CPU runs.
>> One is (in SIZE.h):
>> & sNx = 34,
>> & sNy = 34,
>> & OLx = 8,
>> & OLy = 8,
>> & nSx = 3,
>> & nSy = 1,
>> & nPx = 450,
>> & nPy = 1
>> Our estimate of the per-point overhead for each CPU is
>> (sNx+2*OLx)*(sNy+2*OLy)/(sNx*sNy), i.e., 2.16 (a little check
>> program follows the second configuration below).
>> The other set is:
>> & sNx = 17,
>> & sNy = 51,
>> & OLx = 8,
>> & OLy = 8,
>> & nSx = 4,
>> & nSy = 1,
>> & nPx = 450,
>> & nPy = 1,
>> then the per-point overhead for each CPU is about 2.55.
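>> For the record, a minimal stand-alone Fortran check of this
>> arithmetic (not model code, just the two overhead numbers and
>> their ratio):
>>
>>       PROGRAM TILEOV
>> C     per-point overhead = padded tile area / interior tile area
>>       REAL ov1, ov2
>>       ov1 = REAL( (34+2*8)*(34+2*8) ) / REAL( 34*34 )
>>       ov2 = REAL( (17+2*8)*(51+2*8) ) / REAL( 17*51 )
>>       WRITE(*,*) 'overhead 34x34 tile:', ov1
>>       WRITE(*,*) 'overhead 17x51 tile:', ov2
>>       WRITE(*,*) 'ratio              :', ov2/ov1
>>       END
>>
>> It prints 2.16, 2.55, and 1.18: the 17x51 tiling touches about 18%
>> more points per CPU for the same interior work (3*34*34 = 4*17*51 =
>> 3468 interior points per CPU in both runs).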
>> So the first run is expected to be about 15% faster than the second,
>> but the actual runs show that the first is faster by much more than 15%.
>> Look at the time stamps for the first run:
>> 09:59 THETA.0000002232.data
>> 11:16 THETA.0000004320.data
>> 12:49 THETA.0000006552.data
>> 14:05 THETA.0000008712.data
>> about 82 minutes per monthly output.
>> While the second run is
>> 15:13 THETA.0000002232.data
>> 17:08 THETA.0000004320.data
>> 19:17 THETA.0000006552.data
>> 21:12 THETA.0000008712.data
>> about 2 hours per month.
>>
>> So the lesson is that we had better set up tiles close to square to
>> improve efficiency.
>>
>> Hong Zhang wrote:
>>> Following the thread of:
>>> "It seems like possible trouble with the global_sum or something like
>>> that? Would anyone have a suggestion as to what individual files we
>>> could try compiling with -O0 to proceed?"
>>> We ran many experiments, linking different combinations of object
>>> files generated with the -O2 or -O0 options. The major candidates
>>> were the global_sum and cg2d/cg3d routines, as suggested by Chris.
>>> We finally identified global_sum_tile.F as the file that breaks
>>> under optimization.
>>> As a remedy, we compile this one file with -O0 and all other files
>>> with -O3, and the 450-CPU config now works.
>>> The details of the opt-file are at
>>> http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
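>>> For anyone who wants the short version, the key lines use the
>>> standard genmake2 per-file override (a sketch only; see the URL
>>> above for the exact flags we use):
>>>
>>>    NOOPTFLAGS='-O0'
>>>    NOOPTFILES='global_sum_tile.F'
>>>
>>> Files listed in NOOPTFILES are compiled with NOOPTFLAGS instead of
>>> FOPTIM, so everything else keeps -O3.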
>>>
>>> thanks,
>>> hong
>>>
>>>
>>>
>>> Dimitris Menemenlis wrote:
>>>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been
>>>> around
>>>> for a long time and we have successfully used them on many
>>>> occasions
>>>> before.
>>>>
>>>> We have determined that the problem is most likely a compiler
>>>> optimization issue. In addition to being able to run
>>>> successfully on
>>>> 270 CPUs (but failing on 54, 216, and 450), the code will also run
>>>> successfully if we use -O0 optimization. We have tried -O0
>>>> successfully on both 54 and 450 CPUs.
>>>>
>>>> The way the model fails, when it fails, is the appearance of
>>>> randomly
>>>> distributed spikes in Eta, up to +/- 200 m, during the second time
>>>> step:
>>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>>>
>>>> Initial Eta does not contain these spikes:
>>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>>>
>>>> The spikes only appear in ETAN (at first). All the other model
>>>> prognostic variables (we have looked at THETA and SALT and
>>>> monitored
>>>> UVEL/VVEL) seem OK.
>>>>
>>>> The spikes are randomly distributed everywhere in the domain, i.e.,
>>>> they do not appear to be associated with edge effects of any sort.
>>>>
>>>> Has anyone ever seen a similar problem? It seems like possible
>>>> trouble with the global_sum or something like that. Would anyone
>>>> have a suggestion as to which individual files we could try
>>>> compiling with -O0 to proceed?
>>>>
>>>> Hong and Dimitris
>>>>
>>>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>>>
>>>>>
>>>>> Hi Hong,
>>>>>
>>>>> afaik,
>>>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>>>> are dependent on your domain decomposition on the cubed sphere,
>>>>> i.e. if you change that decomposition in SIZE.h
>>>>> (which seems to be what you did), you need to regenerate
>>>>> these two files so that all the tile and face neighbor information
>>>>> on the cube remains correct.
>>>>> The matlab script to do that is in
>>>>> utils/exch2/matlab-topology-generator/driver.m
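>>>>> For example (a hypothetical session; first edit the decomposition
>>>>> variables at the top of driver.m to match the new SIZE.h):
>>>>>
>>>>>    cd utils/exch2/matlab-topology-generator
>>>>>    matlab -nodisplay -r "driver; exit"
>>>>>
>>>>> and then copy the regenerated w2_e2setup.F and
>>>>> W2_EXCH2_TOPOLOGY.h into your code directory before rebuilding.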
>>>>>
>>>>> At least from your mail it sounds like you didn't do that,
>>>>> which would mean your problem is not a code-version problem.
>>>>>
>>>>> Hope this helps
>>>>> -Patrick
>>>>>
>>>>>
>>>>>
>>>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>>>
>>>>>> Dear all,
>>>>>> last time we reported a problem (attached here):
>>>>>> ---------
>>>>>> Something has happened to the code between checkpoint59l and the
>>>>>> current head branch, which makes it impossible to restart the
>>>>>> CS510 code. Any clues where we should look and which checkpoints
>>>>>> to test?
>>>>>> The job crashes on the third time step with:
>>>>>>
>>>>>>> WARNING: r*FacC < hFacInf at 3 pts : bi,bj,Thid,Iter= 1 1 1 218
>>>>>>> e.g. at i,j= 65 85 ; rStarFac,H,eta = -1.237739 4.755480E+03 -1.064152E+04
>>>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>>>
>>>>>> ---------
>>>>>> We found this problem is related to the configuration in SIZE.h
>>>>>> and w2_e2setup.F.
>>>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>>>> s216t_85x85/SIZE.h_54.
>>>>>> They all failed with the same error as above, but the config
>>>>>> s1350t_34x34/SIZE_270.h works.
>>>>>> For s216t_85x85/SIZE.h_54 we also switched off optimization
>>>>>> (setting FOPTIM = in the Makefile), but it hit the same problem.
>>>>>> We checked the output at the second time step
>>>>>> but didn't find an obvious overlap problem.
>>>>>> Does anyone have any clue?
>>>>>>
>>>>>> hong
>>>>>
>>>>> ---
>>>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel