[MITgcm-support] Baroclinic instability with MPI run

Noriyuki Yamamoto nymmto at kugi.kyoto-u.ac.jp
Mon Jan 26 06:14:21 EST 2015


Hi Jean-Michel,

Sorry for my late reply.
I have tried your suggestions and begun to think that MPI may not be
the main cause of the problem.
I've got some very confusing results.

Firstly, the verification experiment 'exp4' passed the test (using the
testreport script) with the same compiler and optfile.
For the MPI case, because our system requires submitting a job to a queue
in order to run with MPI, I manually compared the dynamic field statistics
in STDOUT.0000 to those in results/output.txt.
They showed good agreement.
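
For reference, this is roughly how I ran the check (the optfile name below
is a stand-in for my modified one, and the monitor field is just one example
of the statistics I compared):

    cd MITgcm/verification
    # serial check of exp4 against the reference output
    ./testreport -t exp4 -optfile ../tools/build_options/my_ifort_optfile
    # for the MPI build (run through the batch queue) I compared the
    # monitor statistics by hand afterwards, e.g.
    grep dynstat_theta_mean exp4/run/STDOUT.0000
    grep dynstat_theta_mean exp4/results/output.txt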

Secondly, I ran the model with lower compiler optimisation (-O1, -O0).
In the flat-bottomed case with MPI, with -O2 optimisation the baroclinic
instability has zonal wavenumber nPx (the number of tiles in the east-west
direction) and doesn't cascade up or down, whereas with -O1 and -O0
optimisation baroclinic instability doesn't occur in the first place
(at least not within the first 400 days).
With the -O2 option, the instability develops after 30 days of integration.
The model is forced by a westerly wind and by temperature restoring along
the northern and southern walls. The wind and restoring temperature are
both zonally constant.
The initial state is u, v, w = 0 and T = T_south (the restoring
temperature along the southern wall, which is the coldest) everywhere.
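
For these tests the only thing I changed in the optfile was the optimisation
line, roughly as follows (FC here is just a placeholder for the Cray/Intel
MPI wrapper, and the other flags are only representative of what my optfile
contains; the variable names follow the stock ifort optfiles):

    # fragment of my optfile (modified from linux_amd64_ifort+mpi_sal_oxford)
    FC='ftn'                    # placeholder for the MPI compiler wrapper
    DEFINES='-DWORDLENGTH=4'
    FFLAGS="$FFLAGS -convert big_endian -assume byterecl"
    # the default build used -O2; for the tests I only swapped this line:
    FOPTIM='-O1'                # or '-O0' for the least-optimised run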

I also tried lowering the compiler optimisation for the no-MPI run.
In the flat-bottomed case, the -O2 run shows baroclinic instability
which cascades up by 30 days (different from the MPI run, as I mentioned
in my previous mail), while the -O1 and -O0 runs don't show any
instability waves.
On the other hand, in the wavy (zonally sinusoidal) topography case,
baroclinic instability cascades up in the same way regardless of the
optimisation level.

Verification tests of exp4 with the three levels of optimisation also passed.
So perhaps some of my settings that aren't used in exp4, or a combination
of them, cause this problem?

In case it helps, I also tried changing (nSx, nPx) and setting
GLOBAL_SUM_SEND_RECV in CPP_EEOPTIONS.h, though still with -O2 optimisation.
Runs with (nSx, nPx) = (1, 16) and (2, 8) show the same pattern of
surface temperature evolution, whose zonal wavenumber stays constant at 16.
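
Concretely, the only differences between those two runs were the tile
decomposition in SIZE.h and the CPP flag; something like the following
(I edited the files by hand, the sed patterns are only to show the intent):

    # (nSx, nPx) = (1, 16): 16 MPI processes, 1 tile each
    # (nSx, nPx) = (2,  8):  8 MPI processes, 2 tiles each (same sNx, nSy, nPy)
    sed -i -e 's/nSx *= *1/nSx =   2/' -e 's/nPx *= *16/nPx =   8/' code/SIZE.h
    # make global sums independent of the number of processes (default is #undef)
    sed -i 's/#undef  *GLOBAL_SUM_SEND_RECV/#define GLOBAL_SUM_SEND_RECV/' code/CPP_EEOPTIONS.h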

Should my next step be a debug run?
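
If so, I was thinking of rebuilding with the usual ifort debugging flags,
roughly like this (just a sketch, I haven't run it yet; the genmake2 path,
mods dir and optfile name are from my own setup):

    cd build
    make CLEAN
    ../../tools/genmake2 -mpi -mods=../code -of=../my_optfile
    make depend
    # override the -O2 optimisation with debug/checking flags for the whole build
    make FOPTIM='-O0 -g -traceback -check bounds -fpe0'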

Any suggestions will be greatly appreciated,
Noriyuki.

On 2015/01/23 0:28, Noriyuki Yamamoto wrote:
> Hi Jean-Michel,
>
> Thank you for your quick reply and suggestions!
> I'm compiling with an optfile I modified from linux_amd64_ifort+mpi_sal_oxford.
> I attach my optfile here.
>
>
>
> I'll try your suggestions tomorrow.
> Does "#define GLOBAL_SUM_SEND_RECV" create the global output files or just check MPI operations?
>
> Thanks,
> Noriyuki
>
> On 2015/01/22, at 23:09, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>
>> Hi Noriyuki,
>>
>> which optfile are you compiling with ?
>>
>> Otherwise, a few other things here:
>>
>> 1) Although I asked Chris for a full report about the set-up in order to
>> reproduce it (easy since I have access to the same computer), to my knowledge
>> the "Independ Tiling" problem has never been reproducible.
>>
>> 2) One potential problem could be compiler optimisation.
>> To clarify this point, you could:
>> a) with the same compiler, MPI and optfile, try to run a few simple
>> verification experiments (e.g., exp4) and compare the output
>> with the reference output (e.g., exp4/results/output.txt).
>> There is a script (verification/testreport) that does that for all
>> or a sub-set of the experiments and might not be too difficult to use
>> (testreport -h for a list of options).
>> b) you could try to lower the level of compiler optimisation.
>> The default is "-O2" (from the linux_amd64_ifort11 optfile); you could try
>> "-O1" (it will run slower) and "-O0" (even slower).
>> If "-O0" fixes the problem, then we should try to find which
>> routine causes the problem and compile just that one with "-O0"
>> (since "-O0" for all the src code is far too slow).
>>
>> 3) Another source of problems could be the code itself. This is not
>> very likely with most standard options and pkgs (since they are
>> tested on a regular basis) but can definitely happen.
>> a) you can check whether it's due to a tiling problem or an MPI problem
>> simply by running with the same sNx but decreasing nPx while
>> increasing nSx (to maintain the same number of tiles = nSx*nPx).
>> If you compile with "#define GLOBAL_SUM_SEND_RECV" in CPP_EEOPTIONS.h
>> (slower, but it makes the "global-sum" results independent of the number
>> of processors, though still dependent on the domain tiling) and run the
>> 2 cases (with different nPx), you could expect to get the same results.
>> b) if all the previous suggestions do not help, you could provide a
>> copy of your set-up (checkpoint64u is fairly recent) so that we can
>> try to reproduce it. You could start with your customized code dir
>> and set of parameter files (data*).
>>
>> Cheers,
>> Jean-Michel
>>
>> On Thu, Jan 22, 2015 at 09:02:18PM +0900, Noriyuki Yamamoto wrote:
>>> Hi all,
>>>
>>> I'm running into a problem with an MPI run.
>>> Outputs from the MPI and no-MPI runs differ qualitatively.
>>>
>>> This seems to be the same problem reported in "Independ Tiling"
>>> thread (http://forge.csail.mit.edu/pipermail/mitgcm-support/2014-March/009017.html).
>>> Is there any progress about it?
>>> If not, I hope this information will add some clues to fixing up.
>>>
>>> The model is a zonally periodic channel forced by a westerly wind and
>>> by temperature restoring along the northern and southern walls at
>>> mid-to-high latitudes.
>>> I tested some cases with different topography.
>>> In the flat-bottomed case without MPI, baroclinic instability develops
>>> and cascades up to larger scales.
>>> But with MPI, the zonal wavenumber of the baroclinic instability stays
>>> fixed at nPx during the 3000-day integration (I tested nPx = 16, 20)
>>> and doesn't cascade up.
>>> In the zonally wavy topography case (a sinusoidal wave whose wavenumber k
>>> is not nPx), using the same MPI executable (compiled by genmake -mpi) as
>>> in the flat-bottomed case, the baroclinic instability cascades up and the
>>> result seems similar to that of the no-MPI run, though I only checked the
>>> early surface temperature distribution.
>>>
>>> In the MPI runs I tried two patterns of (nPx, nPy) = (16, 4) and (20, 4).
>>> I compiled the MITgcm code with Intel compiler 13.1.3 and the Cray MPI
>>> library 6.3.0 on SUSE Linux Enterprise Server 11 (x86_64).
>>> The MITgcm version is checkpoint64u (sorry that it's not the latest version).
>>> If necessary, I will attach data and SIZE.h files later.
>>>
>>> Sorry for my poor English.
>>> Noriyuki.
>>>
>>> -- 
>>> Noriyuki Yamamoto
>>> PhD Student - Physical Oceanography Group
>>> Division of Earth and Planetary Sciences,
>>> Graduate School of Science, Kyoto University.
>>> Mail:nymmto at kugi.kyoto-u.ac.jp
>>> Tel:+81-75-753-3924


-- 
Noriyuki Yamamoto

2nd-year PhD student, Physical Oceanography Group, Department of Geophysics,
Division of Earth and Planetary Sciences,
Graduate School of Science, Kyoto University
Mail:nymmto at kugi.kyoto-u.ac.jp
Tel:075-753-3924
