[MITgcm-support] Coupled model running!

Jean-Michel Campin jmc at ocean.mit.edu
Thu Oct 11 18:32:46 EDT 2012


Hi Taimaz,

Today I added some "flush" calls, which might help figure out
where the problem is.
If you update your code and add
> debugMode=.TRUE.,
in both rank_1/eedata and rank_2/eedata,
we should get the last messages printed to the standard output.
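The eedata edit above can be sketched in shell (GNU sed assumed; the `demo/` copies below are stand-ins for the real rank_1/ and rank_2/ run directories, whose eedata files contain more parameters than shown):

```shell
# Throwaway demo of inserting debugMode into the EEPARMS namelist.
# Real files: rank_1/eedata and rank_2/eedata.
mkdir -p demo/rank_1 demo/rank_2
for d in demo/rank_1 demo/rank_2; do
  printf ' &EEPARMS\n &\n' > "$d/eedata"               # minimal stand-in eedata
  sed -i '/&EEPARMS/a\ debugMode=.TRUE.,' "$d/eedata"  # add line after the header
done
cat demo/rank_1/eedata
```

With debugMode on, each component should echo its progress, so the last lines of STDOUT.0000 show roughly where it hangs.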

Could you try again (and send me, for each case, the two STDOUT.0000
files, as you did yesterday):
a) with the default (short) integration length (5/40 time-steps),
b) and also with the 0-time-step run.

Also, it is worth checking that the two STDERR.0000 files are empty.

Cheers,
Jean-Michel

On Wed, Oct 10, 2012 at 02:06:12PM -0400, Jean-Michel Campin wrote:
> Hi Taimaz,
> 
> What I found curious (and unexplained) is that:
> - when trying to run for 5/40 ocean/atmos time-steps, the atmos (rank_2)
>   already starts the main iteration loop before spinning:
> > > > (PID.TID 0000.0001) SOLVE_FOR_PRESSURE: putPmEinXvector =    F
> > > >  cg2d: Sum(rhs),rhsMax =   3.97903932025656E-12  4.56406615217961E+03
> > > >  cg2d: Sum(rhs),rhsMax =  -1.00612851383630E-11  9.58133938764946E+03
>   whereas the ocean does not seem to reach this stage (it is still in the initialisation part);
> - but when trying to run for zero iterations (your last test), the ocean finishes,
>   but apparently the atmos does not reach the end of the initialisation.
> 
> Presently looking at how to add a "call flush" to help with debugging.
> 
> Cheers,
> Jean-Michel
> 
> On Wed, Oct 10, 2012 at 02:14:08PM -0230, taimaz.bahadory wrote:
> > Hi Jean;
> > 
> > Nothing has changed other than the netCDF options. I have also tried running
> > without netCDF enabled, but the same problem occurred.
> > It's still stuck. I have attached those three files again here, plus the
> > "Coupler.0000.clog" file from the "rank_0" directory.
> > The program is still running (consuming CPUs) but these files are not
> > being updated anymore, nor are any other files.
> > I can even send the whole directory if you wish!
> > 
> > 
> > 
> > 
> > On Wed, Oct 10, 2012 at 1:07 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > 
> > > Hi Taimaz,
> > >
> > > Apart from turning on "useMNC=.TRUE.,", did you change any other
> > > parameters/files from the standard set-up in verification/cpl_aim+ocn ?
> > >
> > > And one thing you could try (since the atmosphere seems to get to the first
> > > iteration, but it is hard to tell anything about the ocean, since the buffer
> > > may not be flushed often enough) would be to run for zero iterations,
> > > changing:
> > > rank_0/data, 1st line, replacing 5 with 0
> > > rank_1/data, replacing nTimeSteps=5,  with nTimeSteps=0,
> > > rank_2/data, replacing nTimeSteps=40, with nTimeSteps=0,
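The three one-line edits above can be sketched as shell commands (GNU sed assumed; the `demo/` stand-in files below carry only the lines being changed, while the real rank_*/data files hold full namelists):

```shell
# Stand-in "data" files containing only the lines to change.
mkdir -p demo/rank_0 demo/rank_1 demo/rank_2
printf '5\n'               > demo/rank_0/data   # 1st line of the coupler's data file
printf ' nTimeSteps=5,\n'  > demo/rank_1/data
printf ' nTimeSteps=40,\n' > demo/rank_2/data
# The three edits (run against rank_*/data in a real run directory):
sed -i '1s/^5$/0/'                       demo/rank_0/data
sed -i 's/nTimeSteps=5,/nTimeSteps=0,/'  demo/rank_1/data
sed -i 's/nTimeSteps=40,/nTimeSteps=0,/' demo/rank_2/data
```

A zero-time-step run still exercises the full initialisation, so it separates start-up problems from time-stepping problems.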
> > >
> > > Cheers,
> > > Jean-Michel
> > >
> > > On Tue, Oct 09, 2012 at 06:11:27PM -0230, taimaz.bahadory wrote:
> > > > Here is the "std_outp" contents:
> > > >
> > > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > > >  ROmap:    1  599  598 0.100280
> > > >  ROmap: 3644 4402 4403 0.169626
> > > >
> > > >
> > > > and the last lines of two STDOUT files:
> > > >
> > > > rank_1:
> > > > ...
> > > > ...
> > > >
> > > > ++++++++++++++++++++++vtrv++++++....++++++++++++++++++++++++yxxyyyyy....q++py++zyxzzzzzyxxxwsuwwxxxx++++....+++++++++++++++++++++++yyxyzzzyy....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > > (PID.TID 0000.0001)      4
> > > >
> > > > ..++++++++++++++++++++++yyy+++++++....++++++++++++++++++++++++yyyzzzzz....snnny++zyyyyzzywvtuvtvwwxxwx+++y....++++++++++++++++++w+++xxyyzzzzzz....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > > (PID.TID 0000.0001)      3
> > > >
> > > > ..++++++++++++++++++++++++++++++++....+++++++++++++++++++++++++++++++z....tmnvz+zzzyxxxxwqpnotsuvwyxwy++++....y++++++++++++++++wv+++zyzzzzzyz+....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > > (PID.TID 0000.0001)      2
> > > >
> > > > ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....unny+zzzzzxxwwrnmmmqsuvvxxw+++++....y++++++++++++++++w++++zyyz+zyyy+....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > > (PID.TID 0000.0001)      1
> > > >
> > > > ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....uooy+yyzzyxwwqoomlmqtvutwyy+++++....y++++++++++++++++y++++zyyyzzzzz+....++++++++++++++++++++++++++++++++....++++++++++++++++++y+++++++++++++..
> > > >
> > > >
> > > > rank_2:
> > > > ...
> > > > ...
> > > > (PID.TID 0000.0001) // Model current state
> > > > (PID.TID 0000.0001) //
> > > > =======================================================
> > > > (PID.TID 0000.0001)
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: albedo_cs32.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: vegetFrc.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > > (PID.TID 0000.0001) SOLVE_FOR_PRESSURE: putPmEinXvector =    F
> > > >  cg2d: Sum(rhs),rhsMax =   3.97903932025656E-12  4.56406615217961E+03
> > > >  cg2d: Sum(rhs),rhsMax =  -1.00612851383630E-11  9.58133938764946E+03
> > > >
> > > >
> > > >
> > > > Thanks again for your time
> > > >
> > > >
> > > >
> > > > On Tue, Oct 9, 2012 at 5:01 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > >
> > > > > Hi Taimaz,
> > > > >
> > > > > Looks like everything is normal. The next things to check:
> > > > >
> > > > > Could you check and send the content of the file:
> > > > >   cpl_aim+ocn/std_out
> > > > > which is the standard output of mpirun?
> > > > >
> > > > > And can you also check the last part of the ocean & atmos STDOUT files:
> > > > >  rank_1/STDOUT.0000
> > > > >  rank_2/STDOUT.0000
> > > > > to see where each component is stuck?
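Checking where each component is stuck amounts to looking at the tail of each log; a small sketch on throwaway files (in a real run directory the single command in the first comment is enough):

```shell
# In a real run: tail -n 20 rank_1/STDOUT.0000 rank_2/STDOUT.0000
# Throwaway stand-in logs to demonstrate:
mkdir -p demo/rank_1 demo/rank_2
printf 'init ...\nlast ocean line\n' > demo/rank_1/STDOUT.0000
printf 'init ...\nlast atmos line\n' > demo/rank_2/STDOUT.0000
tail -n 1 demo/rank_1/STDOUT.0000 demo/rank_2/STDOUT.0000 | tee demo/tails.txt
```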
> > > > >
> > > > > Jean-Michel
> > > > >
> > > > > On Tue, Oct 09, 2012 at 02:35:48PM -0230, taimaz.bahadory wrote:
> > > > > > Thanks for your complete reply;
> > > > > >
> > > > > > 1) Yes, I have. I have run the "aim.5l_LatLon" experiment before with
> > > > > > MPI enabled (40 CPUs) without any problem.
> > > > > > 2) I had found that out before, and disabled the whole section that
> > > > > > finds the optfile, replacing it with mine, which refers to a modified
> > > > > > "linux_amd64_gfortran" opt-file that points to my correct MPI and
> > > > > > netCDF paths (I used it for my previous runs too, with no error).
> > > > > > 3) Here is the only thing printed on my screen after running
> > > > > > "./run_cpl_test 3":
> > > > > >
> > > > > >    /home/tbahador/programs/MITgcm/verification/cpl_aim+ocn/run/tt
> > > > > >    execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > > > > >
> > > > > > and then it freezes there.
> > > > > > But when I check the three "rank" directories, there are mnc_*
> > > > > > directories and some other output files generated there, which shows
> > > > > > that they were initially created, but then none of them is updated
> > > > > > any further! This is where I'm stuck.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 9, 2012 at 12:21 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > > > >
> > > > > > > Hi Taimaz,
> > > > > > >
> > > > > > > The coupled set-up is used by several users on different
> > > > > > > platforms, so we should find a way for you to run it.
> > > > > > > But the script "run_cpl_test" in verification/cpl_aim+ocn/
> > > > > > > has not been used that much (plus it pre-dates some changes
> > > > > > > in genmake2) and could have been better written.
> > > > > > >
> > > > > > > So, we will need to check each step to see where the problem is.
> > > > > > >
> > > > > > > 1) Have you tried to run a simple (i.e., not coupled) verification
> > > > > > >   experiment using MPI? This would confirm that the libraries and
> > > > > > >   mpirun are working well on your platform.
> > > > > > >
> > > > > > > 2) We need to check which optfile is being used (run_cpl_test is not
> > > > > > >   well written regarding this optfile selection, and it expects an
> > > > > > >   optfile "*+mpi" in the verification directory!).
> > > > > > >   The "run_cpl_test 2" command should report it as:
> > > > > > >   >  Using optfile: OPTFILE_NAME (compiler=COMPILER_NAME)
> > > > > > >   It might also be useful to send the first 100 lines of
> > > > > > >   build_atm/Makefile, just to check.
> > > > > > >
> > > > > > > 3) We need to check whether run_cpl_test recognizes an OpenMPI build
> > > > > > >   and proceeds with the right command.
> > > > > > >   Could you send all the printed information that "run_cpl_test 3"
> > > > > > >   produces? The command should be printed as:
> > > > > > >   > execute 'mpirun ...
> > > > > > >   In my case, a successful run using OpenMPI gives me:
> > > > > > > > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > > > > > > >  MITCPLR_init1:            2  UV-Atmos MPI_Comm_create MPI_COMM_compcplr=           6  ierr=           0
> > > > > > > >  MITCPLR_init1:            2  UV-Atmos component num=           2  MPI_COMM=           5           6
> > > > > > > >  MITCPLR_init1:            2  UV-Atmos Rank/Size =            1  /     2
> > > > > > > >  MITCPLR_init1:            1  UV-Ocean Rank/Size =            1  /     2
> > > > > > > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > > > > > > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > > > > > > >  ROmap:    1  599  598 0.100280
> > > > > > > >  ROmap: 3644 4402 4403 0.169626
> > > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.         0
> > > > > > > >   Importing (pid=    0 ) oceanic fields at iteration         0
> > > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.         8
> > > > > > > >   Importing (pid=    0 ) oceanic fields at iteration         8
> > > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        16
> > > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        16
> > > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        24
> > > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        24
> > > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        32
> > > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        32
> > > > > > > > STOP NORMAL END
> > > > > > > > STOP NORMAL END
> > > > > > >
> > > > > > > Once all these steps have been checked and are OK, we can start to
> > > > > > > dig into the coupling log files.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jean-Michel
> > > > > > >
> > > > > > > On Wed, Oct 03, 2012 at 04:21:49PM -0230, taimaz.bahadory wrote:
> > > > > > > > There is a "stdout" file generated in the main directory of run,
> > > with
> > > > > > > these
> > > > > > > > contents:
> > > > > > > >
> > > > > > > > ***********************************************************************************************************************************
> > > > > > > > CMA: unable to get RDMA device list
> > > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > > librdmacm: assuming: 4
> > > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > > librdmacm: assuming: 4
> > > > > > > > CMA: unable to get RDMA device list
> > > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > > librdmacm: assuming: 4
> > > > > > > > CMA: unable to get RDMA device list
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------------
> > > > > > > > [[9900,1],2]: A high-performance Open MPI point-to-point messaging
> > > > > > > > module was unable to find any relevant network interfaces:
> > > > > > > >
> > > > > > > > Module: OpenFabrics (openib)
> > > > > > > >   Host: glacdyn
> > > > > > > >
> > > > > > > > Another transport will be used instead, although this may result in
> > > > > > > > lower performance.
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------------
> > > > > > > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > > > > > > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > > > > > > >  ROmap:    1  599  598 0.100280
> > > > > > > >  ROmap: 3644 4402 4403 0.169626
> > > > > > > > [glacdyn:04864] 2 more processes have sent help message
> > > > > > > > help-mpi-btl-base.txt / btl:no-nics
> > > > > > > > [glacdyn:04864] Set MCA parameter "orte_base_help_aggregate" to 0
> > > > > > > > to see all help / error messages
> > > > > > > >
> > > > > > > > ***********************************************************************************************************************************
> > > > > > > >
> > > > > > > > Maybe there is some relation between these error-like messages
> > > > > > > > and the run getting stuck!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Oct 3, 2012 at 2:13 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > > > > >
> > > > > > > > > Re-Hi;
> > > > > > > > >
> > > > > > > > > Yes; as I said, it got stuck again. I checked the CPU: it is
> > > > > > > > > fully loaded, but the output file is not updated! It is only a
> > > > > > > > > few seconds younger than the run initiation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Oct 3, 2012 at 1:32 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > > > > > >
> > > > > > > > >> Hi;
> > > > > > > > >>
> > > > > > > > >> I guess I've tried that too, but the same problem occurred (I
> > > > > > > > >> will try it again right now to check).
> > > > > > > > >> Will report soon.
> > > > > > > > >> Thanks
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Wed, Oct 3, 2012 at 1:29 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hi Taimaz,
> > > > > > > > >>>
> > > > > > > > >>> Can you try without MNC? The current set-up (cpl_aim+ocn) does
> > > > > > > > >>> not use MNC ("useMNC=.TRUE.," is commented out in both
> > > > > > > > >>> input_atm/data.pkg and input_ocn/data.pkg), so if there was a
> > > > > > > > >>> problem in the coupled set-up code with NetCDF output, I might
> > > > > > > > >>> not have seen it (since I have not tried with it recently).
> > > > > > > > >>>
> > > > > > > > >>> Cheers,
> > > > > > > > >>> Jean-Michel
> > > > > > > > >>>
> > > > > > > > >>> On Tue, Sep 25, 2012 at 11:54:59AM -0230, taimaz.bahadory
> > > wrote:
> > > > > > > > >>> > Hi everybody;
> > > > > > > > >>> >
> > > > > > > > >>> > I'm trying to run the coupled model example (cpl_aim+ocn) in
> > > > > > > > >>> > the verification directory. All of the first three steps
> > > > > > > > >>> > (cleaning; compiling and making; copying input files) pass
> > > > > > > > >>> > with no error; but when I run the coupler, it starts and
> > > > > > > > >>> > initially creates the netCDF output files, but then stops
> > > > > > > > >>> > updating them and the other output files, although the three
> > > > > > > > >>> > "mitgcmuv" processes are still running! It's like the program
> > > > > > > > >>> > is frozen.
> > > > > > > > >>> > Has anybody been stuck in such a situation?
> > > > > > > > >>>
> > > > > > > > >>> > _______________________________________________
> > > > > > > > >>> > MITgcm-support mailing list
> > > > > > > > >>> > MITgcm-support at mitgcm.org
> > > > > > > > >>> > http://mitgcm.org/mailman/listinfo/mitgcm-support
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> 
> 
> 
> 
> 
> 
> 


