[MITgcm-support] Coupled model running!

Jean-Michel Campin jmc at ocean.mit.edu
Wed Oct 10 14:06:12 EDT 2012


Hi Taimaz,

What I found curious (and unexplained) is that:
- when trying to run for 5/40 ocean/atmos time steps, the atmosphere (rank_2)
  already starts the main iteration loop before spinning:
> > > (PID.TID 0000.0001) SOLVE_FOR_PRESSURE: putPmEinXvector =    F
> > >  cg2d: Sum(rhs),rhsMax =   3.97903932025656E-12  4.56406615217961E+03
> > >  cg2d: Sum(rhs),rhsMax =  -1.00612851383630E-11  9.58133938764946E+03
  whereas the ocean does not seem to reach this stage (it is still in the initialisation part);
- but when trying to run for zero iterations (your last test), the ocean finishes,
  while, apparently, the atmosphere does not reach the end of the initialisation.

I am presently looking at how to add a "call flush" to help with debugging.
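
For reference, here is one untested alternative (only a sketch, assuming the
gfortran + Open MPI combination used elsewhere in this thread): ask the
gfortran run-time to leave the pre-connected output unit unbuffered, instead
of editing the source:

   # request unbuffered stdout/stderr from the gfortran run-time library
   export GFORTRAN_UNBUFFERED_PRECONNECTED=y
   # then launch the same MPMD command as before
   mpirun -np 1 ./build_cpl/mitgcmuv : \
          -np 1 ./build_ocn/mitgcmuv : \
          -np 1 ./build_atm/mitgcmuv

For a single-node run the exported variable should be inherited by the ranks;
across several nodes "mpirun -x GFORTRAN_UNBUFFERED_PRECONNECTED ..." would be
needed to export it explicitly. Whether this actually helps here has not been
verified.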

Cheers,
Jean-Michel

On Wed, Oct 10, 2012 at 02:14:08PM -0230, taimaz.bahadory wrote:
> Hi Jean;
> 
> Nothing has been changed other than the netCDF options. I have also tried running
> without netCDF enabled, but the same problem occurred.
> It is still stuck. I have attached those three files again, plus the
> "Coupler.0000.clog" file from the "rank_0" directory.
> The program is still running (consuming CPUs), but these files are no longer
> being updated, nor are any other files.
> I can even send the whole directory if you wish!
> 
> 
> 
> 
> On Wed, Oct 10, 2012 at 1:07 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> 
> > Hi Taimaz,
> >
> > Apart from turning on "useMNC=.TRUE.,", did you change any other
> > parameters/files from the standard set-up in verification/cpl_aim+ocn ?
> >
> > And one thing you could try (since the atmosphere seems to get to the first
> > iteration, but it is hard to tell anything about the ocean, since its silence
> > could simply be due to the buffer not being flushed often enough) would be to
> > run for zero iterations, changing (see the example below):
> > rank_0/data, first line, replacing 5 with 0
> > rank_1/data, replacing nTimeSteps=5,  with nTimeSteps=0,
> > rank_2/data, replacing nTimeSteps=40, with nTimeSteps=0,
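> >
> > For example (just a sketch with GNU sed, assuming the rank_0/1/2 directories
> > created by run_cpl_test; editing the files by hand is of course equivalent):
> >
> >   sed -i '1s/5/0/' rank_0/data
> >   sed -i 's/nTimeSteps=5,/nTimeSteps=0,/'  rank_1/data
> >   sed -i 's/nTimeSteps=40,/nTimeSteps=0,/' rank_2/data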
> >
> > Cheers,
> > Jean-Michel
> >
> > On Tue, Oct 09, 2012 at 06:11:27PM -0230, taimaz.bahadory wrote:
> > > Here is the "std_outp" contents:
> > >
> > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > >  ROmap:    1  599  598 0.100280
> > >  ROmap: 3644 4402 4403 0.169626
> > >
> > >
> > > and the last lines of two STDOUT files:
> > >
> > > rank_1:
> > > ...
> > > ...
> > >
> > > ++++++++++++++++++++++vtrv++++++....++++++++++++++++++++++++yxxyyyyy....q++py++zyxzzzzzyxxxwsuwwxxxx++++....+++++++++++++++++++++++yyxyzzzyy....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > (PID.TID 0000.0001)      4
> > >
> > > ..++++++++++++++++++++++yyy+++++++....++++++++++++++++++++++++yyyzzzzz....snnny++zyyyyzzywvtuvtvwwxxwx+++y....++++++++++++++++++w+++xxyyzzzzzz....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > (PID.TID 0000.0001)      3
> > >
> > > ..++++++++++++++++++++++++++++++++....+++++++++++++++++++++++++++++++z....tmnvz+zzzyxxxxwqpnotsuvwyxwy++++....y++++++++++++++++wv+++zyzzzzzyz+....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > (PID.TID 0000.0001)      2
> > >
> > > ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....unny+zzzzzxxwwrnmmmqsuvvxxw+++++....y++++++++++++++++w++++zyyz+zyyy+....++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> > > (PID.TID 0000.0001)      1
> > >
> > > ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....uooy+yyzzyxwwqoomlmqtvutwyy+++++....y++++++++++++++++y++++zyyyzzzzz+....++++++++++++++++++++++++++++++++....++++++++++++++++++y+++++++++++++..
> > >
> > >
> > > rank_2:
> > > ...
> > > ...
> > > (PID.TID 0000.0001) // Model current state
> > > (PID.TID 0000.0001) //
> > > =======================================================
> > > (PID.TID 0000.0001)
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: albedo_cs32.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: vegetFrc.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> > > (PID.TID 0000.0001) SOLVE_FOR_PRESSURE: putPmEinXvector =    F
> > >  cg2d: Sum(rhs),rhsMax =   3.97903932025656E-12  4.56406615217961E+03
> > >  cg2d: Sum(rhs),rhsMax =  -1.00612851383630E-11  9.58133938764946E+03
> > >
> > >
> > >
> > > Thanks again for your time
> > >
> > >
> > >
> > > On Tue, Oct 9, 2012 at 5:01 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > >
> > > > Hi Taimaz,
> > > >
> > > > Looks like everything is normal. The next thing to check is:
> > > >
> > > > Could you check and send the content of file:
> > > >   cpl_aim+ocn/std_out
> > > > which is the standard output of mpirun ?
> > > >
> > > > And can you also check the last part of ocean & atmos STDOUT:
> > > >  rank_1/STDOUT.0000
> > > >  rank_2/STDOUT.0000
> > > > to see where each component is stuck ?
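> > > >
> > > > (For instance, something like
> > > >    tail -n 30 rank_1/STDOUT.0000
> > > >    tail -n 30 rank_2/STDOUT.0000
> > > > would show the last lines that each component managed to write.)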
> > > >
> > > > Jean-Michel
> > > >
> > > > On Tue, Oct 09, 2012 at 02:35:48PM -0230, taimaz.bahadory wrote:
> > > > > Thanks for your complete reply;
> > > > >
> > > > > 1) Yes, I did. I have run "aim.5l_LatLon" before with MPI enabled (40
> > > > > CPUs) without any problem.
> > > > > 2) I had already found that out, so I disabled the whole optfile-selection
> > > > > section and replaced it with my own, which refers to a modified
> > > > > "linux_amd64_gfortran" optfile that points to my correct MPI and netCDF
> > > > > paths (I used it for my previous runs too, with no error).
> > > > > 3) Here is the only thing printed on my screen after running
> > > > > "./run_cpl_test 3":
> > > > >
> > > > >    /home/tbahador/programs/MITgcm/verification/cpl_aim+ocn/run/tt
> > > > >    execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > > > >
> > > > > and then it freezes there.
> > > > > But when I check the three "rank" directories, there are mnc_* directories
> > > > > and some other output files generated there, which shows that they are
> > > > > initially created when the run starts, but none of them is updated any
> > > > > further! This is where I am stuck.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Oct 9, 2012 at 12:21 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > > >
> > > > > > Hi Taimaz,
> > > > > >
> > > > > > The coupled set-up is used by several users on different
> > > > > > platforms, so we should find a way for you to run it.
> > > > > > But the script "run_cpl_test" in verification/cpl_aim+ocn/
> > > > > > has not been used so much (plus it pre-dates some changes
> > > > > > in genmake2) and could have been better written.
> > > > > >
> > > > > > So, we will need to check each step to see where the problem is.
> > > > > >
> > > > > > 1) Have you tried to run a simple (i.e., not coupled) verification
> > > > > >   experiment using MPI ? This would confirm that the libraries and mpirun
> > > > > >   are working well on your platform.
> > > > > >
> > > > > > 2) We need to check which optfile is being used (run_cpl_test is not
> > > > > >   well written regarding this optfile selection, and it expects an
> > > > > >   optfile "*+mpi" in the verification directory !).
> > > > > >   The "run_cpl_test 2" command should report it as:
> > > > > >   >  Using optfile: OPTFILE_NAME (compiler=COMPILER_NAME)
> > > > > >   It might also be useful to send the first 100 lines of build_atm/Makefile,
> > > > > >   just to check.
> > > > > >
> > > > > > 3) We need to check whether run_cpl_test recognizes an OpenMPI build and
> > > > > >   proceeds with the right command.
> > > > > >   Could you send all the printed information that "run_cpl_test 3"
> > > > > >   produces ? The command should be printed as:
> > > > > >   > execute 'mpirun ...
> > > > > >   In my case, a successful run using OpenMPI gives me:
> > > > > > > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > > > > > >  MITCPLR_init1:            2  UV-Atmos MPI_Comm_create MPI_COMM_compcplr=           6  ierr=           0
> > > > > > >  MITCPLR_init1:            2  UV-Atmos component num=           2  MPI_COMM=           5           6
> > > > > > >  MITCPLR_init1:            2  UV-Atmos Rank/Size =            1  /     2
> > > > > > >  MITCPLR_init1:            1  UV-Ocean Rank/Size =            1  /     2
> > > > > > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > > > > > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > > > > > >  ROmap:    1  599  598 0.100280
> > > > > > >  ROmap: 3644 4402 4403 0.169626
> > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.         0
> > > > > > >   Importing (pid=    0 ) oceanic fields at iteration         0
> > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.         8
> > > > > > >   Importing (pid=    0 ) oceanic fields at iteration         8
> > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        16
> > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        16
> > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        24
> > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        24
> > > > > > >   Exporting (pid=    0 ) atmospheric fluxes at iter.        32
> > > > > > >   Importing (pid=    0 ) oceanic fields at iteration        32
> > > > > > > STOP NORMAL END
> > > > > > > STOP NORMAL END
> > > > > >
> > > > > > Once all these steps are checked and are OK, we can start to dig into
> > > > > > the coupling log files.
> > > > > >
> > > > > > Cheers,
> > > > > > Jean-Michel
> > > > > >
> > > > > > On Wed, Oct 03, 2012 at 04:21:49PM -0230, taimaz.bahadory wrote:
> > > > > > > There is a "stdout" file generated in the main run directory, with
> > > > > > > these contents:
> > > > > > >
> > > > > > >
> > > > > > > ***********************************************************************************************************************************
> > > > > > > CMA: unable to get RDMA device list
> > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > librdmacm: assuming: 4
> > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > librdmacm: assuming: 4
> > > > > > > CMA: unable to get RDMA device list
> > > > > > > librdmacm: couldn't read ABI version.
> > > > > > > librdmacm: assuming: 4
> > > > > > > CMA: unable to get RDMA device list
> > > > > > >
> > > > > > > --------------------------------------------------------------------------
> > > > > > > [[9900,1],2]: A high-performance Open MPI point-to-point messaging module
> > > > > > > was unable to find any relevant network interfaces:
> > > > > > >
> > > > > > > Module: OpenFabrics (openib)
> > > > > > >   Host: glacdyn
> > > > > > >
> > > > > > > Another transport will be used instead, although this may result in
> > > > > > > lower performance.
> > > > > > >
> > > > > > > --------------------------------------------------------------------------
> > > > > > >  CPL_READ_PARAMS: nCouplingSteps=           5
> > > > > > >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> > > > > > >  ROmap:    1  599  598 0.100280
> > > > > > >  ROmap: 3644 4402 4403 0.169626
> > > > > > > [glacdyn:04864] 2 more processes have sent help message
> > > > > > > help-mpi-btl-base.txt / btl:no-nics
> > > > > > > [glacdyn:04864] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> > > > > > > all help / error messages
> > > > > > >
> > > > > > > ***********************************************************************************************************************************
> > > > > > >
> > > > > > > Maybe these error-like messages are related to the run getting stuck!
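> > > > > > >
> > > > > > > (For reference, the suppressed Open MPI messages mentioned above could be
> > > > > > > shown by re-running with that MCA parameter set, e.g. something like
> > > > > > >   mpirun --mca orte_base_help_aggregate 0 \
> > > > > > >          -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv
> > > > > > > though this has not been tried here.)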
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Oct 3, 2012 at 2:13 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > > > >
> > > > > > > > Re-Hi;
> > > > > > > >
> > > > > > > > Yes; as I said, it got stuck again. I checked the CPU: it is fully
> > > > > > > > loaded, but the output file is not updated! It is only a few seconds
> > > > > > > > younger than the run initiation.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Oct 3, 2012 at 1:32 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > > > > >
> > > > > > > >> Hi;
> > > > > > > >>
> > > > > > > >> I think I have tried that too, but the same problem occurred (I will
> > > > > > > >> try it again right now to check).
> > > > > > > >> Will report soon
> > > > > > > >> Thanks
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Wed, Oct 3, 2012 at 1:29 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > > > > > >>
> > > > > > > >>> Hi Taimaz,
> > > > > > > >>>
> > > > > > > >>> Can you try without MNC ? The current set-up (cpl_aim+ocn) does not
> > > > > > > >>> use MNC (useMNC=.TRUE., is commented out in both input_atm/data.pkg
> > > > > > > >>> and input_ocn/data.pkg), so if there were a problem with NetCDF output
> > > > > > > >>> in the coupled set-up code, I might not have seen it (since I did not
> > > > > > > >>> try with it recently).
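> > > > > > > >>>
> > > > > > > >>> (A quick way to see which components currently have MNC switched on,
> > > > > > > >>> assuming the data.pkg files end up in the run directories like the
> > > > > > > >>> other input files, would be something like
> > > > > > > >>>   grep -i useMNC rank_1/data.pkg rank_2/data.pkg
> > > > > > > >>> and checking whether any "useMNC=.TRUE.," line is left uncommented.)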
> > > > > > > >>>
> > > > > > > >>> Cheers,
> > > > > > > >>> Jean-Michel
> > > > > > > >>>
> > > > > > > >>> On Tue, Sep 25, 2012 at 11:54:59AM -0230, taimaz.bahadory wrote:
> > > > > > > >>> > Hi everybody;
> > > > > > > >>> >
> > > > > > > >>> > I'm trying to run the coupled model example (cpl_aim+ocn) in the
> > > > > > > >>> > verification directory. The first three steps (cleaning; compiling
> > > > > > > >>> > and making; copying input files) pass with no error; but when I run
> > > > > > > >>> > the coupler, it starts and initially creates the netCDF output files,
> > > > > > > >>> > but then stops updating them as well as the other output files,
> > > > > > > >>> > although the three "mitgcmuv" executables are still running! It is as
> > > > > > > >>> > if the program were frozen.
> > > > > > > >>> > Has anybody been stuck in such a situation?
> > > > > > > >>>





> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support



