[MITgcm-support] Coupled model running!

taimaz.bahadory taimaz.bahadory at mun.ca
Tue Oct 9 13:05:48 EDT 2012


Thanks for your complete reply;

1) Yes, I have. I have run "aim.5l_LatLon" before with MPI enabled (40
CPUs) without any problem.
2) I had already found that out, so I disabled the whole optfile-detection
section of the script and replaced it with my own, which refers to a
modified "linux_amd64_gfortran" opt-file pointing to the correct MPI and
netCDF paths on my machine (I used the same opt-file for my previous runs
with no error; a sketch follows below).
3) Here is the only thing printed on my screen after running
"./run_cpl_test 3":

   /home/tbahador/programs/MITgcm/verification/cpl_aim+ocn/run/tt
   execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv
: -np 1 ./build_atm/mitgcmuv' :

and there it freezes.
But when I check the three "rank" directories, mnc_* directories and
some other output files have been generated there, which shows that they
are created at start-up (presumably by MPI), but none of them is ever
updated afterwards! This is where I'm stuck.
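For reference, a minimal sketch of the kind of change described in item 2
above. Opt-files are shell fragments sourced by genmake2, and the variable
names below follow the usual opt-file conventions, but the paths are
placeholders, not actual installation paths:

    # hypothetical excerpt from a modified linux_amd64_gfortran opt-file;
    # the install paths must be adapted to the local MPI and netCDF builds
    if test "x$MPI" = xtrue ; then
        FC=mpif77                                   # MPI compiler wrapper
        INCLUDES="$INCLUDES -I/opt/openmpi/include"
        LIBS="$LIBS -L/opt/openmpi/lib"
    fi
    INCLUDES="$INCLUDES -I/opt/netcdf/include"
    LIBS="$LIBS -L/opt/netcdf/lib -lnetcdf"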
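When an MPI run hangs like this, one generic way to see where each rank is
stuck is to attach a debugger to the running processes (a sketch only, not
MITgcm-specific; <PID> is a placeholder):

    # find the PIDs of the three hanging executables
    ps -ef | grep mitgcmuv
    # attach gdb to one rank, print a backtrace, then detach
    gdb -p <PID> -batch -ex bt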





On Tue, Oct 9, 2012 at 12:21 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:

> Hi Taimaz,
>
> The coupled set-up is used by several users on different
> platforms, so we should find a way for you to run it.
> But regarding the script "run_cpl_test" in verification/cpl_aim+ocn/,
> it has not been used much (plus it pre-dates some changes
> in genmake2) and could have been better written.
>
> So we will need to check each step to see where the problem is.
>
> 1) Have you tried to run a simple (i.e., not coupled) verification
>   experiment using MPI? This would confirm that the libs and mpirun
>   are working well on your platform.
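For example, something along these lines (a sketch only; the exact -mods
directory, optfile, and process count must match the experiment's SIZE.h):

    cd verification/aim.5l_LatLon/build
    ../../../tools/genmake2 -mpi -mods ../code \
        -of ../../../tools/build_options/linux_amd64_gfortran
    make depend && make
    cd ../run && ln -s ../input/* . && cp ../build/mitgcmuv .
    mpirun -np 2 ./mitgcmuv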
>
> 2) We need to check which optfile is being used (run_cpl_test is not
>   well written regarding this optfile selection, and it expects an
>   optfile "*+mpi" in the verification directory!).
>   The "run_cpl_test 2" command should report it as:
>   >  Using optfile: OPTFILE_NAME (compiler=COMPILER_NAME)
>   It might also be useful to send the first 100 lines of
>   build_atm/Makefile, just to check.
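One hypothetical way to satisfy that "*+mpi" expectation is simply to
place a copy of the working opt-file under such a name (assuming the
script picks it up by that pattern):

    # run_cpl_test looks for an optfile named "*+mpi" under verification/
    cp tools/build_options/linux_amd64_gfortran \
       verification/linux_amd64_gfortran+mpi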
>
> 3) We need to check whether run_cpl_test recognizes an OpenMPI build
>   and proceeds with the right command.
>   Could you send all the output that "run_cpl_test 3" produces?
>   The command should be printed as:
>   > execute 'mpirun ...
>   In my case, a successful run using OpenMPI gives me:
> > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv
> : -np 1 ./build_atm/mitgcmuv' :
> >  MITCPLR_init1:            2  UV-Atmos MPI_Comm_create
> MPI_COMM_compcplr=           6  ierr=           0
> >  MITCPLR_init1:            2  UV-Atmos component num=           2
>  MPI_COMM=           5           6
> >  MITCPLR_init1:            2  UV-Atmos Rank/Size =            1  /
>     2
> >  MITCPLR_init1:            1  UV-Ocean Rank/Size =            1  /
>     2
> >  CPL_READ_PARAMS: nCouplingSteps=           5
> >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> >  ROmap:    1  599  598 0.100280
> >  ROmap: 3644 4402 4403 0.169626
> >   Exporting (pid=    0 ) atmospheric fluxes at iter.         0
> >   Importing (pid=    0 ) oceanic fields at iteration         0
> >   Exporting (pid=    0 ) atmospheric fluxes at iter.         8
> >   Importing (pid=    0 ) oceanic fields at iteration         8
> >   Exporting (pid=    0 ) atmospheric fluxes at iter.        16
> >   Importing (pid=    0 ) oceanic fields at iteration        16
> >   Exporting (pid=    0 ) atmospheric fluxes at iter.        24
> >   Importing (pid=    0 ) oceanic fields at iteration        24
> >   Exporting (pid=    0 ) atmospheric fluxes at iter.        32
> >   Importing (pid=    0 ) oceanic fields at iteration        32
> > STOP NORMAL END
> > STOP NORMAL END
>
> Once all these steps are checked and OK, we can start to dig into
> the coupling log files.
>
> Cheers,
> Jean-Michel
>
> On Wed, Oct 03, 2012 at 04:21:49PM -0230, taimaz.bahadory wrote:
> > There is a "stdout" file generated in the main run directory, with
> > these contents:
> >
> >
> > ***********************************************************************
> > CMA: unable to get RDMA device list
> > librdmacm: couldn't read ABI version.
> > librdmacm: assuming: 4
> > librdmacm: couldn't read ABI version.
> > librdmacm: assuming: 4
> > CMA: unable to get RDMA device list
> > librdmacm: couldn't read ABI version.
> > librdmacm: assuming: 4
> > CMA: unable to get RDMA device list
> >
> > ------------------------------------------------------------------------
> > [[9900,1],2]: A high-performance Open MPI point-to-point messaging module
> > was unable to find any relevant network interfaces:
> >
> > Module: OpenFabrics (openib)
> >   Host: glacdyn
> >
> > Another transport will be used instead, although this may result in
> > lower performance.
> >
> > ------------------------------------------------------------------------
> >  CPL_READ_PARAMS: nCouplingSteps=           5
> >  runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap=  3644
> >  ROmap:    1  599  598 0.100280
> >  ROmap: 3644 4402 4403 0.169626
> > [glacdyn:04864] 2 more processes have sent help message
> > help-mpi-btl-base.txt / btl:no-nics
> > [glacdyn:04864] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> > all help / error messages
> >
> > ***********************************************************************
> >
> > Maybe there is some relation between these error-like messages and
> > the run getting stuck!
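Those messages come from Open MPI failing to find an InfiniBand (openib)
interface; by the warning's own wording another transport is used instead,
so they may be unrelated to the hang. To rule them out, the openib
transport can be disabled explicitly with standard MCA options (a sketch,
adapting the mpirun line that run_cpl_test prints):

    # force Open MPI to skip the OpenFabrics (openib) transport;
    # setting orte_base_help_aggregate to 0 shows every help message
    mpirun --mca btl ^openib --mca orte_base_help_aggregate 0 \
        -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv \
        : -np 1 ./build_atm/mitgcmuv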
> >
> >
> >
> >
> >
> > On Wed, Oct 3, 2012 at 2:13 PM, taimaz.bahadory
> > <taimaz.bahadory at mun.ca> wrote:
> >
> > > Re-Hi;
> > >
> > > Yes; as I said, it got stuck again. I checked the CPU: it is fully
> > > loaded, but the output file is not updated! It is only a few seconds
> > > younger than the start of the run.
> > >
> > >
> > >
> > > On Wed, Oct 3, 2012 at 1:32 PM, taimaz.bahadory
> > > <taimaz.bahadory at mun.ca> wrote:
> > >
> > >> Hi;
> > >>
> > >> I think I've tried that too, but the same problem occurred (I will
> > >> try it again right now to double-check).
> > >> Will report soon.
> > >> Thanks
> > >>
> > >>
> > >>
> > >> On Wed, Oct 3, 2012 at 1:29 PM, Jean-Michel Campin
> > >> <jmc at ocean.mit.edu> wrote:
> > >>
> > >>> Hi Taimaz,
> > >>>
> > >>> Can you try without MNC? The current set-up (cpl_aim+ocn) does not
> > >>> use MNC (useMNC=.TRUE. is commented out in both input_atm/data.pkg
> > >>> and input_ocn/data.pkg), so if there is a problem in the coupled
> > >>> set-up code with NetCDF output, I might not have seen it (since I
> > >>> have not tried with it recently).
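A quick way to confirm how MNC is set in both components (a "#" at the
start of a data.pkg line comments it out; exact file contents may differ):

    # run from verification/cpl_aim+ocn/
    grep -n useMNC input_atm/data.pkg input_ocn/data.pkg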
> > >>>
> > >>> Cheers,
> > >>> Jean-Michel
> > >>>
> > >>> On Tue, Sep 25, 2012 at 11:54:59AM -0230, taimaz.bahadory wrote:
> > >>> > Hi everybody;
> > >>> >
> > >>> > I'm trying to run the coupled model example (cpl_aim+ocn) in the
> > >>> > verification directory. The first three steps (cleaning; compiling
> > >>> > and making; copying input files) all passed with no error; but
> > >>> > when I run the coupler, it creates the netCDF output files
> > >>> > initially, then stops updating them and the other output files,
> > >>> > although the three "mitgcmuv" executables are still running! It's
> > >>> > as if the program were frozen.
> > >>> > Has anybody been stuck in such a situation?