[MITgcm-support] Coupled model running!
Jean-Michel Campin
jmc at ocean.mit.edu
Tue Oct 9 15:31:01 EDT 2012
Hi Taimaz,
Looks like everything is normal. The next things to check:
Could you check and send the content of the file:
cpl_aim+ocn/std_out
which is the standard output of mpirun?
And can you also check the last part of the ocean & atmos STDOUT files:
rank_1/STDOUT.0000
rank_2/STDOUT.0000
to see where each component is stuck?
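For example, something like this (assuming the rank_* directories sit
alongside std_out in your run directory) would show where each one stopped:
  tail -n 30 cpl_aim+ocn/std_out
  tail -n 30 rank_1/STDOUT.0000
  tail -n 30 rank_2/STDOUT.0000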
Jean-Michel
On Tue, Oct 09, 2012 at 02:35:48PM -0230, taimaz.bahadory wrote:
> Thanks for your complete reply;
>
> 1) Yes, I did. I have run "aim.5l_LatLon" before with MPI enabled (40
> CPUs) without any problem.
> 2) I had found that out before, so I disabled the whole optfile-detection
> section and replaced it with my own, which points to a modified
> "linux_amd64_gfortran" optfile containing my correct MPI and netCDF
> paths (I used it for my previous runs too, with no errors).
> 3) Here is the only thing printed on my screen after running
> "./run_cpl_test 3":
>
> /home/tbahador/programs/MITgcm/verification/cpl_aim+ocn/run/tt
> execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv
> : -np 1 ./build_atm/mitgcmuv' :
>
> and it freezes there.
> But when I check the three "rank" directories, there are mnc_* directories
> and some other output files in them, which shows that they were created at
> start-up, but none of them is ever updated afterwards! This is where I'm
> stuck.
>
>
>
>
>
> On Tue, Oct 9, 2012 at 12:21 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>
> > Hi Taimaz,
> >
> > The coupled set-up is used by several users on different
> > platforms, so we should find a way for you to run it.
> > But regarding the script "run_cpl_test" in verification/cpl_aim+ocn/,
> > it has not been used much (plus it pre-dates some changes
> > in genmake2) and could have been better written.
> >
> > So, we will need to check each step to see where the problem is.
> >
> > 1) Have you tried to run a simple (i.e., not coupled) verification
> > experiment using MPI? This would confirm that the libraries and mpirun
> > are working well on your platform.
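> > (For reference, the usual stand-alone MPI build sequence is something
> > like the following, run from the experiment's build directory; the
> > paths and optfile name here are only indicative:
> >   ../../../tools/genmake2 -mpi -mods ../code -of YOUR_OPTFILE
> >   make depend
> >   make
> > and then the resulting mitgcmuv is launched with mpirun.)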
> >
> > 2) We need to check which optfile is being used (run_cpl_test is not
> > well written regarding optfile selection, and it expects an
> > optfile "*+mpi" in the verification directory!).
> > The "run_cpl_test 2" command should report it as:
> > > Using optfile: OPTFILE_NAME (compiler=COMPILER_NAME)
> > It might also be useful to send the first 100 lines of build_atm/Makefile,
> > just to check.
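> > For instance, something like this (just a suggestion; adjust the path
> > if your build directory differs) would capture them in a file, called
> > Makefile_head.txt here only for the sake of example:
> >   head -n 100 build_atm/Makefile > Makefile_head.txt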
> >
> > 3) We need to check whether run_cpl_test recognizes an OpenMPI build and
> > proceeds with the right command.
> > Could you send all the output that "run_cpl_test 3"
> > produces? The command should be printed as:
> > > execute 'mpirun ...
> > In my case, a successful run using OpenMPI gives me:
> > > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv
> > : -np 1 ./build_atm/mitgcmuv' :
> > > MITCPLR_init1: 2 UV-Atmos MPI_Comm_create
> > MPI_COMM_compcplr= 6 ierr= 0
> > > MITCPLR_init1: 2 UV-Atmos component num= 2
> > MPI_COMM= 5 6
> > > MITCPLR_init1: 2 UV-Atmos Rank/Size = 1 /
> > 2
> > > MITCPLR_init1: 1 UV-Ocean Rank/Size = 1 /
> > 2
> > > CPL_READ_PARAMS: nCouplingSteps= 5
> > > runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap= 3644
> > > ROmap: 1 599 598 0.100280
> > > ROmap: 3644 4402 4403 0.169626
> > > Exporting (pid= 0 ) atmospheric fluxes at iter. 0
> > > Importing (pid= 0 ) oceanic fields at iteration 0
> > > Exporting (pid= 0 ) atmospheric fluxes at iter. 8
> > > Importing (pid= 0 ) oceanic fields at iteration 8
> > > Exporting (pid= 0 ) atmospheric fluxes at iter. 16
> > > Importing (pid= 0 ) oceanic fields at iteration 16
> > > Exporting (pid= 0 ) atmospheric fluxes at iter. 24
> > > Importing (pid= 0 ) oceanic fields at iteration 24
> > > Exporting (pid= 0 ) atmospheric fluxes at iter. 32
> > > Importing (pid= 0 ) oceanic fields at iteration 32
> > > STOP NORMAL END
> > > STOP NORMAL END
> >
> > Once all these steps are checked and are OK, we can start to dig into
> > the coupling log files.
> >
> > Cheers,
> > Jean-Michel
> >
> > On Wed, Oct 03, 2012 at 04:21:49PM -0230, taimaz.bahadory wrote:
> > > There is a "stdout" file generated in the main run directory, with
> > > these contents:
> > >
> > >
> > ***********************************************************************************************************************************
> > > CMA: unable to get RDMA device list
> > > librdmacm: couldn't read ABI version.
> > > librdmacm: assuming: 4
> > > librdmacm: couldn't read ABI version.
> > > librdmacm: assuming: 4
> > > CMA: unable to get RDMA device list
> > > librdmacm: couldn't read ABI version.
> > > librdmacm: assuming: 4
> > > CMA: unable to get RDMA device list
> > >
> > --------------------------------------------------------------------------
> > > [[9900,1],2]: A high-performance Open MPI point-to-point messaging module
> > > was unable to find any relevant network interfaces:
> > >
> > > Module: OpenFabrics (openib)
> > > Host: glacdyn
> > >
> > > Another transport will be used instead, although this may result in
> > > lower performance.
> > >
> > --------------------------------------------------------------------------
> > > CPL_READ_PARAMS: nCouplingSteps= 5
> > > runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap= 3644
> > > ROmap: 1 599 598 0.100280
> > > ROmap: 3644 4402 4403 0.169626
> > > [glacdyn:04864] 2 more processes have sent help message
> > > help-mpi-btl-base.txt / btl:no-nics
> > > [glacdyn:04864] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> > > all help / error messages
> > >
> > ***********************************************************************************************************************************
> > >
> > > Maybe there is some relation between these error-like messages and
> > > the run getting stuck!
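> > > (If it helps to rule that out, I could try forcing Open MPI to skip the
> > > InfiniBand transport, e.g. something like
> > >   mpirun --mca btl ^openib -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv
> > > assuming my Open MPI version accepts the "--mca btl ^openib" option.)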
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Oct 3, 2012 at 2:13 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > >
> > > > Re-Hi;
> > > >
> > > > Yes; as I said, it got stuck again. I checked the CPU: it is fully
> > > > loaded, but the output file is not being updated! It is only a few
> > > > seconds younger than the start of the run.
> > > >
> > > >
> > > >
> > > > On Wed, Oct 3, 2012 at 1:32 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > >
> > > >> Hi;
> > > >>
> > > >> I think I have tried that too, but the same problem occurred (I will
> > > >> try it again right now to double-check).
> > > >> Will report soon.
> > > >> Thanks
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Oct 3, 2012 at 1:29 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > >>
> > > >>> Hi Taimaz,
> > > >>>
> > > >>> Can you try without MNC? The current set-up (cpl_aim+ocn) does not
> > > >>> use MNC (useMNC=.TRUE., is commented out in both input_atm/data.pkg
> > > >>> and input_ocn/data.pkg), so if there is a problem with NetCDF output
> > > >>> in the coupled set-up code, I might not have seen it (since I have
> > > >>> not tried it recently).
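> > > >>> Just to be explicit about what "without MNC" means here, a minimal
> > > >>> sketch of the relevant part of input_atm/data.pkg (the actual list
> > > >>> of packages in your file will differ):
> > > >>>  &PACKAGES
> > > >>> # useMNC=.TRUE.,
> > > >>>  &
> > > >>> i.e., leave the useMNC line commented out (or set useMNC=.FALSE.).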
> > > >>>
> > > >>> Cheers,
> > > >>> Jean-Michel
> > > >>>
> > > >>> On Tue, Sep 25, 2012 at 11:54:59AM -0230, taimaz.bahadory wrote:
> > > >>> > Hi everybody;
> > > >>> >
> > > >>> > I'm trying to run the coupled model example (cpl_aim+ocn) in the
> > > >>> > verification directory. The first three steps (cleaning; compiling
> > > >>> > and making; copying input files) all passed with no error; but when
> > > >>> > I run the coupler, it starts and creates the netCDF output files
> > > >>> > initially, then stops updating them and the other output files,
> > > >>> > although the three "mitgcmuv" executables are still running! It's
> > > >>> > like the program is frozen.
> > > >>> > Has anybody been stuck in such a situation?
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > > >
> >
> >
> >
> >
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support