[MITgcm-support] Coupled model running!
Jean-Michel Campin
jmc at ocean.mit.edu
Wed Oct 10 11:37:35 EDT 2012
Hi Taimaz,
Apart from turning on "useMNC=.TRUE.,", did you change any other
parameters/files from the standard set-up in verification/cpl_aim+ocn ?
And one thing you could try (since the atmosphere seems to get to the 1st
iteration, but it is hard to tell anything about the ocean, since its silence
could just be due to not flushing the output buffer often enough) would be
to run for zero iterations, changing:
rank_0/data, 1st line, replacing 5 with 0
rank_1/data, replacing nTimeSteps=5, with nTimeSteps=0,
rank_2/data, replacing nTimeSteps=40, with nTimeSteps=0,
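
A minimal sketch of those three edits (assuming the first line of the
coupler's rank_0/data holds the number of coupling steps, and that
nTimeSteps sits in the usual PARM03 namelist of the two model data files):

    rank_0/data, first line (number of coupling steps):
    0

    rank_1/data and rank_2/data, inside the PARM03 namelist:
     &PARM03
     nTimeSteps=0,
     ...
     &

With zero steps, both components should initialize, exchange nothing and
reach "STOP NORMAL END"; if one of them still hangs, the problem is likely
in the start-up/coupler handshake rather than in the time stepping.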
Cheers,
Jean-Michel
On Tue, Oct 09, 2012 at 06:11:27PM -0230, taimaz.bahadory wrote:
> Here are the "std_outp" contents:
>
> CPL_READ_PARAMS: nCouplingSteps= 5
> runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap= 3644
> ROmap: 1 599 598 0.100280
> ROmap: 3644 4402 4403 0.169626
>
>
> and the last lines of two STDOUT files:
>
> rank_1:
> ...
> ...
> ++++++++++++++++++++++vtrv++++++....++++++++++++++++++++++++yxxyyyyy....q++py++zyxzzzzzyxxxwsuwwxxxx++++....+++++++++++++++++++++++yyxyzzzyy....+++++
> +++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> (PID.TID 0000.0001) 4
> ..++++++++++++++++++++++yyy+++++++....++++++++++++++++++++++++yyyzzzzz....snnny++zyyyyzzywvtuvtvwwxxwx+++y....++++++++++++++++++w+++xxyyzzzzzz....+++++
> +++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> (PID.TID 0000.0001) 3
> ..++++++++++++++++++++++++++++++++....+++++++++++++++++++++++++++++++z....tmnvz+zzzyxxxxwqpnotsuvwyxwy++++....y++++++++++++++++wv+++zyzzzzzyz+....+++++
> +++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> (PID.TID 0000.0001) 2
> ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....unny+zzzzzxxwwrnmmmqsuvvxxw+++++....y++++++++++++++++w++++zyyz+zyyy+....+++++
> +++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++..
> (PID.TID 0000.0001) 1
> ..++++++++++++++++++++++++++++++++....++++++++++++++++++++++++++++++++....uooy+yyzzyxwwqoomlmqtvutwyy+++++....y++++++++++++++++y++++zyyyzzzzz+....+++++
> +++++++++++++++++++++++++++....++++++++++++++++++y+++++++++++++..
>
>
> rank_2:
> ...
> ...
> (PID.TID 0000.0001) // Model current state
> (PID.TID 0000.0001) //
> =======================================================
> (PID.TID 0000.0001)
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: albedo_cs32.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: vegetFrc.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: seaSurfT.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: lndSurfT.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: seaIce.cpl3FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: snowDepth.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> (PID.TID 0000.0001) MDS_READ_FIELD: opening global file: soilMoist.cpl_FM.bin
> (PID.TID 0000.0001) SOLVE_FOR_PRESSURE: putPmEinXvector = F
> cg2d: Sum(rhs),rhsMax = 3.97903932025656E-12 4.56406615217961E+03
> cg2d: Sum(rhs),rhsMax = -1.00612851383630E-11 9.58133938764946E+03
>
>
>
> Thanks again for your time
>
>
>
> On Tue, Oct 9, 2012 at 5:01 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>
> > Hi Taimaz,
> >
> > Looks like everything is normal. The next thing to check is:
> >
> > Could you check and send the content of file:
> > cpl_aim+ocn/std_out
> > which is the standard output of mpirun ?
> >
> > And can you also check the last part of ocean & atmos STDOUT:
> > rank_1/STDOUT.0000
> > rank_2/STDOUT.0000
> > to see where each component is stuck ?
> >
> > Jean-Michel
> >
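
For reference, a quick way to pull exactly those pieces out of the run
directory (file names as they appear in this thread):

    tail -n 30 rank_1/STDOUT.0000    # last lines of the ocean STDOUT
    tail -n 30 rank_2/STDOUT.0000    # last lines of the atmos STDOUT
    cat std_outp                     # captured standard output of mpirun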
> > On Tue, Oct 09, 2012 at 02:35:48PM -0230, taimaz.bahadory wrote:
> > > Thanks for your complete reply;
> > >
> > > 1) Yes, I did. I have run the "aim.5l_LatLon" before with MPI enabled (40
> > > CPUs) without any problem
> > > 2) I had found that out before, and disabled the whole optfile-finding
> > > section, replacing it with my own, which refers to a modified
> > > "linux_amd64_gfortran" optfile that points to my correct MPI and netCDF
> > > paths (I used it for my previous runs too, with no error)
> > > 3) Here is the only thing printed on my screen after running
> > > "./run_cpl_test 3":
> > >
> > > /home/tbahador/programs/MITgcm/verification/cpl_aim+ocn/run/tt
> > > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > >
> > > and then it freezes there.
> > > But as I check the three "rank" directories, there are mnc_* directories
> > > and some other output files generated there, which shows that they are
> > > created initially, but none of them is ever updated afterwards! This is
> > > where I'm stuck.
> > >
> > >
> > >
> > >
> > >
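
One way to confirm that the components really are stalled rather than just
quiet (an editorial aside; the rank_* directories are those of this run) is
to watch the modification times of their output files:

    # re-list the run directories every 10 s; frozen timestamps plus
    # fully loaded CPUs usually mean the processes are busy-waiting
    # inside MPI for a message that never arrives
    watch -n 10 "ls -l rank_1 rank_2"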
> > > On Tue, Oct 9, 2012 at 12:21 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > >
> > > > Hi Taimaz,
> > > >
> > > > The coupled set-up is used by several users on different
> > > > platforms, so we should find a way for you to run it.
> > > > But regarding the script "run_cpl_test" in verification/cpl_aim+ocn/ ,
> > > > it has not been used so much (plus it pre-dates some changes
> > > > in genmake2) and could have been better written.
> > > >
> > > > So, we will need to check each step to see where the problem is.
> > > >
> > > > 1) have you tried to run a simple (i.e., not coupled) verification
> > > > experiment using MPI ? this would confirm that libs and mpirun
> > > > are working well on your platform.
> > > >
> > > > 2) need to check which optfile is being used (run_cpl_test is not
> > > > well written regarding this optfile selection, and it expects an
> > > > optfile "*+mpi" in the verification directory !).
> > > > The "run_cpl_test 2" command should report it as:
> > > > > Using optfile: OPTFILE_NAME (compiler=COMPILER_NAME)
> > > > It might also be useful to send the first 100 lines of build_atm/Makefile,
> > > > just to check (see the optfile sketch after this message).
> > > >
> > > > 3) need to check if run_cpl_test recognizes an OpenMPI build and
> > > > proceeds with the right command.
> > > > Could you send all the printed information that "run_cpl_test 3"
> > > > is producing ? the command should be printed as:
> > > > > execute 'mpirun ...
> > > > In my case, a successful run using OpenMPI gives me:
> > > > > execute 'mpirun -np 1 ./build_cpl/mitgcmuv : -np 1 ./build_ocn/mitgcmuv : -np 1 ./build_atm/mitgcmuv' :
> > > > > MITCPLR_init1: 2 UV-Atmos MPI_Comm_create MPI_COMM_compcplr= 6 ierr= 0
> > > > > MITCPLR_init1: 2 UV-Atmos component num= 2 MPI_COMM= 5 6
> > > > > MITCPLR_init1: 2 UV-Atmos Rank/Size = 1 / 2
> > > > > MITCPLR_init1: 1 UV-Ocean Rank/Size = 1 / 2
> > > > > CPL_READ_PARAMS: nCouplingSteps= 5
> > > > > runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap= 3644
> > > > > ROmap: 1 599 598 0.100280
> > > > > ROmap: 3644 4402 4403 0.169626
> > > > > Exporting (pid= 0 ) atmospheric fluxes at iter. 0
> > > > > Importing (pid= 0 ) oceanic fields at iteration 0
> > > > > Exporting (pid= 0 ) atmospheric fluxes at iter. 8
> > > > > Importing (pid= 0 ) oceanic fields at iteration 8
> > > > > Exporting (pid= 0 ) atmospheric fluxes at iter. 16
> > > > > Importing (pid= 0 ) oceanic fields at iteration 16
> > > > > Exporting (pid= 0 ) atmospheric fluxes at iter. 24
> > > > > Importing (pid= 0 ) oceanic fields at iteration 24
> > > > > Exporting (pid= 0 ) atmospheric fluxes at iter. 32
> > > > > Importing (pid= 0 ) oceanic fields at iteration 32
> > > > > STOP NORMAL END
> > > > > STOP NORMAL END
> > > >
> > > > Once all these steps are checked and are OK, we can start to dig into
> > > > the coupling log files.
> > > >
> > > > Cheers,
> > > > Jean-Michel
> > > >
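
Regarding point 2 above, a hedged sketch of how an optfile is normally passed
explicitly when building by hand with genmake2, outside of run_cpl_test
(assuming the usual code_atm modification directory of this experiment;
adjust paths to your tree):

    cd build_atm
    ../../../tools/genmake2 -mpi -mods=../code_atm \
        -of ../../../tools/build_options/linux_amd64_gfortran
    make depend && make

run_cpl_test wraps these steps itself, so this is only a way to cross-check
that the same optfile compiles a working MPI executable outside the script.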
> > > > On Wed, Oct 03, 2012 at 04:21:49PM -0230, taimaz.bahadory wrote:
> > > > > There is a "stdout" file generated in the main run directory,
> > > > > with these contents:
> > > > >
> > > > >
> > > > >
> > > > > ***********************************************************************
> > > > > CMA: unable to get RDMA device list
> > > > > librdmacm: couldn't read ABI version.
> > > > > librdmacm: assuming: 4
> > > > > librdmacm: couldn't read ABI version.
> > > > > librdmacm: assuming: 4
> > > > > CMA: unable to get RDMA device list
> > > > > librdmacm: couldn't read ABI version.
> > > > > librdmacm: assuming: 4
> > > > > CMA: unable to get RDMA device list
> > > > >
> > > > >
> > > > > --------------------------------------------------------------------------
> > > > > [[9900,1],2]: A high-performance Open MPI point-to-point messaging
> > > > > module was unable to find any relevant network interfaces:
> > > > >
> > > > > Module: OpenFabrics (openib)
> > > > > Host: glacdyn
> > > > >
> > > > > Another transport will be used instead, although this may result in
> > > > > lower performance.
> > > > >
> > > > >
> > > > > --------------------------------------------------------------------------
> > > > > CPL_READ_PARAMS: nCouplingSteps= 5
> > > > > runoffmapFile =>>runOff_cs32_3644.bin<<= , nROmap= 3644
> > > > > ROmap: 1 599 598 0.100280
> > > > > ROmap: 3644 4402 4403 0.169626
> > > > > [glacdyn:04864] 2 more processes have sent help message
> > > > > help-mpi-btl-base.txt / btl:no-nics
> > > > > [glacdyn:04864] Set MCA parameter "orte_base_help_aggregate" to 0
> > > > > to see all help / error messages
> > > > >
> > > > > ***********************************************************************
> > > > >
> > > > > Maybe there is some relation between these error-like messages
> > > > > and the run getting stuck!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
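
An editorial aside: the librdmacm/openib warnings above mean Open MPI found
no InfiniBand hardware and fell back to TCP; they are usually harmless, but
one way to rule the network layer out as a cause of the hang is to restrict
Open MPI to its basic transports (standard MCA syntax; the executables are
those of this run):

    # use only the self, shared-memory and TCP byte-transfer layers
    mpirun --mca btl self,sm,tcp \
           -np 1 ./build_cpl/mitgcmuv : \
           -np 1 ./build_ocn/mitgcmuv : \
           -np 1 ./build_atm/mitgcmuv

If the run still freezes with the OpenFabrics transport excluded, these
warnings can be crossed off the list of suspects.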
> > > > > On Wed, Oct 3, 2012 at 2:13 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > >
> > > > > > Re-Hi;
> > > > > >
> > > > > > Yes; as I said, it got stuck again. I checked the CPU: it is fully
> > > > > > loaded, but the output file is not updated! It is only a few seconds
> > > > > > younger than the run initiation.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Oct 3, 2012 at 1:32 PM, taimaz.bahadory <taimaz.bahadory at mun.ca> wrote:
> > > > > >
> > > > > >> Hi;
> > > > > >>
> > > > > >> I guess I've tried that too, but the same problem occurred (I will
> > > > > >> try it again right now to double-check).
> > > > > >> I will report back soon.
> > > > > >> Thanks
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Wed, Oct 3, 2012 at 1:29 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > > > > >>
> > > > > >>> Hi Taimaz,
> > > > > >>>
> > > > > >>> Can you try without MNC ? The current set-up (cpl_aim+ocn) does
> > > > > >>> not use MNC (useMNC=.TRUE., is commented out in both
> > > > > >>> input_atm/data.pkg and input_ocn/data.pkg), so if there was a
> > > > > >>> problem in the coupled set-up code with NetCDF output, we might not
> > > > > >>> have seen it (since I have not tried it with MNC recently).
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Jean-Michel
> > > > > >>>
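
Reverting to the standard (non-MNC) configuration amounts to restoring the
comment marker on the useMNC line in both data.pkg files; a minimal sketch,
assuming the usual PACKAGES namelist layout ('#' starts a comment line in
MITgcm data files):

    input_atm/data.pkg and input_ocn/data.pkg:
     &PACKAGES
    # useMNC=.TRUE.,
     &

With MNC off, output goes to plain binary/MDS files instead of netCDF, which
takes the pkg/mnc code out of the picture for this test.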
> > > > > >>> On Tue, Sep 25, 2012 at 11:54:59AM -0230, taimaz.bahadory wrote:
> > > > > >>> > Hi everybody;
> > > > > >>> >
> > > > > >>> > I'm trying to run the coupled model example (cpl_aim+ocn) in
> > > > > >>> > the verification directory. The first three steps (cleaning;
> > > > > >>> > compiling and making; copying input files) passed with no error;
> > > > > >>> > but when I run the coupler, it starts and creates the netCDF
> > > > > >>> > output files initially, but then stops updating them and the
> > > > > >>> > other output files, although the three "mitgcmuv" processes are
> > > > > >>> > still running! It is like a program freezing.
> > > > > >>> > Has anybody been stuck in such a situation?
> > > > > >>>