[MITgcm-support] STOP MOM_IMPLICIT_R: error when solving 3-Diag problem.

Gus Correa gus at ldeo.columbia.edu
Sun Sep 13 19:10:16 EDT 2020


Hi Kunal,

In addition to what Jody already said:

1) Take a look at the code file (.f) lines in the backtrace of the error (the
STDERR listed in your last email).

The floating-point exception seems to have happened in the exf package
(which handles external forcings).
Note that the .f files are in your build directory, not in the
original code tree (where the .F source files live).
They have been preprocessed to insert all the ".h" files and expand the
preprocessor directives, so the line numbers differ between the .f and the
original source files.
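
For instance, to look at the context around the line reported in the
traceback (4598 in exf_bulkformulae.f), an untested Python snippet like the
one below prints the surrounding lines; the build-directory path is an
assumption:

    # Print the lines surrounding the reported error in the preprocessed file.
    with open('build/exf_bulkformulae.f') as f:
        for i, line in enumerate(f, start=1):
            if 4590 <= i <= 4605:
                print(i, line, end='')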

Sometimes it is hard to nail down which forcing file is responsible for the
failure, but your data.exf namelist file at least lists the files you are
using.
Inspect your forcing files: make sure they are consistent with the grid,
and that there are no NaNs, missing values, or zeroes where they are not
supposed to be.
Line 4598 of exf_bulkformulae.f may give a hint as to why you get a division
by zero, and hopefully suggest which field in your forcing files is causing
trouble.
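
As a quick check, an untested Python/numpy sketch along these lines can scan
one forcing file for suspicious values; the file name, dimensions, and
big-endian real*4 precision below are assumptions to be replaced with your
own:

    import numpy as np

    # Hypothetical forcing file on a 720 x 1560 grid, stored as big-endian
    # real*4 records; adjust nx, ny, dtype, and the file name to your setup.
    nx, ny = 720, 1560
    data = np.fromfile('u10m.bin', dtype='>f4')
    nrec = data.size // (nx * ny)
    data = data[:nrec * nx * ny].reshape(nrec, ny, nx)

    print('records       :', nrec)
    print('NaNs          :', int(np.isnan(data).sum()))
    print('min / max     :', np.nanmin(data), np.nanmax(data))
    print('all-zero recs :', [k for k in range(nrec) if not data[k].any()])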

2) Pay attention to the warning messages in STDERR.

They are mostly related to the MNC (netCDF I/O) package, and they suggest
some changes you may consider adopting.
My recollection is that MNC doesn't support useSingleCpuIO,
which means that it will write one output file per MPI subdomain,
and you will have to combine these after the model runs.
[There are some Matlab scripts for that, and maybe Python tools too.]
I gave up entirely on using netCDF in MITgcm experiments because of this.
Output through all processors stresses any computer or cluster,
especially when the number of subdomains is large and other users are
also pounding on the I/O.
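
For what it is worth, if the per-tile netCDF files carry global coordinate
values, an untested xarray sketch like the one below may be enough to stitch
them back together (the file pattern is an assumption, and this will not
work for every MNC output layout):

    import xarray as xr

    # Open all tile files of one output stream and merge them by their
    # coordinate values into a single global dataset.
    ds = xr.open_mfdataset('state.*.t*.nc', combine='by_coords')
    ds.to_netcdf('state_global.nc')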

The MITgcm still works better with binary I/O (MDSIO),
where each output field comes as a metadata text file (.meta) plus the
binary data itself (.data).
That is unfortunate: it causes much frustration, and errors that don't
happen in models with a robust netCDF I/O, because binary files are error
prone, and producing and inspecting them is a pain, let alone doing data
analysis on them. But that is the way it is.
If you don't have a parallel file system, MDSIO with useSingleCpuIO=.TRUE.
is probably the most stable way to run the model.
However, you will then have to either read your binary output directly for
QC and data analysis, or write scripts to convert it to netCDF for further
processing.
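
Reading a .data file in Python boils down to something like the untested
sketch below; the dimensions and 32-bit precision are assumptions for
illustration (the true values are listed in the matching .meta file), and if
I remember correctly the MITgcm tree also ships a Python reader (rdmds,
under utils/python/MITgcmutils) that parses the .meta file for you:

    import numpy as np

    # Hypothetical 3-D field written with writeBinaryPrec=32 (big-endian
    # real*4); nx, ny, nz must match what the .meta file reports.
    nx, ny, nz = 720, 1560, 25
    T = np.fromfile('T.0000000000.data', dtype='>f4').reshape(nz, ny, nx)
    print(T.shape, np.nanmin(T), np.nanmax(T))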

Beware that the MITgcm has controls for the binary precision,
readBinaryPrec and writeBinaryPrec, in the data namelist &PARM01.
Besides, the norm is to use "big endian" floating-point format (which is
commonly set through the compilation flags), so that even visualizing the
binaries requires swapping bytes (because virtually all computers today are
"little endian").
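
For illustration only (example values, not a recommendation for your
particular setup), the relevant entries in the data namelist would look
something like:

     &PARM01
      useSingleCpuIO  = .TRUE.,
      readBinaryPrec  = 32,
      writeBinaryPrec = 32,
     &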

People (including myself) can spend a lot of time writing and
debugging these binary-to-netCDF and netCDF-to-binary scripts,
something that would be unnecessary if the model had a complete and robust
netCDF I/O package.

There are some tools for that in the MITgcm code itself, and I suggest that
you start by looking at the existing Matlab scripts:
http://wwwcvs.mitgcm.org/viewvc/MITgcm/MITgcm/utils/matlab/
This is a more organized set of scripts, created by Martin Losch (I think):
http://wwwcvs.mitgcm.org/viewvc/MITgcm/MITgcm/verification/tutorial_global_oce_latlon/diags_matlab/

My recollection is that Jody wrote an extension to the MNC package,
not yet part of the mainstream code, though.

3) To achieve what Jody suggested, you can increase the frequency of the
monitor and data output (monitorFreq and dumpFreq in the data namelist
&PARM03), and also inspect STDOUT.XXXX.

If the model fails right at the start, set them equal to one time step.
The output will be huge, but if the run fails at the beginning this is no
big deal.
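
With your deltaTmom=120.0, that would be something along these lines
(example values only):

     &PARM03
      monitorFreq = 120.,
      dumpFreq    = 120.,
     &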

I hope this helps,
Gus Correa


On Sun, Sep 13, 2020 at 11:23 AM Jody Klymak <jklymak at uvic.ca> wrote:

> Hi Kunal.
>
> Check STDOUT.0000 as that is more relevant.  Does the CFL criterion blow
> up?  If so, you will simply have to reduce the time step. It's possible you
> have also set the model up incorrectly and it is convecting initially,
> which will easily violate the CFL criterion.  It is also hard to tell from
> what you are giving us when the model blows up.  Time step 1?  Time step
> 100? Finally, it's often useful to plot the fields to see where the
> instability is happening.  That may require you to save quite a bit of
> data, but it's hard to debug in the absence of information.
>
> Best of luck!  Jody
>
> On 13 Sep 2020, at 07:22, kunal madkaiker <kunal.madkaiker02 at gmail.com>
> wrote:
>
> Hi Gus,
>
> As per your suggestion, I made the respective changes and tried to run the
> executable again. Below is the log generated
>
> $ mpirun -np 60 ./mitgcmuv
> forrtl: error (72): floating overflow
> Image              PC                Routine            Line     Source
> libifcoremt.so.5   00002AD443246555  for__signal_handl  Unknown  Unknown
> libpthread-2.17.s  00002AD442DB35F0  Unknown            Unknown  Unknown
> libnetcdf.so.15.2  00002AD44121C4B3  __libm_exp_e7      Unknown  Unknown
> mitgcmuv           0000000000AC0FF7  exf_bulkformulae_  4598     exf_bulkformulae.f
> mitgcmuv           0000000000B02334  exf_getforcing_    4430     exf_getforcing.f
> mitgcmuv           000000000128726E  load_fields_drive  2141     load_fields_driver.f
> mitgcmuv           0000000000C45A25  forward_step_      2340     forward_step.f
> mitgcmuv           0000000001290200  main_do_loop_      2078     main_do_loop.f
> mitgcmuv           0000000001C283F6  the_main_loop_     2097     the_main_loop.f
> mitgcmuv           0000000001C28955  the_model_main_    2421     the_model_main.f
> mitgcmuv           0000000001290615  MAIN__             4286     main.f
> mitgcmuv           0000000000403412  Unknown            Unknown  Unknown
> libc-2.17.so       00002AD445C2A505  __libc_start_main  Unknown  Unknown
> mitgcmuv           0000000000403319  Unknown            Unknown  Unknown
>
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> The STDERR file reads:
> (PID.TID 0030.0001) ** WARNING ** MNC_READPARMS: incomplete MNC pickup
> files implementation
> (PID.TID 0030.0001) ** WARNING ** MNC_READPARMS: => pickup_write_mnc=T not
> recommanded
> (PID.TID 0030.0001) ** WARNING ** MNC_READPARMS: => pickup_read_mnc=T not
> working for some set-up
> (PID.TID 0030.0001) ** WARNING ** INI_MODEL_IO: globalFiles=TRUE is not
> safe in Multi-processors (MPI) run
> (PID.TID 0030.0001) ** WARNING ** INI_MODEL_IO: use instead
> "useSingleCpuIO=.TRUE."
> (PID.TID 0030.0001) ** WARNING ** INI_MODEL_IO: use tiled-files to write
> sections (for OBCS)
> (PID.TID 0030.0001) ** WARNING ** EXF_CHECK: wind-stress position
> irrelevant
>
> Attaching data, data.obcs, and data.exf for your reference. I have set
> deltaTmom=120.0.
> What I understand is that the model is blowing up due to the overestimation
> of a few values and not because of any error. Am I right?
>
> Regards
> Kunal
>
> On Sun, Sep 13, 2020 at 6:47 AM Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>> Hi Kunal
>>
>> To try to nail down where, when, and why it fails, you could compile in
>> debugging mode,
>> i.e. start fresh ('make CLEAN' in the build directory, or just wipe that
>> directory off) and rerun genmake2 with the -devel flag (keep the other
>> flags).
>> Then, to increase verbosity, add
>> debugLevel = 4,
>> to the "data" namelist &PARM01,
>> and set the
>> monitorFreq
>> in &PARM03
>> to one or a few time steps.
>> The STDOUT.XXXX and STDERR.XXXX files
>> may give a hint of what is going on (when, where, why it fails).
>>
>> I hope this helps,
>> Gus Correa
>>
>> On Sat, Sep 12, 2020 at 7:38 PM kunal madkaiker <
>> kunal.madkaiker02 at gmail.com> wrote:
>>
>>> Dear All,
>>>
>>> I am trying to simulate the U,V current circulation along the West Coast
>>> of India.
>>> I have a grid of 720 x 1560 at a high resolution of 1.45 km x 1.45 km,
>>> with 25 levels in the vertical from 0 to 2150 m. I have set hFacMin=0.3
>>> and hFacMinDz=10.
>>>
>>> But the model blows up at the initial stage and I get the error:
>>> Note: The following floating-point exceptions are signalling:
>>> IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG
>>> STOP MOM_IMPLICIT_R: error when solving 3-Diag problem.
>>>
>>> I have tried changing viscAh from 1 to 1000 m2/s and viscAz from 0.02 to
>>> 0.001 m2/s. I also tried viscAhgrid=0.1.
>>> I have defined the vertical levels keeping the delZ(k+1)/delZ(k) < 1.4
>>> ratio in mind. But the issue persists. Kindly advise.
>>> Please let me know if any additional information is required from my
>>> side.
>>>
>>> Regards
>>> Kunal
>>>
>>
> <data><data.obcs><data.exf><STDOUT.0025>
>
>
> --
> Jody Klymak
> http://ocean-physics.seos.uvic.ca/~jklymak/
>
>
>
>
>
>
>

