[MITgcm-support] Stopping execution when NaNs are generated?

Thu Apr 20 11:24:27 EDT 2006

On Thu 20 Apr 2006 05:52, Martin Losch wrote:

> Jody,
> has anyone answered you yet?
> The CFL criterion is evaluated along with the monitor statistics. But
> in fact, the model is stopped if the monitor statistics encounter
> unrealistic temperature values (something like +/- 1e4). So if you
> decrease your monitorFreq, the model will catch the problems before
> the large numbers have turned into NaNs (at the cost of very frequent
> monitor statistics).
>
> Martin
>
> On Apr 14, 2006, at 10:02 PM, Jody Klymak wrote:
> > Hi All,
> >
> > Just getting started with MITgcm - I have some simple 2-D runs
> > going on a parallel Linux cluster here at UCSD.
> >
> > Is there a flag to set that will stop execution when the model run
> > hits the CFL criterion and starts to generate NaNs?  Or should I
> > just decrease (increase?) monitorFreq?  Rookie mistake, but I just
> > burnt up 5h of grid time running the model on NaNs.
> >
> > Thanks a lot,  Jody
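
The interaction between the monitor interval and a fast blow-up can be illustrated with a toy sketch in Python. This is not MITgcm code: the update rule, the growth factor and the function name are made up for illustration. Only the idea of a periodic check against a threshold of roughly 1e4 comes from the message quoted above.

```python
import math

def run_toy_model(n_steps, monitor_every, growth=1e100, limit=1e4):
    """Return (step, reason) at which a periodic check stops the run.

    reason is "large" if the value is still finite but beyond `limit`,
    "nan" if it is already NaN or Infinity, and None if no check fired.
    """
    t = 1.0
    for step in range(1, n_steps + 1):
        # Hypothetical unstable update: amplify, then take a difference.
        # Once t overflows to inf, inf - inf yields NaN.
        t = t * growth
        t = t - 0.5 * t
        # The check only runs every `monitor_every` steps.
        if step % monitor_every == 0:
            if not math.isfinite(t):
                return step, "nan"
            if abs(t) > limit:
                return step, "large"
    return None, None
```

With a check on every step, `run_toy_model(20, 1)` stops at step 1 while the value is still finite. With a check every 10 steps, `run_toy_model(20, 10)` only stops at step 10, by which time the value has been NaN since step 4 and the intervening steps were wasted.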

All the tricks mentioned above should work, but sometimes NaNs are generated 
within 3 timesteps, and computing monitor statistics that often may be 
impractical. For a more general solution that catches things at (or close to) 
the moment an Infinity or a NaN is generated, and not before, there are two options:

a) A generic approach: if your system has the GNU Scientific Library 
(libgsl) installed and you have a recent version of MITgcm, you can use the 
-gsl flag that genmake2 provides. libgsl is usually available on most, if not 
all, Linux distributions, and you can install it yourself relatively easily 
on other operating systems. You can then control exception handling (for NaNs 
and Infinities) and even precision (on processors that provide that, like 
the x87 floating point unit) using environment variables, as described in 
http://www.gnu.org/software/gsl/manual/html_node/Setting-up-your-IEEE-environment.html#Setting-up-your-IEEE-environment

So you would simply set the following environment variable, which masks 
underflows and denormalized values and leaves the other exceptions unmasked:

GSL_IEEE_MODE="mask-underflow,mask-denormalized"
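
If the model is launched from a script, the same setting can be passed through the process environment. The following Python sketch is only an illustration: the helper names and the executable path ./mitgcmuv are assumptions, and the keyword set is the list of exception options given in the GSL manual page linked above. The variable only has an effect if the executable was built with the -gsl flag.

```python
import os
import subprocess

# Exception keywords accepted in GSL_IEEE_MODE according to the GSL manual.
GSL_EXCEPTION_KEYWORDS = {
    "mask-all", "mask-invalid", "mask-denormalized",
    "mask-division-by-zero", "mask-overflow", "mask-underflow",
    "trap-inexact", "trap-common",
}

def gsl_ieee_mode(keywords):
    """Join exception keywords into a GSL_IEEE_MODE value, rejecting typos."""
    unknown = [k for k in keywords if k not in GSL_EXCEPTION_KEYWORDS]
    if unknown:
        raise ValueError("unknown GSL_IEEE_MODE keyword(s): " + ", ".join(unknown))
    return ",".join(keywords)

def run_model(executable="./mitgcmuv",
              keywords=("mask-underflow", "mask-denormalized")):
    """Run the (assumed) model executable with GSL_IEEE_MODE set."""
    # Copy the current environment and add the one extra variable.
    env = dict(os.environ, GSL_IEEE_MODE=gsl_ieee_mode(keywords))
    return subprocess.run([executable], env=env).returncode
```

Validating the keywords before launching is a design choice of this sketch, meant to catch a misspelt keyword before any cluster time is spent.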

b) A platform-specific approach: check for compiler flags that enable trapping 
of floating point exceptions. For example, with the Intel compilers:

       -fpe<n>
              Specifies floating-point exception handling for the main program
              at run-time.  You can specify one of the  following  values  for
              <n>:

              0 - Floating underflow results in zero; all other floating-point
              exceptions abort execution.

              1 - Floating underflow results in zero; all other floating-point
              exceptions  produce  exceptional  values  (signed  Infinities or
              NaNs) and execution continues.

              3 - All floating-point  exceptions  produce  exceptional  values
              (signed Infinities, denormals, or NaNs) and execution continues.
              This is the default; it provides full IEEE support.   (Also  see
              -ftz.)

and -fpe0 might be what you need. With the PGI compilers:

       -Ktrap=[option,[option]...]
              Controls the behavior of the processor when floating-point
              exceptions occur. Possible options include fp, align (ignored),
              inv, denorm, divz, ovf, unf, and inexact. -Ktrap is only
              processed when compiling a main function/program. The options
              inv, denorm, divz, ovf, unf, and inexact correspond to the
              processor’s exception mask bits invalid operation, denormalized
              operand, divide-by-zero, overflow, underflow, and precision,
              respectively. Normally, the processor’s exception mask bits are
              on (floating-point exceptions are masked; the processor recovers
              from exceptions and continues). If a floating-point exception
              occurs and its corresponding mask bit is off (or "unmasked"),
              execution terminates with an arithmetic exception (C’s FPE
              signal).  -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.

and -Ktrap=fp should be what you need. On Alpha Tru64 systems:

       -fpe
              Denormalized values: Sets any calculated denormalized value
              (result) to zero and lets the program continue. A message is
              displayed only if -check underflow is also specified. Any use
              of a denormalized number (invalid data) in an arithmetic
              expression results in an invalid operand error. The program
              stops, creating a core dump file.

              Exceptional values: Exceptional values are not allowed. The
              program terminates after displaying a message and creating a
              core dump file. The exception location is one or more
              instructions after the instruction that caused the exception,
              unless -synchronous_exceptions was specified.

       -fpe1
              Denormalized values: Sets any calculated denormalized value to
              zero and lets the program continue. A message is displayed only
              if -check underflow is also specified. Use of a denormalized
              (or exceptional) number in an arithmetic expression results in
              program continuation, but with slower performance.

              Exceptional values: The program continues (no core dump). No
              message is displayed. A NaN or Infinity (+ or -) exceptional
              value is generated.

       -fpe2
              Denormalized values: Sets any calculated denormalized value to
              zero and lets the program continue. A message is displayed
              (-check underflow is not needed). Use of a denormalized (or
              exceptional) number in an arithmetic expression results in
              program continuation, but with slower performance.

              Exceptional values: The program continues (no core dump). A
              message is displayed a maximum of twice for each type of
              exception. A NaN or Infinity (+ or -) is generated.

       -fpe3
              Denormalized values: Leaves any calculated denormalized value
              as is. The program continues, allowing gradual underflow. Use
              of a denormalized (or exceptional) number in an arithmetic
              expression results in program continuation, but with slower
              performance. A message is displayed only if -check underflow
              is also specified.

              Exceptional values: The program continues (no core dump). No
              message is displayed. A NaN or Infinity (+ or -) is generated.

       -fpe4
              Denormalized values: Leaves any calculated denormalized value
              as is. The program continues, allowing gradual underflow. Use
              of a denormalized (or exceptional) number in an arithmetic
              expression results in program continuation, but with slower
              performance. A message is displayed (-check underflow is not
              needed).

              Exceptional values: The program continues (no core dump). A
              message is displayed a maximum of twice for each type of
              exception. A NaN or Infinity (+ or -) is generated.

Similar flags may exist on other compilers/operating systems, for example on 
IBM AIX systems:

-qflttrap=invalid:zerodivide:overflow:enable

Unfortunately, such flags may also come with a performance hit (depending on 
the processor, as seen in the case of the Alpha). To reduce the impact of 
checking for an exception flag on IBM systems, one can use

-qflttrap=invalid:zerodivide:overflow:imprecise:enable

and so on. You need to check your man pages to find out whether these flags 
need to be provided for every file, just for the main program, at link time, 
or all of the above.
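
When trapping is enabled, a run that hits an exception is typically killed by the floating-point exception signal (SIGFPE, the "FPE signal" in the PGI description above). A job script can detect this so that it does not carry on with a failed run. A minimal Python sketch, assuming a POSIX system and a serial executable launched directly through Python's subprocess module (an MPI launcher may report failures differently); the function name is made up for illustration:

```python
import signal

def killed_by_fpe(returncode):
    """True if a subprocess return code indicates death by SIGFPE.

    On POSIX, Python's subprocess module reports a child killed by
    signal N as return code -N.
    """
    return returncode == -signal.SIGFPE
```

The return code would come from something like `subprocess.run([...]).returncode`, where the command line is whatever you use to start the model.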

Constantinos
-- 
Dr. Constantinos Evangelinos                    Room 54-1518, EAPS/MIT
Earth, Atmospheric and Planetary Sciences       77 Massachusetts Avenue
Massachusetts Institute of Technology           Cambridge, MA 02139
+1-617-253-5259/+1-617-253-4464 (fax)           USA