[MITgcm-support] Stopping execution when NaNs are generated?
Constantinos Evangelinos
ce107 at ocean.mit.edu
Thu Apr 20 11:24:27 EDT 2006
On Thu 20 Apr 2006 05:52, Martin Losch wrote:
> Jody,
> has anyone answered you yet?
> The CFL criterion is evaluated along with the monitor statistics. But
> in fact, the model is stopped if the monitor statistics encounter
> unrealistic temperature values (something like +/- 1e4). So if you
> decrease your monitorFreq, the model will catch the problems before
> the large numbers have turned into nans (at the cost of very frequent
> monitor statistics).
>
> Martin
>
> On Apr 14, 2006, at 10:02 PM, Jody Klymak wrote:
> > Hi All,
> >
> > Just getting started with MITgcm - I have some simple 2-D runs
> > going on a parallel Linux cluster here at UCSD.
> >
> > Is there a flag to set that will stop execution when the model run
> > hits the CFL criterion and starts to generate NaNs? Or should I
> > just decrease (increase?) monitorFreq? Rookie mistake, but I just
> > burnt up 5h of grid time running the model on NaNs.
> >
> > Thanks a lot, Jody
All the tricks mentioned above should work, but sometimes NaNs are generated
within 3 timesteps, and that might be too frequent for monitorFreq. For a more
general solution that catches things at (or closer to) the moment an Infinity
or a NaN is generated, rather than beforehand, there are two options:
a) A generic approach: provided your system has the GNU Scientific Library
(libgsl, available on most if not all Linux distributions and relatively easy
to install yourself on other OSes) and you have a recent version of MITgcm,
you can use the -gsl flag that genmake2 provides. You can then control
exception handling (for NaNs and Infinities) and even precision (on
processors that support it, such as the x87 floating point unit) through
environment variables, as described in
http://www.gnu.org/software/gsl/manual/html_node/Setting-up-your-IEEE-environment.html#Setting-up-your-IEEE-environment
So you would simply set the following environment variable:
GSL_IEEE_MODE="mask-underflow,mask-denormalized"
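As a rough sketch of how option (a) might look at run time: the snippet below assumes a bash-like shell and an executable called mitgcmuv in the current directory that was built with genmake2 -gsl. The executable path and the output file name are assumptions for illustration, not something the setup requires.

```shell
# Mask (ignore) only underflow and denormal exceptions. The exceptions left
# unmasked (invalid operation, division by zero, overflow) are the ones that
# should stop the run where the platform supports trapping.
export GSL_IEEE_MODE="mask-underflow,mask-denormalized"

# Assumed executable name and location; adjust to your build directory.
exe=./mitgcmuv
if [ -x "$exe" ]; then
    "$exe" > output.txt 2>&1
    echo "model exited with status $?"
else
    echo "no executable at $exe; build it with genmake2 -gsl first"
fi
```

On a parallel run the variable has to be visible to every process, and how that is arranged depends on your MPI launcher and batch system.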
b) A platform-specific one: check for compiler flags that enable trapping of
floating point exceptions. For example, with the Intel compilers:
    -fpe<n>
        Specifies floating-point exception handling for the main program
        at run-time. You can specify one of the following values for
        <n>:
        0 - Floating underflow results in zero; all other floating-point
            exceptions abort execution.
        1 - Floating underflow results in zero; all other floating-point
            exceptions produce exceptional values (signed Infinities or
            NaNs) and execution continues.
        3 - All floating-point exceptions produce exceptional values
            (signed Infinities, denormals, or NaNs) and execution continues.
            This is the default; it provides full IEEE support. (Also see
            -ftz.)
and -fpe0 might be what you need. With the PGI compilers:
    -Ktrap=[option,[option]...]
        Controls the behavior of the processor when floating-point
        exceptions occur. Possible options include fp, align (ignored),
        inv, denorm, divz, ovf, unf, and inexact. -Ktrap is only
        processed when compiling a main function/program. The options
        inv, denorm, divz, ovf, unf, and inexact correspond to the
        processor’s exception mask bits invalid operation, denormalized
        operand, divide-by-zero, overflow, underflow, and precision,
        respectively. Normally, the processor’s exception mask bits are
        on (floating-point exceptions are masked; the processor recovers
        from exceptions and continues). If a floating-point exception
        occurs and its corresponding mask bit is off (or "unmasked"),
        execution terminates with an arithmetic exception (C’s FPE
        signal). -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.
and -Ktrap=fp should be what you need. On Alpha Tru64 systems:
    -fpe   Sets any calculated denormalized value (result) to zero and lets
           the program continue. A message is displayed only if -check
           underflow is also specified. Any use of a denormalized number
           (invalid data) in an arithmetic expression results in an invalid
           operand error. The program stops, creating a core dump file.
           Exceptional values are not allowed. The program terminates after
           displaying a message and creating a core dump file. The exception
           location is one or more instructions after the instruction that
           caused the exception, unless -synchronous_exceptions was specified.

    -fpe1  Sets any calculated denormalized value to zero and lets the
           program continue. A message is displayed only if -check underflow
           is also specified. Use of a denormalized (or exceptional) number
           in an arithmetic expression results in program continuation, but
           with slower performance. The program continues (no core dump). No
           message is displayed. A NaN or Infinity (+ or -) exceptional value
           is generated.

    -fpe2  Sets any calculated denormalized value to zero and lets the
           program continue. A message is displayed (-check underflow is not
           needed). Use of a denormalized (or exceptional) number in an
           arithmetic expression results in program continuation, but with
           slower performance. The program continues (no core dump). A
           message is displayed a maximum of twice for each type of
           exception. A NaN or Infinity (+ or -) is generated.

    -fpe3  Leaves any calculated denormalized value as is. The program
           continues, allowing gradual underflow. Use of a denormalized (or
           exceptional) number in an arithmetic expression results in program
           continuation, but with slower performance. A message is displayed
           only if -check underflow is also specified. The program continues
           (no core dump). No message is displayed. A NaN or Infinity
           (+ or -) is generated.

    -fpe4  Leaves any calculated denormalized value as is. The program
           continues, allowing gradual underflow. Use of a denormalized (or
           exceptional) number in an arithmetic expression results in program
           continuation, but with slower performance. A message is displayed
           (-check underflow is not needed). The program continues (no core
           dump). A message is displayed a maximum of twice for each type of
           exception. A NaN or Infinity (+ or -) is generated.
Similar flags may exist for other compilers/operating systems, for example on
IBM AIX systems:
    -qflttrap=invalid:zerodivide:overflow:enable
Unfortunately, they may also come with a performance hit (depending on the
processor, as seen in the case of the Alpha). To reduce the impact of
checking for an exception flag on IBMs, one uses
    -qflttrap=invalid:zerodivide:overflow:imprecise:enable
etc. You need to check your man pages to find out whether these flags need to
be provided for every file, just for the main program, at link time, or all of
the above.
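
One place such flags could go, assuming you build through genmake2 with an optfile, is the compiler flag variable in that optfile. The fragment below is only a sketch: the compiler choice is an assumption, and whether FFLAGS alone is enough (as opposed to also needing the flag at link time) is exactly the thing to check in your man pages.

```shell
# Hypothetical optfile fragment: genmake2 optfiles are shell variable
# assignments, so a trapping flag can be appended to the compiler flags.
FC=ifort                        # assumed compiler; use your own
FFLAGS="${FFLAGS:-} -fpe0"      # Intel: abort on exceptions other than underflow

# Alternatives quoted in this message (uncomment the one for your compiler):
# FFLAGS="${FFLAGS:-} -Ktrap=fp"                                     # PGI
# FFLAGS="${FFLAGS:-} -qflttrap=invalid:zerodivide:overflow:enable"  # IBM XL

echo "FC=$FC FFLAGS=$FFLAGS"
```

If the fragment were saved as my_optfile (a made-up name), it would be handed to genmake2 through its optfile option; genmake2 -help shows the exact spelling for your version.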
Constantinos
--
Dr. Constantinos Evangelinos Room 54-1518, EAPS/MIT
Earth, Atmospheric and Planetary Sciences 77 Massachusetts Avenue
Massachusetts Institute of Technology Cambridge, MA 02139
+1-617-253-5259/+1-617-253-4464 (fax) USA