[MITgcm-support] Mysterious initialisation problem - debugging options?

Liam Brannigan Brannigan at atm.ox.ac.uk
Thu Aug 22 12:15:08 EDT 2013


Just an update - there may be a PGI compiler bug, which I'm investigating with the technical staff at the Hector supercomputer.  In particular, the problem seems similar to the one discussed at the link below, given the fsd_exp error message and the fact that the problem occurs in a loop containing an exponential function.
http://www.pgroup.com/userforum/viewtopic.php?t=3884&sid=556649fa763c95d737c457dd1ed310d9

Furthermore, I've printed to STDOUT the values being fed to the exponential function on line 105 of swfrac.F.  There is no sign of a blow-up.  Instead, some of the arguments to the exponential function have a zero in the numerator (rather than the expected large negative real number), so the exponential evaluates to 1.  This is consistent with the bug discussed in the link above, where a floating point exception is raised even though nothing is actually blowing up.
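For the archive, the diagnostic amounted to adding a WRITE statement inside the loop quoted further down in this thread, roughly as follows (a sketch only; the variable names are those of swfrac.F, but the WRITE itself is my addition and not part of the released code):

      DO i = 1,imax
        facz = fact*swdk(i)
C       diagnostic only: a genuine blow-up would appear as a NaN or a
C       huge value; in the spurious-FPE case facz is simply zero, for
C       which exp() should just return 1
        WRITE(*,'(A,I5,1P2E16.8)') ' swfrac diag: i, facz, swdk = ',
     &        i, facz, swdk(i)
C       ... the original IF/exp block quoted further down is unchanged
      ENDDO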

If anyone else has experienced this issue, I would appreciate it if you could get in touch.  Many thanks to those of you who have already been very helpful.

Liam
________________________________
From: Liam Brannigan
Sent: 22 August 2013 12:33
To: mitgcm-support at mitgcm.org
Subject: Mysterious initialisation problem - debugging options?


Hi Jean-Michel (& everyone else)

Thanks for your suggestions.  I added debugMode=.TRUE., to the eedata file and got the full
output - but only for the nIter=0 case.  If I make it do a timestep, it still
crashes without flushing anything to the STDOUT/STDERR files.  The flags you describe below are
not present in my Makefile or at the bottom of genmake.log.  I tried adding
-DHAVE_FLUSH to the DEFINES list after running genmake2, but then <make> complained and wouldn't compile until I deleted it again.  Is there somewhere else
these can be set?

To proceed with the debugging, I've been looking at the core files generated when the model crashes trying to do the first timestep.
(For the sake of the archive, I've set out how to do this below.)
The output from the debugger is as follows (with useKPP=.FALSE. in data.pkg):
#0  0x0000000000b930f0 in __fsd_exp ()
#1  0x00000000007238b1 in swfrac () at ./swfrac.f:411
#2  0x00000000006a38ce in external_forcing_t () at ./external_forcing.f:9208
#3  0x000000000067617c in calc_gt () at ./calc_gt.f:2096
#4  0x0000000000727880 in thermodynamics () at ./thermodynamics.f:3640
#5  0x00000000006b6553 in forward_step () at ./forward_step.f:1979
#6  0x0000000000723a35 in the_main_loop () at ./the_main_loop.f:1779
#7  0x0000000000723b95 in the_model_main () at ./the_model_main.f:2267
#8  0x0000000000657ab0 in main () at ./main.f:4290
#9  0x000000000040070e in main ()

This implies that the problem is in swfrac.f and that the changes I made to the kpp files are not the (direct) cause of the problem.
I have also run with useKPP=.TRUE., and
the model generates a floating point exception wherever swfrac.f is called (e.g. in BLDEPTH.f or
KPP_calc.f).  This is strange, however, as I'm not using radiative heating; I simply apply a heat flux at the surface.
I should note that I haven't modified the swfrac.F file (and a <diff> of swfrac.f between experiments which do and don't run shows no differences).

The swfrac.F (v.1.15) routine isn't particularly complicated, but I'm not sure whether it does anything at all when only an imposed
surface heat flux is being used.  If it isn't doing anything, I could work around the problem by commenting out the loop:
      DO i = 1,imax
        facz = fact*swdk(i)
        IF ( facz .LT. -200. _d 0 ) THEN
          swdk(i) = 0. _d 0
        ELSE
          swdk(i) =       rfac(jwtype)  * exp( facz/a1(jwtype) )
     &       + (1. _d 0 - rfac(jwtype)) * exp( facz/a2(jwtype) )
        ENDIF
      ENDDO

Obviously, though, I'd prefer to know what has caused this change in behaviour (although the model does run with the hack in place).  Does anyone have
an idea of what could cause this?
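A less drastic workaround than deleting the loop might be to treat the zero arguments explicitly (untested, and it assumes the floating point exception really does come from exp being handed a zero argument; since rfac + (1 - rfac) = 1, returning 1 in that case is mathematically equivalent):

      DO i = 1,imax
        facz = fact*swdk(i)
        IF ( facz .LT. -200. _d 0 ) THEN
          swdk(i) = 0. _d 0
        ELSEIF ( facz .EQ. 0. _d 0 ) THEN
C         for facz = 0 the analytic result is rfac + (1-rfac) = 1, so
C         skip the exp calls that appear to trip PGI's __fsd_exp
          swdk(i) = 1. _d 0
        ELSE
          swdk(i) =       rfac(jwtype)  * exp( facz/a1(jwtype) )
     &       + (1. _d 0 - rfac(jwtype)) * exp( facz/a2(jwtype) )
        ENDIF
      ENDDO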


Core files - usage
These are binary files generated on Linux when an application crashes.  The file appears in the same directory as the model output
and is simply called 'core'.
Core files are inspected with a debugger such as gdb.  The gdb program needs to know the path to the executable
that produced the run (mitgcmuv) and the path to the core file.  To run it, enter <gdb path-to-mitgcmuv path-to-core [enter]>.
Typing <backtrace [enter]> then lists the routines that had been called when the crash occurred, most recent first, which is how the output above was obtained.

Liam

Hi Liam,

I would suggest setting:
 debugMode=.TRUE.,
in parameter file "eedata".
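In eedata that would look something like the following fragment (a sketch only; the namelist group name is the one used in the standard verification experiments, so check it against your own file):

 &EEPARMS
 debugMode=.TRUE.,
 &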
In addition to writing more things to STDOUT (e.g., which routines are
called), it also flushes the STDOUT unit so that you will have a better
chance to see something in STDOUT (& STDERR).

This "flush" command is not always available, but you can check for this
a) in the Makefile: -DHAVE_FLUSH in DEFINES list
or
b) at the bottom of genmake.log:
--> set HAVE_FLUSH='t'

Cheers,
Jean-Michel

On Tue, Aug 20, 2013 at 05:31:12PM +0000, Liam Brannigan wrote:
> Dear MITgcmers
>
> I was adding new diagnostics to the kpp package in my version of the model yesterday.  After making one small change and re-compiling, I have not been able to get the model to run.  This is causing me some heartache, as I'm getting very little information about why it won't run.  I'd appreciate any thoughts you have on further debugging options.
>
> The files I modified were kpp_diagnostics_init.F and kpp_routines.F (which I've modified in a similar way a number of times before).  I have tried restoring them to how they were before the re-compile, but to no avail.  I've also tried replacing them with copies from the website (taking care to get the right version number) and running with useKPP=.FALSE., in data.pkg, all to the same effect.
>
> When I submit the job to the Hector supercomputer, the mnc output folders are created and are populated with the grid, monitor, monitor_grid and state files at the initial time step.  This suggests that the model is initializing at least as far as WRITE_STATE and MONITOR (just about the last calls before the forward step begins) on the call tree http://mitgcm.org/sealion/code_reference/callTree.html.
> I also tried to get the model to output a diagnostic (drhodr) which is calculated early in the time step, but nothing happened.
>
> I ran the model with nIter=0, and it completed without producing exit code 136 in output.txt, suggesting that initialization is indeed working (the .e file stated 'NORMAL END' in this case).
>
> I'm not sure how to proceed further, as the STDERR and STDOUT files are empty after the model runs and the job files simply give a message like: "_pmiu_daemon(SIGCHLD): PE RANK 8 exit signal Floating point exception".
>
> Jean-Michel mentioned accessing further debug info in a previous developer post (http://mitgcm.org/pipermail/mitgcm-devel/2011-May/004763.html), but I'm not sure how this is switched on (or where the output can be accessed).  I tried adding debugLevel=4 to the PARM01 namelist in the data file, as suggested in the online manual, but nothing happened.
>
> Apologies for the flood of info, thanks for any advice you can offer!
>
> Liam
