[MITgcm-devel] problem with my last checkin

Jean-Michel Campin jmc at ocean.mit.edu
Tue Feb 5 12:51:10 EST 2008


Hi Chris,

Here is what I found in MITgcm/verification/fizhi-cs-aqualev20/input/eedata :
# unlimit the stack size for the FIZHI rad code
#EH3  useSETRLSTK is commented out (default: false) since, with the g77 
#EH3  compiler, it causes the model to hang *without* returning -- thus
#EH3  killing all our automated g77 testing on, for instance, the ACES 
#EH3  cluster.
#EH3  useSETRLSTK=.TRUE.,

I've just tried to run this test with g77+mpi on aces, with useSETRLSTK=.TRUE.,
and it seems to work: it does not fix the problem with g77+mpi on aces,
but it does not hang, and it finishes with the same error as without
useSETRLSTK=.TRUE.

But things are not completely clear, because I get a different error:
> forrtl: severe (36): attempt to access non-existent record, unit 9, file /home/jmc/gcm_current/verification/fizhi-cs-32x32x40/run/dxC1_dXYa.face001.bin
> Image              PC        Routine            Line        Source             
> mitgcmuv           083E380C  Unknown               Unknown  Unknown

which is not what it used to be:
> /usr/local/pkg/mpich/mpich-gcc/bin/mpirun: line 1: 29506 Segmentation fault 
We will see.

Jean-Michel

On Tue, Feb 05, 2008 at 10:59:44AM -0500, chris hill wrote:
> Hi All,
> 
>  In theory if you have
>  useSETRLSTK=.TRUE.,
>  in "eedata" the stack will get automatically unlimited
>  on Linux systems (and other systems that support Posix setrlimit() ).
>  genmake tests for setrlimit(), so that where it doesn't exist
>  it should be #ifdef'd out.
>  I remember there were some problems with this in the past, but I just 
> took a look and the code looks OK to me, so maybe we should try 
> activating this again?
> 
> Chris
> 
> Jean-Michel Campin wrote:
> >Hi Martin,
> >
> >On Tue, Feb 05, 2008 at 03:49:17PM +0100, Martin Losch wrote:
> >>No, I didn't, stupid me. The seg-fault goes away with "unlimit", but  
> >>still I don't see how my changes lead to a stack overflow.
> >>
> >>Also, is the "unlimit" taken care of with the automated tests?
> >Yes and No: it's among the first commands at the top of the script:
> >MITgcm/tools/example_scripts/faulks/test_csail: lines 19 & 20;
> >>#  Turn off stack limit for FIZHI
> >>ulimit -s unlimited
> >
> >Jean-Michel
> >
> >>Martin
> >>On 5 Feb 2008, at 15:33, Patrick Heimbach wrote:
> >>
> >>>Did you try the "usual" seg fault candidate first:
> >>>
> >>>unlimit
> >>>
> >>>-p.
> >>>
> >>>On Feb 5, 2008, at 9:06 AM, Martin Losch wrote:
> >>>
> >>>>Hi there,
> >>>>
> >>>>I have just checked in some functionality that is handy for
> >>>>rotated spherical grids (basically a few new scalar variables in
> >>>>PARAMS.h and a new subroutine that recomputes XC/YC/XG/YG in one
> >>>>special case). The new code does not change the verification
> >>>>experiments on my linux_ia32_g77 machine. But now I am rerunning
> >>>>testreport on hugo.csail.mit.edu with the same build_options_file
> >>>>and I get segmentation faults (linux_ia32_g77) for
> >>>>fizhi-cs-aqualev and fizhi-cs-32x32x40. Everything else is OK.
> >>>>
> >>>>I have tried to find the problem, but the segmentation fault
> >>>>happens somewhere within fizhi; here's the debugger output:
> >>>>>Program received signal SIGSEGV, Segmentation fault.
> >>>>>0x0805aca8 in solir_ (m=0xbf897574, n=0x8581468, ndim=0x8581468,  
> >>>>>np=0x85810e0, wh=0xbf114b50, taucld=0xbf24cde0,  
> >>>>>tauclb=0xbf1c8b80, tauclf=0xbf18cb70, reff=0xbf2c4df0,  
> >>>>>ict=0xbf897598, icb=0xbf89759c,
> >>>>>   fcld=0xbf574f40, cc=0xbf204b90, taual=0xbf210dd0,  
> >>>>>csm=0xbefdcad0, rsirbm=0xbf47aec0, rsirdf=0xbf478eb0,  
> >>>>>flx=0xbf42ee40, flc=0xbf3f0e30, fdirir=0xbf472e80,  
> >>>>>fdifir=0xbf470e70) at fizhi_swrad.f:1732
> >>>>>1732               ssaclt(i,j)=1.0
> >>>>>Current language:  auto; currently fortran
> >>>>>
> >>>>I have no idea what's going on, and I can't even run g77
> >>>>-fbounds-check, because in fizhi there are so many assignments where
> >>>>this array bounds check chokes, e.g. variable mndy(12,4) is accessed
> >>>>via DO I=1,48; mnc(I,1)=...; ENDDO, which is technically correct but
> >>>>makes it impossible to debug these files. What am I to do? Remove
> >>>>my changes again?
> >>>>
> >>>>Martin
> >>>>
> >>>>_______________________________________________
> >>>>MITgcm-devel mailing list
> >>>>MITgcm-devel at mitgcm.org
> >>>>http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>>---
> >>>Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
> >>>MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
> >>>FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
> >>>


