[MITgcm-devel] help with debugging flags during compilation

Matthew Mazloff mmazloff at ucsd.edu
Thu Mar 1 23:37:39 EST 2018


Hi An

This seems so familiar to me!  I really think this is a memory issue and not related to mitgcm code. The machine thinks you need too much memory and kills your job.

Are you using a PBS job scheduler? On the last platform I ran on I had jobs crashing randomly all the time until I added:
#PBS -l pvmem=299gb
Never a problem again after adding that one line!
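
A minimal sketch of where such a line would sit in a PBS job script. Only the pvmem directive is taken from the message above; the job name, node and core counts, walltime, launcher, and the file name run_mitgcm.pbs are placeholders assumed for illustration, and whether pvmem is honored depends on how the site has configured its scheduler:

```shell
#!/bin/sh
# Write a hypothetical PBS job script to run_mitgcm.pbs (file name is an assumption).
# The quoted 'EOF' keeps $PBS_O_WORKDIR literal so it is expanded when the job runs.
cat > run_mitgcm.pbs <<'EOF'
#!/bin/bash
# Placeholder job name, node/core counts and walltime: adjust for your machine.
#PBS -N mitgcm_ad
#PBS -l nodes=4:ppn=48
#PBS -l walltime=01:00:00
# The memory request quoted in the message above:
#PBS -l pvmem=299gb
cd "$PBS_O_WORKDIR"
# Placeholder launcher and rank count (4 nodes x 48 cores).
mpirun -np 192 ./mitgcmuv_ad
EOF
echo "wrote run_mitgcm.pbs; submit with: qsub run_mitgcm.pbs"
```

The script is only written to disk here, not submitted, so it can be inspected before trying it with qsub.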

Matt




> On Mar 1, 2018, at 6:05 AM, An Nguyen <antnguyen13 at gmail.com> wrote:
> 
> hi Jean-Michel, Matt, and David,
> 
> Just thought I'd update you on this.  We're still crashing, but at least now, after turning on debugMode=.TRUE., I think we've narrowed it down to eeboot_minimal.f (still nothing in STDOUT or STDERR, but it prints the error to a .err file). I'm including the error below in case you have encountered it before and have more pointers for us on how to fix it.  I've changed nchklev_1 to a very small number (=6) and asked for two extra nodes (= 2x48 more CPUs, 2x192GB more memory), and it's still crashing with the same error.  Perhaps there's something I'm missing in eeboot?  (Although Arash has just now reported that he succeeded in getting an STDOUT file and it's barely working, with even more requested memory, so it seems to require quite a bit more memory than we thought.)
> 
> Matt, to answer your question, I re-configured my regional domain to be based on the llc90 grid (regional, with obcs, identical setup, just much smaller) and it runs OK.  It's just the higher-res (llc270-based) setup that we're encountering this with on the TACC machines.
> 
> cheers,
> An
> 
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> libifcoremt.so.5   00002B24B48FFECF  for__signal_handl     Unknown  Unknown
> libpthread-2.17.s  00002B24B6E245E0  Unknown               Unknown  Unknown
> libpthread-2.17.s  00002B24B6E236EE  read                  Unknown  Unknown
> libmpi.so.12.0     00002B24B34E7919  Unknown               Unknown  Unknown
> libmpi.so.12.0     00002B24B34E128B  Unknown               Unknown  Unknown
> libmpi.so.12.0     00002B24B34E6E91  Unknown               Unknown  Unknown
> libmpi.so.12.0     00002B24B3387F61  Unknown               Unknown  Unknown
> libmpi.so.12.0     00002B24B331E54B  Unknown               Unknown  Unknown
> libmpi.so.12       00002B24B330BC4B  MPI_Init              Unknown  Unknown
> libmpifort.so.12.  00002B24B3DE5240  MPI_INIT              Unknown  Unknown
> mitgcmuv_ad        00000000009336CD  eeboot_minimal_          1598  eeboot_minimal.f
> mitgcmuv_ad        0000000000933578  eeboot_                  1627  eeboot.f
> mitgcmuv_ad        0000000000985007  MAIN__                   4453  main.f
> mitgcmuv_ad        000000000042D2EE  Unknown               Unknown  Unknown
> libc-2.17.so       00002B24B7256C05  __libc_start_main     Unknown  Unknown
> mitgcmuv_ad        000000000042D1D9  Unknown               Unknown  Unknown
> 
>> On Feb 28, 2018, at 8:20 AM, Jean-Michel Campin <jmc at mit.edu> wrote:
>> 
>> Hi,
>> 
>> Just to clarify: debugMode=.TRUE. in eedata results in flushing buffer IO to
>> STDOUT & STDERR any time something is written.
>> This should prevent getting empty STDOUT and STDERR files when something
>> has been written.
>> 
>> Cheers,
>> Jean-Michel
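
A minimal sketch of an eedata file with this flag set. Only debugMode=.TRUE. comes from the message above; the nTx/nTy entries are the usual thread settings, assumed here to match a single-threaded setup, and the file name eedata.debug_example is a placeholder chosen so that an existing eedata is not overwritten:

```shell
#!/bin/sh
# Write an example eedata namelist to a scratch file (file name is an assumption).
cat > eedata.debug_example <<'EOF'
# Example "eedata" with debug-mode flushing turned on
# nTx, nTy - number of threads per process in X and Y (assumed 1 here)
 &EEPARMS
 nTx=1,
 nTy=1,
 debugMode=.TRUE.,
 /
EOF
# Show the line that matters.
grep -n 'debugMode' eedata.debug_example
```

Copy the debugMode line into the eedata in your run directory rather than replacing the whole file.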
>> 
>> On Wed, Feb 28, 2018 at 12:14:58PM +0000, David Ferreira wrote:
>>> It might be worth running interactively if you can. On one of the machines I use, STDOUT and STDERR are filled with a bit of lag, and sometimes after a crash they remain completely empty even though I suspect something was written. In interactive mode, you may be able to see whether something is indeed written.
>>> cheers,
>>> david
>>> 
>>> ________________________________
>>> From: MITgcm-devel [mitgcm-devel-bounces at mitgcm.org] on behalf of Matthew Mazloff [mmazloff at ucsd.edu]
>>> Sent: Tuesday, February 27, 2018 11:07 PM
>>> To: <MITgcm-devel at mitgcm.org>
>>> Cc: Bigdeli, Arash
>>> Subject: Re: [MITgcm-devel] help with debugging flags during compilation
>>> 
>>> Hi An
>>> 
>>> In my experience this is always due to requesting too much memory (or a node thinking you are requesting too much memory). Does this ever happen when you run smaller setups? (Or, perhaps, when you run the same setup but with half the vertical levels?)
>>> 
>>> -Matt
>>> 
>>> 
>>> 
>>> 
>>> On Feb 27, 2018, at 5:41 PM, An Nguyen <antnguyen13 at gmail.com> wrote:
>>> 
>>> hi Jean-Michel,
>>> 
>>> Thank you for the suggestion. No, I did not run with debugMode=.TRUE. in eedata; I might try that now.  I will use -convert big_endian, -assume byterecl, and -mcmodel=large.
>>> 
>>> I think the problem we're encountering here is that it's "not working", and we just cannot figure out where mitgcmuv fails, or whether it is mitgcmuv itself or the infrastructure of the computing node that fails.  We do not get any message at all, just an "exit" message from the nodes: no STDOUT, STDERR, .err, or .out files to help us understand where the problem is.  So I would like to run mitgcmuv (and primarily its _ad version, where we have the issue) in debugging mode to see if I can go line by line until it crashes, to try to narrow down where the problem is.
>>> 
>>> Thanks,
>>> An
>>> 
>>> 
>>> On Feb 27, 2018, at 8:09 PM, Jean-Michel Campin <jmc at mit.edu> wrote:
>>> 
>>> Hi An,
>>> 
>>> I am not sure I understand correctly:
>>> Trying to use the simplest CFLAGS, FFLAGS, FOPTIM setting might not be
>>> the best to compile and run.
>>> For instance, without:
>>> -convert big_endian -assume byterecl
>>> it will not be able to read any "big-endian" binary, which could be a problem.
>>> Or without:
>>> -mcmodel=medium
>>> it might not run if the memory footprint is too large.
>>> 
>>> I found that, generally, once I load the right module and set the environment variable
>>> MPI_INC_DIR to the correct location,
>>> I can use one of the 3 "standard" optfiles:
>>> linux_amd64_gfortran
>>> linux_amd64_ifort11
>>> linux_amd64_ifort+impi
>>> and it just works.
>>> And the improvement that I would get by fine tuning of some of the compiler
>>> options is not very significant (and in my case, not worth spending too much time).
>>> 
>>> And the nice thing about these ones is that you could use the "-ieee" or "-devel"
>>> genmake2 option to turn off all optimisation (-ieee) or even turn on
>>> all debug options (-devel).
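
A minimal sketch of such a genmake2 call, assuming a build directory three levels below the MITgcm root with local modifications in ../code (both paths are placeholders), and assuming MPI_INC_DIR has already been set as described above. It only prints the command unless RUN is set to an empty string:

```shell
#!/bin/sh
# Hypothetical paths: adjust ROOTDIR to your own MITgcm checkout.
ROOTDIR="${ROOTDIR:-../../..}"
OPTFILE="$ROOTDIR/tools/build_options/linux_amd64_ifort+impi"

# -devel turns on the debug options; swap it for -ieee to only
# switch optimisation off. Paths containing spaces are not handled.
GENMAKE_ARGS="-mods=../code -of=$OPTFILE -mpi -devel"

# Dry run by default (prints the command).
# Run with RUN set to an empty string to actually execute genmake2.
RUN="${RUN-echo}"
$RUN "$ROOTDIR/tools/genmake2" $GENMAKE_ARGS
```

After genmake2 has produced a Makefile, the usual make depend and make steps would follow.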
>>> 
>>> Now in your case, are you running with debugMode=.TRUE. in eedata?
>>> 
>>> Cheers,
>>> Jean-Michel
>>> 
>>> On Tue, Feb 27, 2018 at 04:24:04PM -0500, An Nguyen wrote:
>>> Hello mitgcm gurus,
>>> 
>>> For the last 2 years I've been having problems with the mitgcm crashing right out of the gate without producing any STDOUT or STDERR or any message at all on the various computing nodes we have at UT Austin, and the only help tech support has been able to provide so far is a message saying that we need to do our own debugging.
>>> 
>>> I'd like to ask for some help with the flags needed to remove all optimization options when compiling the mitgcm (with ifort) in order to run it in a debugger such as dbg or the ddt tool they suggested (https://portal.tacc.utexas.edu/software/ddt).  The genmake2 flags are very elaborate, with various options for CFLAGS, FFLAGS, and FOPTIM that are really beyond my comprehension (I spent 5 hours last night fiddling around and didn't succeed), so I'm hoping to get some suggestions on what I need to set: would stripping ALL CFLAGS, FFLAGS, and FOPTIM and only using
>>> 
>>> FFLAGS='-g'
>>> FOPTIM='-O0'
>>> CFLAGS=''
>>> 
>>> be an option?  Any suggestions, including a sample, very stripped-down optfile, would be very helpful for me.  I can provide more information (within my comprehension) if needed.
>>> 
>>> Many thanks,
>>> An
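
An untested sketch of what such a stripped-down optfile might look like, combining the -g / -O0 settings asked about above with the I/O and memory-model flags named elsewhere in the thread. The compiler wrapper names, the CPP and DEFINES lines, and the -shared-intel link flag are assumptions modelled on typical Intel setups, and MPI include handling is left out; the standard optfiles with the -devel option remain the safer route:

```shell
#!/bin/sh
# Hypothetical minimal debug optfile for ifort with Intel MPI (untested sketch).
# Wrapper names are assumptions; use whatever your loaded modules provide.
FC=mpiifort
CC=mpiicc
# -shared-intel is assumed here because it usually accompanies -mcmodel=medium.
LINK="$FC -shared-intel"
CPP='cpp -traditional -P'
DEFINES='-DWORDLENGTH=4'

# -g for debug symbols, plus the flags named in the thread:
# big-endian binary I/O and a memory model for a large memory footprint.
FFLAGS='-g -convert big_endian -assume byterecl -mcmodel=medium'
# No optimisation, so a debugger can step through line by line.
FOPTIM='-O0'
CFLAGS='-g -O0'
```

It would be passed to genmake2 in place of one of the standard optfiles.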
> 
