<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hi An<div class=""><br class=""></div><div class="">This seems so familiar to me! I really think this is a memory issue and not related to the MITgcm code. The machine thinks you need too much memory and kills your job.</div><div class=""><br class=""></div><div class=""><span style="background-color: rgb(255, 255, 255);" class="">Are you using a PBS job scheduler? On the last platform I ran on, I had jobs crashing randomly all the time until I added:</span></div><span style="background-color: rgb(255, 255, 255);" class="">#PBS -l pvmem=299gb</span><div class="">Never a problem again after adding that one line!</div><div class=""><br class=""></div><div class="">Matt</div><div class=""><br class=""></div><div class=""><br style="background-color: rgb(255, 255, 255);" class=""><div class=""><div><br class=""></div><div><br class=""><blockquote type="cite" class=""><div class="">On Mar 1, 2018, at 6:05 AM, An Nguyen <<a href="mailto:antnguyen13@gmail.com" class="">antnguyen13@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">hi Jean-Michel, Matt, and David,<br class=""><br class="">Just thought I'd update you on this. We're still crashing, but at least now, after turning on debugMode=.TRUE., I think we've narrowed it down to eeboot_minimal.f (still nothing to STDOUT or STDERR, but it prints the error to a .err file). I'm including the error below in case you have encountered it before and have more pointers for us on how to fix it. I've changed nchklev_1 to a very small number (=6), and asked for two extra nodes (= 2x48 more CPUs, 2x192GB more memory), and it's still crashing with the same error. Perhaps there's something I'm missing in eeboot? 
(Although Arash has just now reported that he succeeded in getting a STDOUT file and it's barely working, with even more requested memory, so it seems to require quite a bit more memory than we thought (?).)<br class=""><br class="">Matt, to answer your question, I re-configured my regional domain to be based on the llc90 grid (regional, with obcs, identical setup, just much smaller) and it runs OK. It's just the higher-res (llc270-based) setup that we're encountering this with on the TACC machines.<br class=""><br class="">cheers,<br class="">An<br class=""><br class="">forrtl: error (78): process killed (SIGTERM)<br class="">Image PC Routine Line Source<br class="">libifcoremt.so.5 00002B24B48FFECF for__signal_handl Unknown Unknown<br class="">libpthread-2.17.s 00002B24B6E245E0 Unknown Unknown Unknown<br class="">libpthread-2.17.s 00002B24B6E236EE read Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E7919 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E128B Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E6E91 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B3387F61 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B331E54B Unknown Unknown Unknown<br class="">libmpi.so.12 00002B24B330BC4B MPI_Init Unknown Unknown<br class="">libmpifort.so.12. 
00002B24B3DE5240 MPI_INIT Unknown Unknown<br class="">mitgcmuv_ad 00000000009336CD eeboot_minimal_ 1598 eeboot_minimal.f<br class="">mitgcmuv_ad 0000000000933578 eeboot_ 1627 eeboot.f<br class="">mitgcmuv_ad 0000000000985007 MAIN__ 4453 main.f<br class="">mitgcmuv_ad 000000000042D2EE Unknown Unknown Unknown<br class="">libc-2.17.so 00002B24B7256C05 __libc_start_main Unknown Unknown<br class="">mitgcmuv_ad 000000000042D1D9 Unknown Unknown Unknown<br class=""><br class=""><blockquote type="cite" class="">On Feb 28, 2018, at 8:20 AM, Jean-Michel Campin <<a href="mailto:jmc@mit.edu" class="">jmc@mit.edu</a>> wrote:<br class=""><br class="">Hi,<br class=""><br class="">Just to clarify: debugMode=.TRUE. in eedata results in flushing buffer IO to<br class="">STDOUT & STDERR any time something is written.<br class="">This should prevent getting empty STDOUT and STDERR files when something<br class="">has been written.<br class=""><br class="">Cheers,<br class="">Jean-Michel<br class=""><br class="">On Wed, Feb 28, 2018 at 12:14:58PM +0000, David Ferreira wrote:<br class=""><blockquote type="cite" class="">It might be worth running interactively if you can. On one of the machines I use, the STDOUT and STDERR are filled up with a bit of lag, and sometimes in case of a crash they remain completely empty even though I suspect something happened. 
In interactive mode, you may see if indeed something happens.<br class="">cheers,<br class="">david<br class=""><br class="">________________________________<br class="">From: MITgcm-devel [<a href="mailto:mitgcm-devel-bounces@mitgcm.org" class="">mitgcm-devel-bounces@mitgcm.org</a>] on behalf of Matthew Mazloff [<a href="mailto:mmazloff@ucsd.edu" class="">mmazloff@ucsd.edu</a>]<br class="">Sent: Tuesday, February 27, 2018 11:07 PM<br class="">To: <<a href="mailto:MITgcm-devel@mitgcm.org" class="">MITgcm-devel@mitgcm.org</a>><br class="">Cc: Bigdeli, Arash<br class="">Subject: Re: [MITgcm-devel] help with debugging flags during compilation<br class=""><br class="">Hi An<br class=""><br class="">In my experience this is always due to requesting too much memory (or a node thinking you are requesting too much memory). Does this ever happen when you run smaller setups? (Or, perhaps, when you run the same setup but with half the vertical levels?)<br class=""><br class="">-Matt<br class=""><br class=""><br class=""><br class=""><br class="">On Feb 27, 2018, at 5:41 PM, An Nguyen <<a href="mailto:antnguyen13@gmail.com" class="">antnguyen13@gmail.com</a>> wrote:<br class=""><br class="">hi Jean-Michel,<br class=""><br class="">Thank you for the suggestion. No, I did not run with debugMode=.TRUE. in eedata; I might try that now. I will use -convert big_endian, -assume byterecl, and -mcmodel=large.<br class=""><br class="">I think the problem we're encountering here is that it's "not working", and we just cannot figure out where mitgcmuv fails, or whether it is mitgcmuv itself or the infrastructure of the computing node that fails. We do not get any message at all, just an "exit" message from the nodes, with no STDOUT, STDERR, .err, or .out files to help us understand where the problem is. 
So I would like to run mitgcmuv (and primarily its _ad version, where we have the issue) in debugging mode to see if I can go line by line until it crashes, to narrow down where the problem is.<br class=""><br class="">Thanks,<br class="">An<br class=""><br class=""><br class="">On Feb 27, 2018, at 8:09 PM, Jean-Michel Campin <<a href="mailto:jmc@mit.edu" class="">jmc@mit.edu</a>> wrote:<br class=""><br class="">Hi An,<br class=""><br class="">I am not sure I understand correctly:<br class="">Trying to use the simplest CFLAGS, FFLAGS, FOPTIM settings might not be<br class="">the best way to compile and run.<br class="">For instance, without:<br class="">-convert big_endian -assume byterecl<br class="">it will not be able to read any "big-endian" binary, which could be a problem.<br class="">Or without:<br class="">-mcmodel=medium<br class="">it might not run if the memory footprint is too large.<br class=""><br class="">I found that, generally, once I load the right module and set the environment variable<br class="">MPI_INC_DIR to the correct location,<br class="">I can use one of the 3 "standard" optfiles:<br class="">linux_amd64_gfortran<br class="">linux_amd64_ifort11<br class="">linux_amd64_ifort+impi<br class="">and it just works.<br class="">And the improvement that I would get by fine-tuning some of the compiler<br class="">options is not very significant (and in my case, not worth spending too much time on).<br class=""><br class="">And the nice thing about these is that you can use the "-ieee" or "-devel"<br class="">genmake2 option to turn off all optimisation (-ieee) or even turn on<br class="">all debug options (-devel).<br class=""><br class="">Now in your case, are you running with debugMode=.TRUE. in eedata?<br class=""><br class="">Cheers,<br class="">Jean-Michel<br class=""><br class="">On Tue, Feb 27, 2018 at 04:24:04PM -0500, An Nguyen wrote:<br class="">Hello mitgcm gurus,<br 
class=""><br class="">For the last 2 years I've been having problems with MITgcm crashing right out of the gate without producing any STDOUT or STDERR or any message at all on the various computing nodes we have at UT Austin, and the only help tech support has been able to provide so far is a message saying that we need to do our own debugging.<br class=""><br class="">I'd like to ask for some help with the flags needed to remove all optimization options when compiling MITgcm (with ifort) in order to enter debug mode such as dbg or this ddt tool they suggested (<a href="https://portal.tacc.utexas.edu/software/ddt" class="">https://portal.tacc.utexas.edu/software/ddt</a>). The various genmake2 flags are very elaborate, with many options for CFLAGS, FFLAGS, and FOPTIM that are really beyond my comprehension (I spent 5 hours last night fiddling around and didn't succeed), so I'm hoping to get some suggestions on what I need to set: would stripping ALL CFLAGS, FFLAGS, and FOPTIM and using only<br class=""><br class="">FFLAGS='-g'<br class="">FOPTIM='-O0'<br class="">CFLAGS=''<br class=""><br class="">be an option? Any suggestion, including a sample very-stripped-down optfile, would be very helpful for me. I can provide more information (within my comprehension) if needed.<br class=""><br class="">Many thanks,<br class="">An<br class=""></blockquote></blockquote><br class="">_______________________________________________<br class="">MITgcm-devel mailing list<br class=""><a href="mailto:MITgcm-devel@mitgcm.org" class="">MITgcm-devel@mitgcm.org</a><br class="">http://mailman.mitgcm.org/mailman/listinfo/mitgcm-devel<br class=""></div></div></blockquote></div><br class=""></div></div></body></html>