<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">hi Matt,<div class=""><br class=""></div><div class="">We got it to run now, by heavily reducing nchklev_1 so that it uses at most 2GB/cpu (4GB/cpu available).</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">We're on the TACC machine, using their SBATCH system. What you mention seems to be the way to go: asking for the memory up front! Do you know the equivalent for SBATCH (instead of PBS)? </div><div class=""><br class=""></div><div class="">I've now figured out that the limiting factor is the 1st node (but we already knew that from our experience with the NASA computers). We're also hoping to find the syntax to allocate only a very small number of cpus (even just 1) on the entire 1st node, then partition the cpus equally across the rest of the nodes. We don't know yet if this is possible on the TACC machine.</div><div class=""><br class=""></div><div class="">On a last note, it seems that TACC has a lot of cpus but not much memory, so this might be the way to go: set the tiles as small as we can, to reduce the memory footprint.</div><div class=""><br class=""></div><div class="">cheers,</div><div class="">An</div><div class=""><br class=""></div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Mar 1, 2018, at 11:37 PM, Matthew Mazloff <<a href="mailto:mmazloff@ucsd.edu" class="">mmazloff@ucsd.edu</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hi An<div class=""><br class=""></div><div class="">This seems so familiar to me! I really think this is a memory issue and not related to mitgcm code. 
The machine thinks you need too much memory and kills your job.</div><div class=""><br class=""></div><div class=""><span style="background-color: rgb(255, 255, 255);" class="">Are you using a PBS job scheduler? On the last platform I ran on I had jobs crashing randomly all the time until I added:</span></div><span style="background-color: rgb(255, 255, 255);" class="">#PBS -l pvmem=299gb</span><div class="">Never a problem again after adding that one line!</div><div class=""><br class=""></div><div class="">Matt</div><div class=""><br class=""></div><div class=""><br style="background-color: rgb(255, 255, 255);" class=""><div class=""><div class=""><br class=""></div><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Mar 1, 2018, at 6:05 AM, An Nguyen <<a href="mailto:antnguyen13@gmail.com" class="">antnguyen13@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">hi Jean-Michel, Matt, and David,<br class=""><br class="">Just thought I'd update you on this. We're still crashing, but at least now, after turning on debugMode=.TRUE., I think we've narrowed it down to eeboot_minimal.f (still nothing to STDOUT or STDERR, but it prints the error to a .err file). I'm including the error below in case you have encountered it before and have more pointers for us on how to fix it. I've changed nchklev_1 to a very small number (=6), and asked for two extra nodes (= 2x48 more cpus, 2x192GB more memory), and it's still crashing with the same error. Perhaps there's something I'm missing in eeboot? (Although Arash has just now reported he had succeeded in getting an STDOUT file and it's barely working, with even more requested memory, so it seems to require quite a bit more memory than we thought (?)).<br class=""><br class="">Matt, to answer your question, I re-configured my regional domain to be based on the llc90 grid (regional, with obcs, identical setup, just much smaller) and it runs OK. 
It's just the higher res (llc270-based) setup that we're encountering this with on the TACC machines.<br class=""><br class="">cheers,<br class="">An<br class=""><br class="">forrtl: error (78): process killed (SIGTERM)<br class="">Image PC Routine Line Source<br class="">libifcoremt.so.5 00002B24B48FFECF for__signal_handl Unknown Unknown<br class="">libpthread-2.17.s 00002B24B6E245E0 Unknown Unknown Unknown<br class="">libpthread-2.17.s 00002B24B6E236EE read Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E7919 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E128B Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B34E6E91 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B3387F61 Unknown Unknown Unknown<br class="">libmpi.so.12.0 00002B24B331E54B Unknown Unknown Unknown<br class="">libmpi.so.12 00002B24B330BC4B MPI_Init Unknown Unknown<br class="">libmpifort.so.12. 00002B24B3DE5240 MPI_INIT Unknown Unknown<br class="">mitgcmuv_ad 00000000009336CD eeboot_minimal_ 1598 eeboot_minimal.f<br class="">mitgcmuv_ad 0000000000933578 eeboot_ 1627 eeboot.f<br class="">mitgcmuv_ad 0000000000985007 MAIN__ 4453 main.f<br class="">mitgcmuv_ad 000000000042D2EE Unknown Unknown Unknown<br class="">libc-2.17.so 00002B24B7256C05 __libc_start_main Unknown Unknown<br class="">mitgcmuv_ad 000000000042D1D9 Unknown Unknown Unknown<br class=""><br class=""></div></div></blockquote></div></div></div></div></div></blockquote></div><br class=""></div></body></html>