[MITgcm-support] segmentation fault

Matthew Mazloff mmazloff at ucsd.edu
Fri Jun 1 12:56:53 EDT 2018


and (probably redundant but)

6. Make sure the *.F files in your code folder are compatible with the code tree; you can get this error if the list of variables passed to a subroutine is not consistent with the list of variables the subroutine expects to receive.
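
A hedged way to have the compiler catch such a mismatch is to turn on interface checking when recompiling; this is only a sketch using Intel flags, not part of the standard MITgcm optfiles:

    # append to FFLAGS in your optfile before recompiling; ifort then generates
    # interface blocks and warns when a call site does not match the callee
    FFLAGS="$FFLAGS -gen-interfaces -warn interfaces"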

Matt

> On Jun 1, 2018, at 12:38 AM, Martin Losch <Martin.Losch at awi.de> wrote:
> 
> Hi Andreas,
> 
> I think this is a new error related to your new MPI implementation. I think you cannot have a myThid of 0; it means you have no threads, basically no model to communicate with, or something like that.
> 
> Going back to your original message (assuming that the seg-fault happens in the first time step), and if this is not a size issue, it is likely that the compiler optimization screws things up. This is what I would do (you have probably tried some of these things already, I didn't follow the thread closely):
> 
> 1. compile (and run) with a different level of optimization. Your optfile (I am assuming that you use linux_amd64_ifort) has
>     FOPTIM='-O2 -align -xW -ip'
> Set it to '-O0'. If that works, try -O1 etc., adding back all options. If it runs with lower optimization, then:
> 2. since your original seg-fault happens in mom_calc_visc.F, put this routine in the list of NOOPTFILES (currently empty) and adjust NOOPTFLAGS for this routine only (see the optfile sketch after this list). I often have -O1 instead of -O2 for a couple of routines on some HPC platforms, which will not really hamper your performance.
> 3. for testing, change your number of vertical layers Nr to 151 or 149 or something completely different.
> 4. what happened to the code that you generated with the -devel option? That includes all sorts of debugging options (also -O0), and you can try to run the model with a debugger (I know it's a pain and I never do this myself, but sometimes it can help track down the problem).
> 5. try compiling without the Leith code (define AUTODIFF_DISABLE_LEITH and/or AUTODIFF_DISABLE_REYNOLDS_SCALE)
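> 
> For points 1 and 2, a minimal sketch of how the relevant lines in a local copy of the linux_amd64_ifort optfile could look (the exact flag combinations here are assumptions; keep whatever else your optfile already sets):
> 
>     # lower the global optimization first; if -O0 runs, add options back one by one
>     FOPTIM='-O0 -align'
>     # or keep FOPTIM as it is and only de-optimize the offending routine
>     NOOPTFILES='mom_calc_visc.F'
>     NOOPTFLAGS='-O1'
> 
> and then rerun genmake2 with "-optfile /path/to/your_optfile" and rebuild from scratch (make Clean; make depend; make).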
> 
> Martin
> 
> 
>> On 1. Jun 2018, at 03:55, Andreas Klocker <andreas.klocker at utas.edu.au> wrote:
>> 
>> Hi guys,
>> 
>> After becoming really desperate to get this going without success, I have tried different openmpi (1.6.3) and intel-fc (12.1.9.293) versions, and I finally managed to get an MITgcm error message before the segmentation fault (including a beautiful copy/paste spelling mistake ;)).
>> 
>> mitgcm.err says:
>> ABNROMAL END: S/R BARRIER
>> ABNROMAL END: S/R BARRIER
>> ABNROMAL END: S/R BARRIER
>> ABNROMAL END: S/R BARRIER
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine            Line        Source             
>> libirc.so          00002B73C04B32C9  Unknown               Unknown  Unknown
>> libirc.so          00002B73C04B1B9E  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002B73C2CEB13C  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002B73C2C5A2A2  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002B73C2C6B0F0  Unknown               Unknown  Unknown
>> libpthread.so.0    00002B73C36157E0  Unknown               Unknown  Unknown
>> .                  00000000004E009D  Unknown               Unknown  Unknown
>> .                  000000000041E2D3  Unknown               Unknown  Unknown
>> .                  00000000005B5658  Unknown               Unknown  Unknown
>> ABNROMAL END: S/R BARRIER
>> ABNROMAL END: S/R BARRIER
>> ABNROMAL END: S/R BARRIER
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine            Line        Source             
>> libirc.so          00002AE9EBD362C9  Unknown               Unknown  Unknown
>> libirc.so          00002AE9EBD34B9E  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002AE9EE56E13C  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002AE9EE4DD2A2  Unknown               Unknown  Unknown
>> libifcoremt.so.5   00002AE9EE4EE0F0  Unknown               Unknown  Unknown
>> libpthread.so.0    00002AE9EEE987E0  Unknown               Unknown  Unknown
>> .                  00000000004C5ADB  Unknown               Unknown  Unknown
>> .                  000000000041C165  Unknown               Unknown  Unknown
>> .                  00000000005B5658  Unknown               Unknown  Unknown
>> 
>> And mitgcm.out:
>> 
>> bash-4.1$ more mitgcm.out 
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>>           1
>> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
>> !!!!!!! PANIC !!!!!!! in S/R BARRIER  myThid =            0  nThreads = 
>> 
>> I'm struggling to figure out what this means, though, since that part of the code is far beyond my understanding... but I'm worried about the amount of "PANIC" in there!
>> Has anyone got any suggestions?
>> 
>> cheers,
>> 
>> Andreas
>> 
>> 
>> 
>> On 19/05/18 01:37, Patrick Heimbach wrote:
>>> Hi Andreas,
>>> 
>>> there is a small chance (and a bit of a guess) that one of the following might do the trick
>>> (we have had memory-related issues when running the adjoint):
>>> 
>>> In your shell script or batch job, add
>>> ulimit -s unlimited
>>> 
>>> If compiling with ifort, you could try -mcmodel=medium
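>>> 
>>> A minimal sketch of how these two suggestions could be applied (the surrounding batch-script lines are hypothetical; adapt them to your scheduler and optfile):
>>> 
>>>     # in the run/batch script, before launching the executable:
>>>     ulimit -s unlimited           # remove the shell's stack-size limit
>>>     mpirun -np 64 ./mitgcmuv      # hypothetical launch line; use your own
>>> 
>>>     # in the ifort optfile, if the static arrays outgrow the small memory model:
>>>     FFLAGS="$FFLAGS -mcmodel=medium"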
>>> 
>>> See mitgcm-support thread, e.g.
>>> 
>>> http://mailman.mitgcm.org/pipermail/mitgcm-support/2005-October/003505.html
>>> 
>>> (you'll need to scroll down to input provided by Constantinos Evangelinos).
>>> 
>>> p.
>>> 
>>> 
>>>> On May 18, 2018, at 8:04 AM, Dimitris Menemenlis <menemenlis at jpl.nasa.gov>
>>>> wrote:
>>>> 
>>>> Andreas, I have done something similar quite a few times (i.e., increased horizontal and/or vertical resolution in regional domains with obcs cut out from a global set-up) and did not have the same issue. If helpful, I can dig out and commit to contrib some examples that you can compare your set-up against. Actually, you remind me that I already promised to do this for Gael, but it fell off the bottom of my todo list :-(
>>>> 
>>>> Do you have any custom routines in your "code" directory?  Have you tried compiling and linking with array-bound checks turned on?
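>>>> 
>>>> A sketch of one way to do that with ifort (flag names are from the Intel compiler; genmake2's -devel option achieves something similar and also lowers optimization):
>>>> 
>>>>     # append to FFLAGS in the optfile, then rebuild from scratch
>>>>     FFLAGS="$FFLAGS -g -check bounds -traceback"
>>>> 
>>>> With -traceback the seg-fault report should show routine names and line numbers instead of "Unknown".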
>>>> 
>>>> Dimitris Menemenlis
>>>> On 05/17/2018 11:45 PM, Andreas Klocker wrote:
>>>> 
>>>>> Matt,
>>>>> I cut all the unnecessary packages and still have the same issue.
>>>>> I also checked 'size mitgcmuv' and compared it to other runs which work fine - it asks for about half the size (same machine, same queue, same compiling options, etc.).
>>>>> The tiles are already down to 32x32 grid points, and I'm happily running configurations with a tile size almost twice as big and the same number of vertical layers.
>>>>> I will try some different tile sizes, but I think the problem must be somewhere else...
>>>>> Andreas
>>>>> On 18/05/18 00:35, Matthew Mazloff wrote:
>>>>> 
>>>>>> Sounds like a memory issue. I think your executable has become too big for your machine. You will need to reduce the tile size or do something else (e.g., reduce the number of diagnostics or cut a package)
>>>>>> 
>>>>>> and check
>>>>>> size mitgcmuv
>>>>>> to get a ballpark idea of how much memory you are requesting
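>>>>>> 
>>>>>> For example, comparing against an executable that is known to run (paths are hypothetical):
>>>>>> 
>>>>>>     size mitgcmuv                  # the bss column holds the statically allocated model arrays
>>>>>>     size ../working_run/mitgcmuv   # a working configuration, for comparison
>>>>>> 
>>>>>> and compare the bss/total figures with the memory available per node.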
>>>>>> 
>>>>>> Matt
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 16, 2018, at 11:27 PM, Andreas Klocker <andreas.klocker at utas.edu.au> wrote:
>>>>>>> 
>>>>>>> Hi guys,
>>>>>>> 
>>>>>>> I've taken a working 1/24 degree nested simulation (of Drake Passage)
>>>>>>> with 42 vertical layers and tried to increase the vertical layers to 150
>>>>>>> (without changing anything else apart from obviously my boundary files
>>>>>>> for OBCS and recompiling with 150 vertical layers). Suddenly I get the
>>>>>>> following error message:
>>>>>>> 
>>>>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>>>>> Image              PC                Routine Line        Source
>>>>>>> libirc.so          00002BA1704BC2C9  Unknown Unknown  Unknown
>>>>>>> libirc.so          00002BA1704BAB9E  Unknown Unknown  Unknown
>>>>>>> libifcore.so.5     00002BA1722B5F3F  Unknown Unknown  Unknown
>>>>>>> libifcore.so.5     00002BA17221DD7F  Unknown Unknown  Unknown
>>>>>>> libifcore.so.5     00002BA17222EF43  Unknown Unknown  Unknown
>>>>>>> libpthread.so.0    00002BA1733B27E0  Unknown Unknown  Unknown
>>>>>>> mitgcmuv_drake24_  00000000004E61BC  mom_calc_visc_ 3345 mom_calc_visc.f
>>>>>>> mitgcmuv_drake24_  0000000000415127  mom_vecinv_ 3453 mom_vecinv.f
>>>>>>> mitgcmuv_drake24_  0000000000601C33  dynamics_ 3426  dynamics.f
>>>>>>> mitgcmuv_drake24_  0000000000613C2B  forward_step_ 2229 forward_step.f
>>>>>>> mitgcmuv_drake24_  000000000064581E  main_do_loop_ 1886 main_do_loop.f
>>>>>>> mitgcmuv_drake24_  000000000065E500  the_main_loop_ 1904 the_main_loop.f
>>>>>>> mitgcmuv_drake24_  000000000065E6AE  the_model_main_ 2394 the_model_main.f
>>>>>>> mitgcmuv_drake24_  00000000005C6439  MAIN__ 3870  main.f
>>>>>>> mitgcmuv_drake24_  0000000000406776  Unknown Unknown  Unknown
>>>>>>> libc.so.6          00002BA1737E2D1D  Unknown Unknown  Unknown
>>>>>>> mitgcmuv_drake24_  0000000000406669  Unknown Unknown  Unknown
>>>>>>> 
>>>>>>> At first this error pointed to a line in mom_calc_visc.f on which
>>>>>>> calculations for the Leith viscosity are done. As a test I then
>>>>>>> used a Smagorinsky viscosity instead, and now it crashes with the same
>>>>>>> error, but pointing to a line where the Smagorinsky calculations are done. I
>>>>>>> assume I must be chasing a much more fundamental problem than one related
>>>>>>> to these two viscosity choices... but I'm not sure what this might be.
>>>>>>> 
>>>>>>> Has anyone got any idea of what could be going wrong here?
>>>>>>> 
>>>>>>> Thanks in advance!
>>>>>>> 
>>>>>>> Andreas
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>>> 
>> 
>> -- 
>> ===============================================================
>> Dr. Andreas Klocker
>> Physical Oceanographer
>> 
>> ARC Centre of Excellence for Climate System Science
>> &
>> Institute for Marine and Antarctic Studies
>> University of Tasmania
>> 20 Castray Esplanade
>> Battery Point, TAS
>> 7004 Australia
>> 
>> M:     +61 437 870 182
>> W:     
>> http://www.utas.edu.au/profiles/staff/imas/andreas-klocker
>> 
>> skype: andiklocker
>> ===============================================================
>> 
> 
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support


