[MITgcm-support] segmentation fault
Martin Losch
Martin.Losch at awi.de
Fri Jun 1 03:38:54 EDT 2018
Hi Andreas,
I think this is a new error related to your new MPI implementation. I think you cannot have a myThid of 0; it means you have no threads, basically no model to communicate with, or something like that.
Going back to your original message (assuming that the seg-fault happens in the first time step), and if this is not a size issue, it looks like the compiler optimization screws things up. This is what I would do (you have probably tried some of these things already, I didn't follow the thread closely; there is a rough sketch of steps 1, 2, 4 and 5 below the list):
1. Compile (and run) with a different level of optimization. Your optfile (I am assuming that you use linux_amd64_ifort) has
FOPTIM='-O2 -align -xW -ip'
Set it to '-O0'. If that works, try -O1 etc., adding back the other options one by one. If it runs with lower optimization, then:
2. Since your original seg-fault happens in mom_calc_visc.F, put this routine into the list of NOOPTFILES (currently empty) and adjust NOOPTFLAGS for this routine only. I often have -O1 instead of -O2 for a couple of routines on some HPC platforms, which does not really hamper performance.
3. For testing, change your number of vertical layers Nr to 151 or 149 or something completely different.
4. What happened to the code that you generated with the -devel option? That includes all sorts of debugging options (also -O0), and you can try to run the model with a debugger (I know it's a pain and I never do this myself, but sometimes it can help track down the problem).
5. Try compiling without the Leith code (define AUTODIFF_DISABLE_LEITH and/or AUTODIFF_DISABLE_REYNOLDS_SCALE).
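To make 1, 2, 4 and 5 a bit more concrete, here is a rough sketch (variable names as in the standard linux_amd64_ifort optfile; the paths, the single-process debugger call and the exact header in which the CPP flags live are just examples, so adapt them to your setup):

  # step 1: in the optfile, drop the optimization level first,
  # then add the other options back one at a time
  FOPTIM='-O0'

  # step 2: keep the usual FOPTIM but compile the problematic
  # routine separately with reduced optimization
  FOPTIM='-O2 -align -xW -ip'
  NOOPTFLAGS='-O1 -align -xW -ip'
  NOOPTFILES='mom_calc_visc.F'

  # step 4: regenerate the Makefile with the debug settings and
  # run the executable in a debugger
  ../../../tools/genmake2 -devel -mods=../code -of=../../../tools/build_options/linux_amd64_ifort
  make depend && make
  gdb ./mitgcmuv

  # step 5: in the CPP options header in your code/ directory that
  # picks these up, switch the Leith-related code off
  #define AUTODIFF_DISABLE_LEITH
  #define AUTODIFF_DISABLE_REYNOLDS_SCALE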
Martin
> On 1. Jun 2018, at 03:55, Andreas Klocker <andreas.klocker at utas.edu.au> wrote:
>
> Hi guys,
>
> After becoming really desperate to get this going without success, I tried different openmpi (1.6.3) and intel-fc (12.1.9.293) versions, and I finally managed to get an MITgcm error message before the segmentation fault (including a beautiful copy/paste spelling mistake ;)).
>
> mitgcm.err says:
> ABNROMAL END: S/R BARRIER
> ABNROMAL END: S/R BARRIER
> ABNROMAL END: S/R BARRIER
> ABNROMAL END: S/R BARRIER
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libirc.so 00002B73C04B32C9 Unknown Unknown Unknown
> libirc.so 00002B73C04B1B9E Unknown Unknown Unknown
> libifcoremt.so.5 00002B73C2CEB13C Unknown Unknown Unknown
> libifcoremt.so.5 00002B73C2C5A2A2 Unknown Unknown Unknown
> libifcoremt.so.5 00002B73C2C6B0F0 Unknown Unknown Unknown
> libpthread.so.0 00002B73C36157E0 Unknown Unknown Unknown
> . 00000000004E009D Unknown Unknown Unknown
> . 000000000041E2D3 Unknown Unknown Unknown
> . 00000000005B5658 Unknown Unknown Unknown
> ABNROMAL END: S/R BARRIER
> ABNROMAL END: S/R BARRIER
> ABNROMAL END: S/R BARRIER
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libirc.so 00002AE9EBD362C9 Unknown Unknown Unknown
> libirc.so 00002AE9EBD34B9E Unknown Unknown Unknown
> libifcoremt.so.5 00002AE9EE56E13C Unknown Unknown Unknown
> libifcoremt.so.5 00002AE9EE4DD2A2 Unknown Unknown Unknown
> libifcoremt.so.5 00002AE9EE4EE0F0 Unknown Unknown Unknown
> libpthread.so.0 00002AE9EEE987E0 Unknown Unknown Unknown
> . 00000000004C5ADB Unknown Unknown Unknown
> . 000000000041C165 Unknown Unknown Unknown
> . 00000000005B5658 Unknown Unknown Unknown
>
> And mitgcm.out:
>
> bash-4.1$ more mitgcm.out
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
> 1
> !!!!!!! PANIC !!!!!!! CATASTROPHIC ERROR
> !!!!!!! PANIC !!!!!!! in S/R BARRIER myThid = 0 nThreads =
>
> I'm struggling to figure out what this means, though, since that part of the code is far beyond my understanding... but I'm worried about the amount of "PANIC" in there!
> Has anyone got any suggestions?
>
> cheers,
>
> Andreas
>
>
>
> On 19/05/18 01:37, Patrick Heimbach wrote:
>> Hi Andreas,
>>
>> a small chance (and a bit of a guess) that one of the following might do the trick
>> (we have memory-related issues when running the adjoint):
>>
>> In your shell script or batch job, add
>> ulimit -s unlimited
>>
>> If compiling with ifort, you could try -mcmodel=medium
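>> A minimal sketch of both, assuming an ifort optfile and an mpirun launch line (adapt the names and process count to your setup):
>>
>> # in the shell/batch script, before launching the model
>> ulimit -s unlimited
>> mpirun -np 64 ./mitgcmuv
>>
>> # in the optfile, appended to the compile flags
>> # (with ifort, -mcmodel=medium usually wants -shared-intel at link time)
>> FFLAGS="$FFLAGS -mcmodel=medium -shared-intel"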
>>
>> See mitgcm-support thread, e.g.
>>
>> http://mailman.mitgcm.org/pipermail/mitgcm-support/2005-October/003505.html
>>
>> (you'll need to scroll down to input provided by Constantinos Evangelinos).
>>
>> p.
>>
>>
>>> On May 18, 2018, at 8:04 AM, Dimitris Menemenlis <menemenlis at jpl.nasa.gov> wrote:
>>>
>>> Andreas, I have done something similar quite a few times (i.e., increased horizontal and/or vertical resolution in regional domains with obcs cut out from a global set-up) and did not have the same issue. If helpful, I can dig out and commit to contrib some examples that you can compare your set-up against. Actually, you remind me that I already promised to do this for Gael but that it fell off the bottom of my todo list :-(
>>>
>>> Do you have any custom routines in your "code" directory? Have you tried compiling and linking with array-bound checks turned on?
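>>> With ifort that would be something like the following in the optfile (just a sketch; which variable you extend depends on your optfile):
>>>
>>> # run-time array-bound checking plus a readable traceback
>>> FFLAGS="$FFLAGS -g -traceback -check bounds"
>>>
>>> It makes the run much slower, but an out-of-bounds access then stops with a routine name and line number instead of a bare SIGSEGV.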
>>>
>>> Dimitris Menemenlis
>>> On 05/17/2018 11:45 PM, Andreas Klocker wrote:
>>>
>>>> Matt,
>>>> I cut all the unnecessary packages and still have the same issue.
>>>> I also checked 'size mitgcmuv' and compared it to other runs which work fine (same machine, same queue, same compile options, etc.) - this executable asks for about half the size of those.
>>>> The tiles are already down to 32x32 grid points, and I'm happily running configurations with a tile size almost twice as big and the same number of vertical layers.
>>>> I will try some different tile sizes, but I think the problem must be somewhere else...
>>>> Andreas
>>>> On 18/05/18 00:35, Matthew Mazloff wrote:
>>>>
>>>>> Sounds like a memory issue. I think your executable has become too big for your machine. You will need to reduce the tile size or do something else (e.g., reduce the number of diagnostics or cut a package)
>>>>>
>>>>> and check
>>>>> size mitgcmuv
>>>>> to get a ballpark idea of how much memory you are requesting
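>>>>> It prints the standard GNU size columns (numbers omitted here):
>>>>>
>>>>>    text    data     bss     dec     hex filename
>>>>>     ...     ...     ...     ...     ... mitgcmuv
>>>>>
>>>>> The bss column is typically where MITgcm's big static arrays end up, so data + bss gives a rough estimate of the static memory each process will request.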
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On May 16, 2018, at 11:27 PM, Andreas Klocker <andreas.klocker at utas.edu.au> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I've taken a working 1/24 degree nested simulation (of Drake Passage)
>>>>>> with 42 vertical layers and tried to increase the vertical layers to 150
>>>>>> (without changing anything else apart from obviously my boundary files
>>>>>> for OBCS and recompiling with 150 vertical layers). Suddenly I get the
>>>>>> following error message:
>>>>>>
>>>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>>>> Image PC Routine Line Source
>>>>>> libirc.so 00002BA1704BC2C9 Unknown Unknown Unknown
>>>>>> libirc.so 00002BA1704BAB9E Unknown Unknown Unknown
>>>>>> libifcore.so.5 00002BA1722B5F3F Unknown Unknown Unknown
>>>>>> libifcore.so.5 00002BA17221DD7F Unknown Unknown Unknown
>>>>>> libifcore.so.5 00002BA17222EF43 Unknown Unknown Unknown
>>>>>> libpthread.so.0 00002BA1733B27E0 Unknown Unknown Unknown
>>>>>> mitgcmuv_drake24_ 00000000004E61BC mom_calc_visc_ 3345 mom_calc_visc.f
>>>>>> mitgcmuv_drake24_ 0000000000415127 mom_vecinv_ 3453 mom_vecinv.f
>>>>>> mitgcmuv_drake24_ 0000000000601C33 dynamics_ 3426 dynamics.f
>>>>>> mitgcmuv_drake24_ 0000000000613C2B forward_step_ 2229 forward_step.f
>>>>>> mitgcmuv_drake24_ 000000000064581E main_do_loop_ 1886 main_do_loop.f
>>>>>> mitgcmuv_drake24_ 000000000065E500 the_main_loop_ 1904 the_main_loop.f
>>>>>> mitgcmuv_drake24_ 000000000065E6AE the_model_main_ 2394 the_model_main.f
>>>>>> mitgcmuv_drake24_ 00000000005C6439 MAIN__ 3870 main.f
>>>>>> mitgcmuv_drake24_ 0000000000406776 Unknown Unknown Unknown
>>>>>> libc.so.6 00002BA1737E2D1D Unknown Unknown Unknown
>>>>>> mitgcmuv_drake24_ 0000000000406669 Unknown Unknown Unknown
>>>>>>
>>>>>> First this error pointed to a line in mom_calc_visc.f on which
>>>>>> calculations regarding the Leith viscosity are done. As a test I then
>>>>>> used a Smagorinsky viscosity instead and now it crashes with the same
>>>>>> error, but pointing to a line where Smagorinsky calculations are done. I
>>>>>> assume I must be chasing a way more fundamental problem than one related
>>>>>> to these two viscosity choices...but I'm not sure what this might be....
>>>>>>
>>>>>> Has anyone got any idea of what could be going wrong here?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>
>>>>>
>>
>>
>
> --
> ===============================================================
> Dr. Andreas Klocker
> Physical Oceanographer
>
> ARC Centre of Excellence for Climate System Science
> &
> Institute for Marine and Antarctic Studies
> University of Tasmania
> 20 Castray Esplanade
> Battery Point, TAS
> 7004 Australia
>
> M: +61 437 870 182
> W:
> http://www.utas.edu.au/profiles/staff/imas/andreas-klocker
>
> skype: andiklocker
> ===============================================================
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support