[MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Yohei Takano - BAS yokano at bas.ac.uk
Fri May 10 05:02:07 EDT 2024


Hi Martin,

Thank you for following up. I see now that you are talking about the AD code; sorry I missed that (and yes, I can see the AD code).
I will try adding the print statements (along with checking the core dump in debug mode).

Regards,

Yohei

________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
Sent: 09 May 2024 19:56
To: MITgcm Support <mitgcm-support at mitgcm.org>
Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Hi Yohei,

Since this is in the AD code produced by TAF (ctrl_map_genarr2d_ad.F), that’s where you need to put the print statements (in the corresponding subroutine; you’ll find it in your build directory after running "make adall" or "make adtaf").

Martin

On 9. May 2024, at 09:32, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:

Hi Martin,

Thank you so much for the details, good to know where I can dig in related to the issue.
In pkg/ctrl, I think I found the relevant part of the code (although the file name is different from the one you mentioned,
perhaps because of an update? I use checkpoint68r). In ctrl_map_genarr.F (instead of ctrl_map_genarr2d_ad.F), I found the relevant
subroutine (CTRL_MAP_GENARR2D), and I think I can try adding some "print-hallo" statements to trace what is happening.
(Another thing I spotted is that subroutine CTRL_MAP_GENARR2D is not calling tauu or tauv in the code, but I might have missed something;
I am still digging through the code.)

I am now running with the debugging options turned on and hope to obtain line numbers.
I will also try commenting out the string comparison part (commenting out this part will not affect the adjoint simulations, correct?).

Best regards,

Yohei



________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
Sent: 09 May 2024 13:17
To: MITgcm Support <mitgcm-support at mitgcm.org>
Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Hi Yohei,

This is difficult to debug without access to your specific HPC system. A segmentation fault usually means some memory violation (array bounds or, in your case, maybe the length of a string, because of the _F90_STRING_COMPARE frame in the backtrace).

It looks like it’s failing in the AD-part of initialising the ctrl variables, i.e. when the gradients are finally written to the adxx_tauu files (in this case).
My favorite debugger is the primitive “print-hallo” debugger: I would add print statements just before the call of ctrl_map_genarr2d_ad(tauu,tauu_ad, …) in ctrl_map_ini_genarr_ad.f or even within ctrl_map_genarr2d_ad (which will give you a lot of output) to try to figure out what’s going on.
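
As a rough illustration of the "print-hallo" approach, here is a minimal, self-contained sketch. The routine suspect_routine and the arrays are hypothetical stand-ins, not MITgcm code; the point is only the pattern of writing a marker and flushing it before and after the call. The sketch is free-form Fortran, whereas the MITgcm .F sources are fixed-form, so statements pasted into those files need to start in column 7 or later.

```fortran
program print_hallo_demo
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  real :: field(4), field_ad(4)

  field = 1.0
  field_ad = 0.0

  ! Marker just before the suspect call. The flush matters: buffered
  ! output can be lost when the program dies with a segmentation fault.
  write(output_unit, '(a)') 'hallo: before suspect_routine'
  flush(output_unit)

  call suspect_routine(field, field_ad)

  ! If this line never appears in the log, the crash happened inside
  ! (or below) suspect_routine.
  write(output_unit, '(a)') 'hallo: after suspect_routine'
  flush(output_unit)

contains

  ! Hypothetical stand-in for the routine being traced.
  subroutine suspect_routine(f, f_ad)
    real, intent(in) :: f(:)
    real, intent(inout) :: f_ad(:)
    write(output_unit, '(a,i0)') 'hallo: inside, size = ', size(f)
    flush(output_unit)
    f_ad = f_ad + 2.0 * f
  end subroutine suspect_routine

end program print_hallo_demo
```

The last marker that makes it into STDOUT then brackets the location of the crash, which is usually enough to decide where to add the next round of print statements.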

You can also run it with debugging options turned on; then the backtrace should give you line numbers instead of hex addresses.

If it runs 1 out of 5 times, the problem is most likely related to your computer or the way your computer handles internal writes (again because of the _F90_STRING_COMPARE). There are a few string comparisons in ctrl_map_genarr2d_ad like:

>       do k2 = 1, maxctrlproc
>         if (xx_genarr2d_preproc(k2,iarr) == 'noscaling') then
>           doscaling =  .false.
>         endif
>         if (xx_genarr2d_preproc_c(k2,iarr) == 'log10ctrl') then
>           dolog10ctrl =  .true.
>           log10initval = xx_genarr2d_preproc_r(k2,iarr)
>         endif
>       end do


maybe you can try commenting them out for a test?
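
A less invasive alternative to commenting the comparisons out might be to trace them. The sketch below is a hypothetical, self-contained stand-in for the loop above: the array preproc, its length strlen, and the size of maxctrlproc are assumptions made for the demo, not the declarations used in pkg/ctrl. It prints the index and string lengths just before each comparison, so that the last line in the log identifies the element being compared if the run dies there.

```fortran
program string_compare_trace
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer, parameter :: maxctrlproc = 3   ! assumed size for the demo
  integer, parameter :: strlen = 32       ! assumed string length
  character(len=strlen) :: preproc(maxctrlproc)
  logical :: doscaling
  integer :: k2

  ! Initialise every element; an uninitialised or mis-sized string
  ! is one way a comparison could read invalid memory.
  preproc = ' '
  preproc(2) = 'noscaling'
  doscaling = .true.

  do k2 = 1, maxctrlproc
    ! Report index and lengths before the comparison, and flush so
    ! the line survives a crash in the next statement.
    write(output_unit, '(a,i0,a,i0,a,i0)') 'hallo: k2=', k2, &
         ' len=', len(preproc(k2)), ' len_trim=', len_trim(preproc(k2))
    flush(output_unit)
    if (preproc(k2) == 'noscaling') then
      doscaling = .false.
    end if
  end do

  write(output_unit, '(a,l1)') 'doscaling = ', doscaling
end program string_compare_trace
```

If the lengths printed in the real code look unreasonable, or the output stops at a particular k2, that would point at the string arguments rather than at the file I/O.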

Martin

> On 9. May 2024, at 07:09, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:
>
> Hello again,
>
>    Following up on this topic, what I got from the core dump information is
>
> -----
> (gdb) backtrace
> #0  0x00001534138194ea in __memcmp_avx2_movbe () from /lib64/libc.so.6
> #1  0x00001534143fed66 in _F90_STRING_COMPARE () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libfi.so.1
> #2  0x0000000000519160 in ctrl_map_genarr2d_ad_ ()
> #3  0x000000000050a0c0 in ctrl_map_ini_genarr_ad_ ()
> #4  0x0000000000505f4e in ctrl_init_variables_ad_ ()
> #5  0x000000000068c28f in packages_init_variables_ad_ ()
> #6  0x000000000042ca7f in initialise_varia_ad_ ()
> #7  0x000000000041e819 in adthe_main_loop_ ()
> #8  0x0000000000a2c513 in the_model_main_ ()
> #9  0x0000000000978fd0 in main_ ()
> -----
>
>    and STDOUT stops at this point (stops at the same point every time I run the model).
>
> ...
> (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.0000000000.data
> (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0   -42  15   1   1 file=adxx_tauu.0000000000
> (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.effective.00000000
> ...
>
>    I think it is related to data access issues but still not sure why it runs successfully from time to time.
> I am not sure if this is related but, in this configuration, I don't provide 2d or 3d control files in advance (i.e. xx_tauu.*data, meta etc.)
> since I heard this will be generated during the run (and okay for adjoint sensitivity experiments, please correct me if I am wrong).
>
>    Thank you again in advance, please let me know if you have thoughts (or encountered similar issues before).
>
> Best regards,
>
> Yohei
>
>
> From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Yohei Takano - BAS <yokano at bas.ac.uk>
> Sent: 07 May 2024 12:24
> To: mitgcm-support at mitgcm.org <mitgcm-support at mitgcm.org>
> Subject: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
>  Hello all,
>
>    My colleague and I have been working on an adjoint sensitivity setup and simulations
> based on the global coarse-resolution biogeochemistry model (https://github.com/MITgcm/MITgcm/tree/master/verification/tutorial_global_oce_biogeo),
> with a similar configuration plus our own customizations.
>
>     We managed to compile the model and run adjoint test simulations. However, the model crashes
> (with a segmentation fault) towards the end of the adjoint simulations, and we would like your advice to
> figure out why this is happening. The strange part is that we managed to run successfully at one point,
> but when we try to reproduce this with the exact same settings (on the same HPC), it crashes again...
> Roughly speaking, 1 out of 5 runs succeeds, but most of the time it crashes at the same point.
> We are puzzled because we are using the exact same settings/executable every time, and we wonder what causes
> this unstable behaviour. The environment should be the same every time we run the model.
>
>     Does anyone have similar experiences? Here is the code/configuration I have been working on with my colleague Dani Jones.
> https://github.com/ytakano3/MITgcm_BGC_Model_Config
>
>    In "global_ocn3deg_bgcv0" you will see "code_ad" and "input_ad/input_ad_kpp_atmco2pv0", and you can compile
> (with TAF) and run the model. It is a 2-year test run. When I try this, it fails (i.e. segmentation fault) most of the time, but
> again, say 1 in 5 times, the model runs successfully... We would like to figure out why this is happening and why it is unstable, so
> please let me know if you have thoughts on this.
>
>    Sorry about the long e-mail but please let me know if anything is unclear. Thank you in advance.
>
> Regards,
>
> Yohei
>
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support

