[MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Yohei Takano - BAS yokano at bas.ac.uk
Tue May 14 13:40:11 EDT 2024


Hi Martin, all,

   Based on Martin's advice and the core dump information, I ended up increasing MAX_LEN_FNAM, and so far
the adjoint simulations are running stably (I ran the same simulation several times with no crash since increasing
MAX_LEN_FNAM).

   I still need to keep running tests (and I am not sure why increasing MAX_LEN_FNAM seems to solve the issue), but I wanted to give an update on my
current status for now.

Regards,

Yohei
________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Yohei Takano - BAS <yokano at bas.ac.uk>
Sent: 13 May 2024 15:16
To: MITgcm Support <mitgcm-support at mitgcm.org>
Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Hi Martin,

Thank you for following up on this. So it is not an issue with preparing the xx_* files but something different.
I will try to obtain more information and see if I can trace the lines; I believe the character length is fine...

Regards,

Yohei
________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
Sent: 13 May 2024 13:45
To: MITgcm Support <mitgcm-support at mitgcm.org>
Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

The xx_* files in the zeroth iteration are generated and initialised as zeros by the model; there is no need to specify them.

Martin

On 13. May 2024, at 14:40, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:

Hi Martin,

Thank you for clarifying, and I agree: I need to trace the line for #2  0x0000000000519133 in ctrl_map_genarr2d_ad_ ()
(which, as I understand it, requires debug mode (devel); I am still sorting out the other issue with that).

While tracing the forward/adjoint code I found fnamegenIn (e.g. "fnamegenIn is xx_*.$iter.data, so the required access records starts at"),
and for tauu, for example, I have xx_ files like "xx_tauu.effective.0000000000.data". The character length is not 80; it is the updated one,
set by MAX_LEN_FNAM, and it is 512 as in the current code.

I may have asked this before, but do we need to prepare xx_* files before running the adjoint sensitivity simulations? I think the code generates them when
they are not provided, so we thought it was okay...

Regards,

Yohei




________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
Sent: 13 May 2024 13:08
To: MITgcm Support <mitgcm-support at mitgcm.org>
Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Hi Yohei,

I think you need to find out what the hex code in this line of your traceback means (i.e. which line):

#2  0x0000000000519133 in ctrl_map_genarr2d_ad_ ()
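
A minimal sketch of one way to do that lookup, assuming the executable was built with debug information and that the GNU binutils tool addr2line is available on the system. The helper names (parse_frames, addr2line_command) and the executable path ./mitgcmuv_ad are illustrative assumptions, not taken from this thread. The script pulls the frame addresses that belong to the model executable out of a gdb backtrace and prints an addr2line command for each one:

```python
import re

# Matches gdb frame lines such as:
#   #2  0x0000000000519133 in ctrl_map_genarr2d_ad_ ()
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+) \(\)")


def parse_frames(backtrace):
    """Return (frame_number, address, symbol) for frames in the executable itself."""
    frames = []
    for line in backtrace.splitlines():
        if " from " in line:
            # Frame lives in a shared library (e.g. libu.so.1); its address
            # cannot be resolved against the model executable.
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), int(match.group(2), 16), match.group(3)))
    return frames


def addr2line_command(exe, address):
    """Build an addr2line call: -e selects the executable, -f also prints the function name."""
    return f"addr2line -f -e {exe} {address:#x}"


backtrace = """\
#0  0x0000153e0f8793be in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#1  0x0000153e0f88700e in __cray_memmove_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#2  0x0000000000519133 in ctrl_map_genarr2d_ad_ ()
#3  0x000000000050a0c0 in ctrl_map_ini_genarr_ad_ ()
"""

for number, address, symbol in parse_frames(backtrace):
    # "./mitgcmuv_ad" is an assumed executable name; use your own binary.
    print(f"# frame {number}: {symbol}")
    print(addr2line_command("./mitgcmuv_ad", address))
```

Without debug information addr2line typically reports ?? instead of a file and line, so this only helps once a debug build works.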

The error message is pretty clear, and since the last MITgcm routine is the taf-generated ctrl_map_genarr2d_ad, it should happen somewhere in that routine. But since this routine depends on your setup (CPP-flags, etc), it’s hard to see from here, where this should happen. In the forward code there are write statements like:

>       WRITE(fnamegenIn,'(2A,I10.10)')
>      & ctrlDir(1:ilDir)//fnamebase(1:ilgen),'.',optimcycle
>       WRITE(fnamegenOut,'(2A,I10.10)')
>      & ctrlDir(1:ilDir)//fnamebase(1:ilgen),'.effective.',optimcycle

which also appear in the ad-code, so I guess this is where it happens. What are the lengths of your “fnamegenin/out”? In current code they are “max_len_fnam” = 512, but I remember that we just changed that recently from 80.
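
To make the length question concrete, here is a rough Python model of the two internal WRITEs quoted above, under the assumption that the record length is simply len(ctrlDir) + len(fnamebase) + len(separator) + 10 digits. The function names and the 60-character directory are hypothetical, chosen only for illustration, and nothing here establishes that this is the cause of the crash in this setup.

```python
def ctrl_fname(ctrl_dir, fnamebase, optimcycle, effective=False):
    """Mimic WRITE(fname,'(2A,I10.10)') ctrlDir//fnamebase, sep, optimcycle.

    Assumes 0 <= optimcycle < 10**10, so I10.10 yields exactly 10 zero-padded digits.
    """
    sep = ".effective." if effective else "."
    return f"{ctrl_dir}{fnamebase}{sep}{optimcycle:010d}"


def fits(record, declared_len):
    """An internal WRITE needs the record to be no longer than the CHARACTER variable."""
    return len(record) <= declared_len


name = ctrl_fname("", "xx_tauu", 0, effective=True)
print(name, len(name))

# Hypothetical 60-character control directory prefix, for illustration only.
long_dir = "d" * 59 + "/"
long_name = ctrl_fname(long_dir, "xx_tauu", 0, effective=True)
for declared_len in (80, 512):
    print(declared_len, fits(long_name, declared_len))
```

With an empty ctrlDir the name for xx_tauu is 28 characters, well inside either limit; in this model only a long directory prefix or base name would push the record past an 80-character variable.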

Martin

> On 11. May 2024, at 16:55, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:
>
> Hello everyone,
>
>    I followed Martin's suggestion and commented out the STRING part, but the error (core dump) still occurs,
> although the core backtrace has changed slightly.
>
> -----
> #0  0x0000153e0f8793be in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
> #1  0x0000153e0f88700e in __cray_memmove_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
> #2  0x0000000000519133 in ctrl_map_genarr2d_ad_ ()
> #3  0x000000000050a0c0 in ctrl_map_ini_genarr_ad_ ()
> #4  0x0000000000505f4e in ctrl_init_variables_ad_ ()
> #5  0x000000000068c07f in packages_init_variables_ad_ ()
> #6  0x000000000042ca7f in initialise_varia_ad_ ()
> #7  0x000000000041e819 in adthe_main_loop_ ()
> #8  0x0000000000a2c303 in the_model_main_ ()
> #9  0x0000000000978dc0 in main_ ()
> -----
>
>   I tried debug mode (-devel), but I am having another issue with it on the HPC I am using (Archer2):
> it turns out MITgcm is trying to write into a string and is writing too many characters, given the declared length of the string.
> I couldn't trace where this comes from yet (my colleague said it could be anywhere...), so for now I am a bit stuck.
>
> -----
> lib-4211 : UNRECOVERABLE library error
>   A WRITE operation tried to write a record that was too long.
>
> Encountered during a sequential formatted WRITE to an internal file (character variable)
> -----
>
>    For now, I will try a more stable (solid) adjoint setup and see if the adjoint simulations run stably on the HPC.
> My colleague suggested trying a simpler adjoint setup from verification, such as the one below, so I will start again from this setup.
> https://github.com/MITgcm/MITgcm/tree/master/verification/tutorial_tracer_adjsens
> (I see this has been updated recently, but perhaps Martin can follow up on this).
>
>    Thank you for the suggestion; I will keep you updated when there is progress.
>
> Regards,
>
> Yohei
>
>
> From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Yohei Takano - BAS <yokano at bas.ac.uk>
> Sent: 10 May 2024 14:18
> To: MITgcm Support <mitgcm-support at mitgcm.org>
> Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
>
> Thank you again for your continuous support, Martin. I think I finally managed to trace the routine; I will check how things
> are working now.
>
> Regards,
>
> Yohei
>
> From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
> Sent: 10 May 2024 13:17
> To: MITgcm Support <mitgcm-support at mitgcm.org>
> Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
>
> Based on your backtrace message, your code will have a routine “ctrl_map_genarr2d_ad” somewhere (because it appears in the backtrace message). “grep” is your friend:
>
> cd build
> grep -i ctrl_map_genarr2d_ad *_ad.f
>
> Martin
>
> > On 10. May 2024, at 05:30, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:
> >
> > Hi Martin, all,
> >
> >    Sorry again, I followed up and checked the adjoint code, but it turns out the "ctrl_map_genarr2d_ad.F" code does not exist
> > (after make adall) and something strange might be happening... More specifically, I don't see the forward ctrl_map_genarr2d.F
> > code in checkpoint68r; instead I see "ctrl_map_genarr.F", but still the adjoint code (ctrl_map_genarr_ad.F) is not generated...
> >
> >    I am not sure why this happened (I checked the code and the ctrl package is included).
> >
> > Best regards,
> >
> > Yohei
> >
> > From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
> > Sent: 09 May 2024 19:56
> > To: MITgcm Support <mitgcm-support at mitgcm.org>
> > Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
> >  Hi Yohei,
> >
> > Since this is in the AD-code produced by TAF (ctrl_map_genarr2d_ad.F), that’s where you need to put the print statements (in the corresponding subroutine; you’ll find it in your build directory after running "make adall" or "make adtaf").
> >
> > Martin
> >
> >> On 9. May 2024, at 09:32, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:
> >>
> >> Hi Martin,
> >>
> >> Thank you so much for the details, good to know where I can dig in related to the issue.
> >> In pkg/ctrl, I think I found the relevant part of the code (although the code/file name is now different from what you mentioned,
> >> perhaps because of an update? I use checkpoint68r). In ctrl_map_genarr.F (instead of ctrl_map_genarr2d_ad.F), I found the relevant
> >> subroutine (CTRL_MAP_GENARR2D), and I think I can try adding some "print-hallo" statements to trace what is happening.
> >> (Another thing I spotted is that subroutine CTRL_MAP_GENARR2D is not calling tauu or tauv in the code, but I might have missed something;
> >> I am still digging through the code.)
> >>
> >> I am now running with the debugging options turned on and hope to obtain line numbers.
> >> I will also try commenting out the string comparison part (commenting out this part will not affect the adjoint simulations... correct?).
> >>
> >> Best regards,
> >>
> >> Yohei
> >>
> >>
> >>
> >> From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Martin Losch <Martin.Losch at awi.de>
> >> Sent: 09 May 2024 13:17
> >> To: MITgcm Support <mitgcm-support at mitgcm.org>
> >> Subject: Re: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
> >>  Hi Yohei,
> >>
> >> this is difficult to debug without access to your specific hpc computer. Segmentation fault usually means some memory violation (array boundaries, or in your case maybe the length of a string, because of the _f90_string_compare).
> >>
> >> It looks like it’s failing in the AD-part of initialising the ctrl variables, i.e. when the gradients are finally written to the adxx_tauu files (in this case).
> >> My favorite debugger is the primitive “print-hallo” debugger: I would add print statements just before the call of ctrl_map_genarr2d_ad(tauu,tauu_ad, …) in ctrl_map_ini_genarr_ad.f or even within ctrl_map_genarr2d_ad (which will give you a lot of output) to try to figure out what’s going on.
> >>
> >> You can also run it with debugging options turned on; then the backtrace should give you line numbers instead of hex code.
> >>
> >> If it runs 1 out of 5 times, the problem is most likely related to your computer or the way your computer handles internal writes (again because of the _F90_STRING_COMPARE). There are a few string comparisons in ctrl_map_genarr2d_ad, like:
> >>
> >> >       do k2 = 1, maxctrlproc
> >> >         if (xx_genarr2d_preproc(k2,iarr) == 'noscaling') then
> >> >           doscaling =  .false.
> >> >         endif
> >> >         if (xx_genarr2d_preproc_c(k2,iarr) == 'log10ctrl') then
> >> >           dolog10ctrl =  .true.
> >> >           log10initval = xx_genarr2d_preproc_r(k2,iarr)
> >> >         endif
> >> >       end do
> >>
> >>
> >> maybe you can try commenting them out for a test?
> >>
> >> Martin
> >>
> >> > On 9. May 2024, at 07:09, Yohei Takano - BAS <yokano at bas.ac.uk> wrote:
> >> >
> >> > Hello again,
> >> >
> >> >    Following up on this topic, what I got from the core dump information is
> >> >
> >> > -----
> >> > (gdb) backtrace
> >> > #0  0x00001534138194ea in __memcmp_avx2_movbe () from /lib64/libc.so.6
> >> > #1  0x00001534143fed66 in _F90_STRING_COMPARE () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libfi.so.1
> >> > #2  0x0000000000519160 in ctrl_map_genarr2d_ad_ ()
> >> > #3  0x000000000050a0c0 in ctrl_map_ini_genarr_ad_ ()
> >> > #4  0x0000000000505f4e in ctrl_init_variables_ad_ ()
> >> > #5  0x000000000068c28f in packages_init_variables_ad_ ()
> >> > #6  0x000000000042ca7f in initialise_varia_ad_ ()
> >> > #7  0x000000000041e819 in adthe_main_loop_ ()
> >> > #8  0x0000000000a2c513 in the_model_main_ ()
> >> > #9  0x0000000000978fd0 in main_ ()
> >> > -----
> >> >
> >> >    and STDOUT stops at this point (stops at the same point every time I run the model).
> >> >
> >> > ...
> >> > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.0000000000.data
> >> > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0   -42  15   1   1 file=adxx_tauu.0000000000
> >> > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.effective.00000000
> >> > ...
> >> >
> >> >    I think it is related to data access issues, but I am still not sure why it runs successfully from time to time.
> >> > I am not sure if this is related, but in this configuration I don't provide 2d or 3d control files in advance (i.e. xx_tauu.*data, meta etc.),
> >> > since I heard these will be generated during the run (and that this is okay for adjoint sensitivity experiments; please correct me if I am wrong).
> >> >
> >> >    Thank you again in advance, please let me know if you have thoughts (or encountered similar issues before).
> >> >
> >> > Best regards,
> >> >
> >> > Yohei
> >> >
> >> >
> >> > From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Yohei Takano - BAS <yokano at bas.ac.uk>
> >> > Sent: 07 May 2024 12:24
> >> > To: mitgcm-support at mitgcm.org <mitgcm-support at mitgcm.org>
> >> > Subject: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations
> >> >  Hello all,
> >> >
> >> >    My colleague and I have been working on an adjoint sensitivity setup and simulations
> >> > based on the global coarse-resolution biogeochemistry model (https://github.com/MITgcm/MITgcm/tree/master/verification/tutorial_global_oce_biogeo),
> >> > with a similar configuration but our own customizations.
> >> >
> >> >     We managed to compile the model and run adjoint test simulations. However, the model crashes
> >> > (with a segmentation fault) towards the end of the adjoint simulations, and we would like your advice to
> >> > figure out why this is happening. The strange part is that we managed to run successfully at one point,
> >> > but when we try to reproduce this with the exact same settings (on the same HPC) it crashes again...
> >> > Roughly speaking, it runs successfully 1 out of 5 times, but most of the time it crashes at the same point.
> >> > We are puzzled because we are using the exact same settings/executable every time, and we wonder what causes
> >> > this unstable behaviour. The environment should be the same every time we run the model.
> >> >
> >> >     Does anyone have similar experiences? Here is the code/configuration I have been working on with my colleague Dani Jones.
> >> > https://github.com/ytakano3/MITgcm_BGC_Model_Config
> >> >
> >> >    In "global_ocn3deg_bgcv0" you will see "code_ad" and "input_ad/input_ad_kpp_atmco2pv0", with which you can compile
> >> > (with TAF) and run the model. It is a 2-year test run. When I try this, it fails (i.e. segmentation fault) most of the time, but
> >> > again, say 1 in 5 times, the model runs successfully... We would like to figure out why this is happening and why it is unstable, so
> >> > please let me know if you have thoughts on this.
> >> >
> >> >    Sorry about the long e-mail but please let me know if anything is unclear. Thank you in advance.
> >> >
> >> > Regards,
> >> >
> >> > Yohei
> >> >
> >> >
> >> > _______________________________________________
> >> > MITgcm-support mailing list
> >> > MITgcm-support at mitgcm.org
> >> > http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
