[MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Yohei Takano - BAS yokano at bas.ac.uk
Thu May 9 07:09:45 EDT 2024


Hello again,

   Following up on this topic, here is the backtrace I got from the core dump:

-----
(gdb) backtrace
#0  0x00001534138194ea in __memcmp_avx2_movbe () from /lib64/libc.so.6
#1  0x00001534143fed66 in _F90_STRING_COMPARE () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libfi.so.1
#2  0x0000000000519160 in ctrl_map_genarr2d_ad_ ()
#3  0x000000000050a0c0 in ctrl_map_ini_genarr_ad_ ()
#4  0x0000000000505f4e in ctrl_init_variables_ad_ ()
#5  0x000000000068c28f in packages_init_variables_ad_ ()
#6  0x000000000042ca7f in initialise_varia_ad_ ()
#7  0x000000000041e819 in adthe_main_loop_ ()
#8  0x0000000000a2c513 in the_model_main_ ()
#9  0x0000000000978fd0 in main_ ()
-----

   STDOUT stops at the following point (the same point every time I run the model):

...
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.0000000000.data
(PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0   -42  15   1   1 file=adxx_tauu.0000000000
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_tauu.effective.00000000
...

   I think it is related to a data access issue, but I am still not sure why it runs successfully from time to time.
I am not sure if this is related, but in this configuration I don't provide 2D or 3D control files in advance (i.e. xx_tauu.*data, meta, etc.),
since I heard these will be generated during the run (and that this is okay for adjoint sensitivity experiments; please correct me if I am wrong).
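
   As a rough way to check which control files are actually present in the run directory at the time of the crash, something like the following sketch could be used. It assumes the files follow the xx_tauu / adxx_tauu naming seen above; the function name `list_control_files` and the run directory are hypothetical, and none of this is part of MITgcm.

```python
from pathlib import Path


def list_control_files(run_dir, var="tauu"):
    """Return sorted names of control files for `var` found in run_dir.

    Assumption: files are named xx_<var>.*.data/.meta (controls) and
    adxx_<var>.*.data/.meta (adjoint sensitivities), as in the STDOUT above.
    """
    run_dir = Path(run_dir)
    found = set()
    for prefix in ("xx_", "adxx_"):
        for suffix in (".data", ".meta"):
            # e.g. xx_tauu.0000000000.data or adxx_tauu.0000000000.meta
            for path in run_dir.glob(f"{prefix}{var}.*{suffix}"):
                found.add(path.name)
    return sorted(found)
```

   Calling `list_control_files(".")` from the run directory before and after a run would show whether the xx_tauu files were generated by the run or were missing when the model tried to read them.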

   Thank you again in advance. Please let me know if you have any thoughts (or have encountered similar issues before).

Best regards,

Yohei


________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Yohei Takano - BAS <yokano at bas.ac.uk>
Sent: 07 May 2024 12:24
To: mitgcm-support at mitgcm.org <mitgcm-support at mitgcm.org>
Subject: [MITgcm-support] Segmentation Fault in MITgcm Adjoint Simulations

Hello all,

   My colleague and I have been working on an adjoint sensitivity setup and simulations
based on the global coarse-resolution biogeochemistry model (https://github.com/MITgcm/MITgcm/tree/master/verification/tutorial_global_oce_biogeo).
Our configuration is similar, but with our own customizations.

    We managed to compile the model and run adjoint test simulations. However, the model crashes
(with a segmentation fault) towards the end of the adjoint simulations, and we would like your advice to
figure out why this is happening. The strange part is that we managed to run it successfully at one point,
but when we try to reproduce this with the exact same settings (on the same HPC) it crashes again...
Roughly speaking, 1 out of 5 times it runs successfully, but most of the time it crashes at the same point.
We are puzzled because we are using the exact same settings/executable every time, and we wonder what causes
this unstable behaviour. The environment should be the same every time we run the model.
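
    To put a number on the roughly 1-in-5 success rate, one could rerun the same executable several times and tally the exit codes. The following is only a sketch: the helper names `tally_runs` and `count_segfaults` are hypothetical, and it assumes the executable can be started directly as a child process on a POSIX system.

```python
import signal
import subprocess


def tally_runs(cmd, n_runs=5):
    """Run `cmd` (a list of arguments) n_runs times and return the exit codes."""
    codes = []
    for _ in range(n_runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes.append(result.returncode)
    return codes


def count_segfaults(codes):
    """Count runs killed by SIGSEGV.

    On POSIX, subprocess reports a child killed by signal N as return code -N.
    """
    return sum(1 for code in codes if code == -signal.SIGSEGV)
```

    For example, `count_segfaults(tally_runs(["./mitgcmuv_ad"], n_runs=5))`, where `./mitgcmuv_ad` is an assumed name for the adjoint executable. If the model is started through a shell or a batch launcher, the segmentation fault may instead show up as a positive exit code such as 139 (128 + 11), so the check would need adjusting.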

    Has anyone had similar experiences? Here is the code/configuration I have been working on with my colleague Dani Jones:
https://github.com/ytakano3/MITgcm_BGC_Model_Config

   In "global_ocn3deg_bgcv0" you will find "code_ad" and "input_ad/input_ad_kpp_atmco2pv0", with which you can compile
(with TAF) and run the model. It is a 2-year test run. When I try this, it fails (i.e. segmentation fault) most of the time, but
again, say 1 in 5 times, the model runs successfully... We would like to figure out why this is happening and why it is unstable, so
please let me know if you have thoughts on this.

   Sorry about the long e-mail, but please let me know if anything is unclear. Thank you in advance.

Regards,

Yohei



