[MITgcm-support] verification_other case global_oce_cs32 fails at runtime in adjoint mode

Martin Losch Martin.Losch at awi.de
Thu Sep 24 07:16:19 EDT 2020


Hi Dan,

this is what I have for my parallel test:

>> cd verification_other/global_oce_cs32/tr_run.sens
>> ls adm*
adm_boxmean_theta.0000000000.data  adm_boxmean_theta.0000000000.meta  adm_horflux_vol.0000000000.data  adm_horflux_vol.0000000000.meta
>> cd ../run/
>> ls adm*
adm_etastep.0000000000.data  adm_etastep.0000000000.meta  adm_sststep.0000000000.data  adm_sststep.0000000000.meta

So I do have these files.
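
A minimal sketch of the same check as a reusable function, assuming the file names follow the listing above (prefix plus ".0000000000.data" and ".0000000000.meta"). The function name check_adm_files is hypothetical and not part of MITgcm or testreport:

```shell
# Report which adjoint files are missing from a run directory.
# Usage: check_adm_files DIR NAME...
# check_adm_files is a hypothetical helper, not an MITgcm tool.
check_adm_files() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        # Each field is expected as a .data/.meta pair at iteration 0.
        for ext in data meta; do
            f="$dir/$name.0000000000.$ext"
            if [ ! -f "$f" ]; then
                echo "missing: $f"
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only if every expected file was found.
    [ "$missing" -eq 0 ]
}
```

For example, "check_adm_files tr_run.sens adm_boxmean_theta adm_horflux_vol" could be run in both the testreport directory and the manual run directory to compare the two.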

> That's really strange. However, when I run "testreport" in serial mode, I don't get the above error! So I guess the "File does not exist" issue is specific to my parallel adjoint runs for some reason? Specifically:
> 
Is this the parallel run?
> run: ../verification/testreport -of ../tools/build_options/linux_amd64_scihub -command ./mitgcmuv_ad -adm -ncad -t global_oce_cs32
Probably not, because it should have the options -MPI 24 (or 6 or whatever) and -command "mpirun -n 24 ./mitgcmuv_ad".
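
To make the difference between the two invocations explicit, here is a small sketch that prints either form. It uses only options that appear in this thread (-of, -MPI, -command, -adm, -ncad, -t); the function name testreport_cmd is hypothetical, and "mpirun -n" is an assumption about the local launcher:

```shell
# Print a testreport command line for the adjoint test, serial or parallel.
# Usage: testreport_cmd NPROCS OPTFILE
# testreport_cmd is a hypothetical helper; it only prints, it runs nothing.
testreport_cmd() {
    nprocs=$1
    optfile=$2
    common="-of $optfile -adm -ncad -t global_oce_cs32"
    if [ "$nprocs" -gt 1 ]; then
        # Parallel: -MPI plus a launcher inside -command.
        # "mpirun -n" is an assumption; the launcher is site-specific.
        printf '%s\n' "../verification/testreport -MPI $nprocs -command \"mpirun -n $nprocs ./mitgcmuv_ad\" $common"
    else
        # Serial: the executable is run directly.
        printf '%s\n' "../verification/testreport -command ./mitgcmuv_ad $common"
    fi
}
```

Calling "testreport_cmd 24 ../tools/build_options/linux_amd64_scihub" prints the parallel form; with 1 it prints the serial form quoted above. On a SLURM system the launcher would be srun rather than mpirun.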


The fact that your runs explode also points to something completely different. Maybe you can post the exact git clone/genmake/make/run sequence that you run to reproduce the problem, as a sort of script?
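
One way to capture such a sequence is to wrap every command in a small logging function, so that the log itself is the script to post. This is only a sketch: the names step and LOG are hypothetical, and the commented example assumes that verification_other is cloned inside the MITgcm directory and that both repositories are fetched from GitHub.

```shell
# Record each command and its output so the exact sequence can be posted.
# LOG is made absolute so that "step cd ..." does not move the log file.
LOG=${LOG:-$PWD/repro.log}

# step is a hypothetical helper: it logs the command line, runs the
# command in the current shell, and returns the command's exit status.
step() {
    echo "+ $*" >> "$LOG"
    "$@" >> "$LOG" 2>&1
}

# Example sequence (assumed layout and URLs; replace <optfile> with yours):
#   step git clone https://github.com/MITgcm/MITgcm.git
#   step git clone https://github.com/MITgcm/verification_other.git MITgcm/verification_other
#   step cd MITgcm/verification_other
#   step ../verification/testreport -adm -ncad -t global_oce_cs32 -of <optfile>
```

Because step runs the command in the current shell, directory changes persist between steps, and a failing step leaves its error message in the log next to the command that produced it.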

Martin
> on : Linux bslws05 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
>   OPTFILE=/data/expose/MITgcm_development/MITgcm/tools/build_options/linux_amd64_scihub
> 
> Adjoint generated by TAF
> 
> default    10
> G D M    C  A  F
> e p a R  o  d  D
> n n k u  s  G  G
> 2 d e n  t  r  r
> 
> Y Y Y N .. .. .. N/O   global_oce_cs32  (e=0, w=48)
> Y Y Y N .. .. .. N/O   global_oce_cs32.sens
> Start time:  Thu 24 Sep 10:47:11 BST 2020
> End time:    Thu 24 Sep 11:05:41 BST 2020
> 
> The "global_oce_cs32" run fails with this error:
> 
>  fail at i,j=  16  17 ; rStarFacC,H,eta =**********  5.158000E+03 -2.329336E+07
>  fail at i,j=  16  17 ; rStarFacS,H,eta =**********  5.091000E+03  2.292655E+05 -2.329336E+07
>  fail at i,j=  17  17 ; rStarFacC,H,eta =**********  5.158000E+03 -2.329336E+07
> WARNING: r*FacC < hFacInf at       2 pts : bi,bj,Thid,Iter=  12   1   1         7
> WARNING: r*FacS < hFacInf at       1 pts : bi,bj,Thid,Iter=  12   1   1         7
> STOP in CALC_R_STAR : too SMALL rStarFac[C,W,S] !
> 
> Whereas the "global_oce_cs32.sens" run does finish, albeit with a note/warning:
> 
> Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> STOP NORMAL END
> PROGRAM MAIN: Execution ended Normally
> 
> In short: the "adm_horflux_vol" and "adm_boxmean_theta" files *do* exist in the testreport run. But they do not exist in my manual run. I don't yet have any idea what the difference might be. 
> 
> I'll try to see if anyone else can replicate my error. Any other thoughts at the moment? 
> 
> Thanks so much again for your help thus far. 
> 
> Best wishes,
> Dan
> 
> On Wed, Sep 23, 2020 at 5:07 PM <mitgcm-support-request at mitgcm.org> wrote:
> Send MITgcm-support mailing list submissions to
>         mitgcm-support at mitgcm.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
> or, via email, send a message with subject or body 'help' to
>         mitgcm-support-request at mitgcm.org
> 
> You can reach the person managing the list at
>         mitgcm-support-owner at mitgcm.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of MITgcm-support digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: verification_other case global_oce_cs32 fails at runtime
>       in adjoint mode (Martin Losch)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 23 Sep 2020 11:54:25 +0200
> From: Martin Losch <Martin.Losch at awi.de>
> To: MITgcm Support <mitgcm-support at mitgcm.org>
> Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
>         fails at runtime in adjoint mode
> Message-ID: <63C0735F-3275-4680-8322-3D96116B10A2 at awi.de>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Dan,
> 
> the "N" means that the run did not complete successfully. genmake2, make depend, and make all have a Y, meaning successful completion.
> 
> I reran the verification_other/global_oce_cs32 experiment on 24 CPUs on our Cray CS400 with ifort and TAF, and this is the result:
> 
> > run: ../verification/testreport -MPI 24 -j 16 -of ../tools/build_options/linux_ia64_ifort_ollie -command 'srun ./mitgcmuv_ad' -adm -ncad -t global_oce_cs32
> > on : Linux prod-0115 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > 
> >   OPTFILE=/work/ollie/mlosch/test_ollie/MITgcm_ifort/tools/build_options/linux_ia64_ifort_ollie
> > 
> > Adjoint generated by TAF
> > 
> > default    10
> > G D M    C  A  F
> > e p a R  o  d  D
> > n n k u  s  G  G
> > 2 d e n  t  r  r
> > 
> > Y Y Y Y 16>16< 4 pass  global_oce_cs32  (e=0, w=48)
> > Y Y Y Y 16>16<16 pass  global_oce_cs32.sens
> > Start time:  Wed Sep 23 11:46:44 CEST 2020
> > End time:    Wed Sep 23 11:48:14 CEST 2020
> 
> note that my "mpirun" is actually "srun", because we use SLURM. So I have no issue here and I have no idea why this is different for you.
> Can I suggest that you start from scratch again: Get a new clone of MITgcm and verification_other and retry? 
> 
> Martin
> 
> 
> 
> > On 23. Sep 2020, at 11:20, Dan Jones <dcjones.work at gmail.com> wrote:
> > 
> > Hi Martin,
> > 
> > Yes, apologies, I had commented out parts of the cost function for my testing purposes. If I just run the unmodified case, I get:
> > 
> > global fc =  0.543536615356758E+08
> > 
> > But this error still persists:
> > 
> > MDS_READ_FIELD: filename: adm_boxmean_theta.0000000000.data
> > 
> > If I create a blank adm file for it to read, then the computation continues but quickly runs into a different error. And in any case, adding a blank file before running shouldn't be necessary.
> > 
> > I ran the "testreport" on a couple of adjoint cases in the "verification" directory. Both "tutorial_global_oce_optim" and "tutorial_dic_adjoffline" passed. But when I try to run this set of commands:
> > 
> > cd verification_other
> >  ../verification/testreport -t global_oce_cs32/ -adm -j 4 - optfile=../tools/build_options/linux_ia64_cray_archer
> > 
> > Then I get the following result:
> > 
> > Y Y Y N .. .. .. N/O   global_oce_cs32/  (e=0, w=48)
> > Y Y Y N .. .. .. N/O   global_oce_cs32/.sens
> > 
> > Can you remind me what the last "N" stands for again? What has failed? This line of output seems suspicious:
> > 
> > ../verification/testreport: line 845: 32072 Killed                  ./mitgcmuv_ad > output_adm.txt
> > 
> > That line in "testreport" has to do with DIVA, according to the comments. Unfortunately, "output_adm.txt" is blank. 
> > 
> > Are you able to get the adjoint part of the "input_ad.sens" experiment to work, after the cost function has been calculated? 
> > 
> > Best wishes,
> > Dan
> > 
> > On Tue, Sep 22, 2020 at 5:17 PM <mitgcm-support-request at mitgcm.org> wrote:
> > 
> > Today's Topics:
> > 
> >    1. Re: verification_other case global_oce_cs32 fails at runtime
> >       in adjoint mode (Martin Losch)
> > 
> > 
> > ----------------------------------------------------------------------
> > 
> > Message: 1
> > Date: Tue, 22 Sep 2020 17:40:44 +0200
> > From: Martin Losch <Martin.Losch at awi.de>
> > To: MITgcm Support <mitgcm-support at mitgcm.org>
> > Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
> >         fails at runtime in adjoint mode
> > Message-ID: <5B5A14D1-A1DF-435B-B468-E0807AC4C437 at awi.de>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > Hi Dan,
> > 
> > I get very different output and results (e.g. global_fc= 0.543870...D+08).
> > 
> > Can you try to run testreport on this?
> > 
> > cd verification_other
> > ../verification/testreport -t global_oce_cs32 -adm $otheroptions
> > where $otheroptions are, e.g., -j 4, -devel, etc. If you want to use your MPI job you'll need '-MPI 24' and maybe "-command 'mpirun -n 24 ./mitgcmuv'" to run on 24 CPUs. Check the help section of testreport for details.
> > 
> > The output should be "fairly" independent of the number of CPUs. I use the default 24 tiles (but run sequentially on one CPU).
> > 
> > Martin
> > 
> > 
> > > On 22. Sep 2020, at 16:52, Dan Jones <dcjones.work at gmail.com> wrote:
> > > 
> > > Hi Martin,
> > > 
> > > Thanks for your quick reply! I can't get the serial case to run on ARCHER, unfortunately. I think for now I'm stuck testing in parallel. When I run "grep m_boxmean_theta *.f", I get exactly the same results as you. I'm also using MITgcm/verification_other from GitHub. 
> > > 
> > > Here is a bit more of the output which will hopefully help. This is from a case where I tried to use the "m_horflux_vol" case instead of the "m_boxmean_theta" case. I'm using data.ecco from the input_ad.sens directory.  
> > > 
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  --> f_gencost =-0.348173207824978E+08 2
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 1
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 2
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 3
> > > (PID.TID 0000.0001)  --> fc               =-0.348173207824978E+08
> > > (PID.TID 0000.0001)   early fc =  0.000000000000000E+00
> > > (PID.TID 0000.0001)   local fc =  0.000000000000000E+00
> > > (PID.TID 0000.0001)  global fc = -0.348173207824978E+08
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_diffkr.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapredi.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapgm.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: filename: adm_horflux_vol.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: File does not exist
> > > 
> > > So it's after the cost function has been calculated, as the model is getting ready to perform the adjoint steps. It's able to read/write for the existing controls (kapgm, kapredi, diffkr). But it's apparently not creating an "ad" file for the general objective function term "horflux". That's why I was wondering if I should manually create a blank file first, as an ad-hoc fix. Any thoughts? 
> > > 
> > > Best wishes,
> > > Dan
> > > 
> > > On Mon, Sep 21, 2020 at 8:37 PM <mitgcm-support-request at mitgcm.org> wrote:
> > > 
> > > Today's Topics:
> > > 
> > >    1. verification_other case global_oce_cs32 fails at  runtime in
> > >       adjoint mode (Dan Jones)
> > >    2. Re: verification_other case global_oce_cs32 fails at runtime
> > >       in adjoint mode (Martin Losch)
> > > 
> > > 
> > > ----------------------------------------------------------------------
> > > 
> > > Message: 1
> > > Date: Mon, 21 Sep 2020 10:00:47 +0100
> > > From: Dan Jones <dcjones.work at gmail.com>
> > > To: mitgcm-support at mitgcm.org
> > > Subject: [MITgcm-support] verification_other case global_oce_cs32
> > >         fails at        runtime in adjoint mode
> > > Message-ID:
> > >         <CAPj3iHRxhUOCDT5m7H8uj8cg9dc=_oYVssQnvYhEA+_ALjeR6w at mail.gmail.com>
> > > Content-Type: text/plain; charset="utf-8"
> > > 
> > > Hello.
> > > 
> > > Apologies for the cross-posting - I've posted this as a GitHub issue, but I
> > > thought I should put it here as well.
> > > 
> > > I am trying to build and test the global_oce_cs32 verification_other
> > > exercise using the code in the input_ad.sens directory. The forward case
> > > compiles and runs without error. The adjoint case (built using TAF)
> > > compiles without error, but at runtime I receive the following error in
> > > STDOUT:
> > > 
> > > (PID.TID 0000.0001)  MDS_READVEC_LOC: open file: south30_maskT
> > > (PID.TID 0000.0001)  MDS_RD_REC_RL: iRec,Dim =         9          1
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: filename: adm_boxmean_theta.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: File does not exist
> > > 
> > > and this error in STDERR:
> > > 
> > > (PID.TID 0000.0001) *** ERROR ***  MDS_READ_FIELD: filename:
> > > adm_boxmean_theta.0000000000.data
> > > (PID.TID 0000.0001) *** ERROR ***  MDS_READ_FIELD: File does not exist
> > > 
> > > My MITgcm source code is up-to-date with the master. I am running on
> > > archer.ac.uk <https://www.archer.ac.uk/> in parallel mode using 24 cores.
> > > 
> > > What should I try here? I haven't run into this error before using other
> > > adjoint setups, at least not that I can recall. Should I just create an
> > > empty "dummy" file to start with? Thanks in advance for any help/guidance.
> > > 
> > > Best regards,
> > > Dan
> > > 
> > > 
> > > --------------------------------------------------------------
> > > Dr Dan Jones / British Antarctic Survey
> > > danjonesocean.com <http://www.danjonesocean.com> / @DanJonesOcean
> > > --------------------------------------------------------------
> > > 
> > > ------------------------------
> > > 
> > > Message: 2
> > > Date: Mon, 21 Sep 2020 14:56:23 +0200
> > > From: Martin Losch <Martin.Losch at awi.de>
> > > To: MITgcm Support <mitgcm-support at mitgcm.org>
> > > Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
> > >         fails at runtime in adjoint mode
> > > Message-ID: <FF2DD1AB-462E-4089-90CA-89B9552DF7D8 at awi.de>
> > > Content-Type: text/plain; charset="utf-8"
> > > 
> > > Hi Dan,
> > > 
> > > I tried this on my Linux box without MPI and I cannot reproduce your problem (I used MITgcm/verification_other.git and not the CVS MITgcm_contrib/verification_other, which appears to be out of date). I grepped the code for "m_boxmean_theta" and only found this:
> > > 
> > > (base) bkli04l006::build (master)> grep m_boxmean_theta *.f
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code.f:            if (gencost_barfile(kgen)(1:15).EQ.'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ecco_check.f:     &          (gencost_barfile(k)(1:15).EQ.'m_boxmean_theta').OR.
> > > ecco_phys.f:            if (gencost_barfile(kgen)(1:15).EQ.'m_boxmean_theta') then
> > > 
> > > (and I made sure that there's really just m_boxmean_theta). Where in your code (which routine) does the model try to read adm_boxmean_theta?
> > > 
> > > Martin
> > > 
> > > 
> > > 
> > > ------------------------------
> > > 
> > > Subject: Digest Footer
> > > 
> > > _______________________________________________
> > > MITgcm-support mailing list
> > > MITgcm-support at mitgcm.org
> > > http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
> > > 
> > > 
> > > ------------------------------
> > > 
> > > End of MITgcm-support Digest, Vol 207, Issue 10
> > > ***********************************************
> > 
> > ------------------------------
> > 
> > End of MITgcm-support Digest, Vol 207, Issue 12
> > ***********************************************
> 
> ------------------------------
> 
> End of MITgcm-support Digest, Vol 207, Issue 15
> ***********************************************