[MITgcm-support] verification_other case global_oce_cs32 fails at runtime in adjoint mode

Dan Jones dcjones.work at gmail.com
Thu Sep 24 06:56:13 EDT 2020


Hi Martin,

I've switched to the BAS local HPC. I started from scratch with fresh clones
of MITgcm and verification_other, then compiled and ran the serial case.
Unfortunately, I am still getting the same error message:

(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_diffkr.0000000000.data
(PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_diffkr.0000000000
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapredi.0000000000.data
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapredi.0000000000.data
(PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapredi.0000000000
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapgm.0000000000.data
(PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapgm.0000000000.data
(PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapgm.0000000000
(PID.TID 0000.0001)  MDS_READ_FIELD: filename: adm_horflux_vol.0000000000.data
(PID.TID 0000.0001)  MDS_READ_FIELD: File does not exist

That's really strange. However, when I run "testreport" in serial mode, I
don't get the above error! So I guess the "File does not exist" issue is
specific to my manual adjoint runs for some reason? Specifically:

run: ../verification/testreport -of ../tools/build_options/linux_amd64_scihub -command ./mitgcmuv_ad -adm -ncad -t global_oce_cs32
on : Linux bslws05 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


OPTFILE=/data/expose/MITgcm_development/MITgcm/tools/build_options/linux_amd64_scihub

Adjoint generated by TAF

default    10
G D M    C  A  F
e p a R  o  d  D
n n k u  s  G  G
2 d e n  t  r  r

Y Y Y N .. .. .. N/O   global_oce_cs32  (e=0, w=48)
Y Y Y N .. .. .. N/O   global_oce_cs32.sens
Start time:  Thu 24 Sep 10:47:11 BST 2020
End time:    Thu 24 Sep 11:05:41 BST 2020
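
For reference, the first four columns of each summary line are the build and
run stages (genmake2, make depend, make, run), as Martin explained earlier in
this thread. Below is a minimal sketch of how one might pull the first failing
stage out of such a line. The function name first_failed_stage is my own
invention and is not part of testreport; the column order is an assumption
read off the table header above.

```shell
#!/bin/sh
# Hypothetical helper (not part of testreport): report the first stage
# flagged "N" in a testreport summary line. Column order is assumed from
# the table header: Gen2 (genmake2), Dpnd (make depend), Make, Run.
first_failed_stage() {
    # Read the first four whitespace-separated fields of the summary line;
    # everything after them is collected in _rest and ignored.
    read -r gen2 dpnd mk run _rest <<EOF
$1
EOF
    if   [ "$gen2" = "N" ]; then echo "genmake2"
    elif [ "$dpnd" = "N" ]; then echo "make depend"
    elif [ "$mk"   = "N" ]; then echo "make"
    elif [ "$run"  = "N" ]; then echo "run"
    else echo "none"
    fi
}

# Example with the summary line from my run; prints "run".
first_failed_stage "Y Y Y N .. .. .. N/O   global_oce_cs32  (e=0, w=48)"
```

So in both of my summary lines the build stages succeeded and the "N" belongs
to the run stage.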

The "global_oce_cs32" run fails with this error:

 fail at i,j=  16  17 ; rStarFacC,H,eta =**********  5.158000E+03 -2.329336E+07
 fail at i,j=  16  17 ; rStarFacS,H,eta =**********  5.091000E+03  2.292655E+05 -2.329336E+07
 fail at i,j=  17  17 ; rStarFacC,H,eta =**********  5.158000E+03 -2.329336E+07
WARNING: r*FacC < hFacInf at       2 pts : bi,bj,Thid,Iter=  12   1   1     7
WARNING: r*FacS < hFacInf at       1 pts : bi,bj,Thid,Iter=  12   1   1     7
STOP in CALC_R_STAR : too SMALL rStarFac[C,W,S] !

Whereas the "global_oce_cs32.sens" run does finish, albeit with a
note/warning:

Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
STOP NORMAL END
PROGRAM MAIN: Execution ended Normally

In short: the "adm_horflux_vol" and "adm_boxmean_theta" files *do* exist in
the testreport run. But they do not exist in my manual run. I don't yet
have any idea what the difference might be.
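
One way to narrow this down might be to list which files are present in the
testreport run directory but missing from my manual run directory. This is
only a rough sketch: the function name only_in_first and the two directory
arguments are placeholders of mine, not MITgcm conventions.

```shell
#!/bin/sh
# Hypothetical helper: print the names of entries that exist in directory
# $1 (e.g. the testreport run directory) but not in directory $2 (e.g. a
# manual run directory). Both paths are supplied by the caller.
# Note: [ -e ] follows symbolic links, so a dangling link in $2 is
# reported as missing, and dangling links in $1 are skipped.
only_in_first() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue      # skip an unmatched glob or dangling link
        name=${path##*/}                # strip the directory part
        [ -e "$2/$name" ] || echo "$name"
    done
}

# Usage (both variables are placeholders to be set by the user):
#   only_in_first "$TESTREPORT_RUN_DIR" "$MANUAL_RUN_DIR"
```

If adm_horflux_vol.0000000000.data shows up in that list, the next question
is which step of the testreport run creates or links it.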

I'll try to see if anyone else can replicate my error. Any other thoughts
at the moment?

Thanks so much again for your help thus far.


Best wishes,
Dan

On Wed, Sep 23, 2020 at 5:07 PM <mitgcm-support-request at mitgcm.org> wrote:

> Send MITgcm-support mailing list submissions to
>         mitgcm-support at mitgcm.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
> or, via email, send a message with subject or body 'help' to
>         mitgcm-support-request at mitgcm.org
>
> You can reach the person managing the list at
>         mitgcm-support-owner at mitgcm.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of MITgcm-support digest..."
>
>
> Today's Topics:
>
>    1. Re: verification_other case global_oce_cs32 fails at runtime
>       in adjoint mode (Martin Losch)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 23 Sep 2020 11:54:25 +0200
> From: Martin Losch <Martin.Losch at awi.de>
> To: MITgcm Support <mitgcm-support at mitgcm.org>
> Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
>         fails at runtime in adjoint mode
> Message-ID: <63C0735F-3275-4680-8322-3D96116B10A2 at awi.de>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Dan,
>
> the "N" means that the run did not complete successfully. genmake2, make
> depend, and make all each have a Y, meaning successful completion.
>
> I reran the verification_other/global_oce_cs32 experiment on 24 CPUs on our
> Cray CS400 with ifort and TAF, and this is the result:
>
> > run: ../verification/testreport -MPI 24 -j 16 -of ../tools/build_options/linux_ia64_ifort_ollie -command 'srun ./mitgcmuv_ad' -adm -ncad -t global_oce_cs32
> > on : Linux prod-0115 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> >
> > OPTFILE=/work/ollie/mlosch/test_ollie/MITgcm_ifort/tools/build_options/linux_ia64_ifort_ollie
> >
> > Adjoint generated by TAF
> >
> > default    10
> > G D M    C  A  F
> > e p a R  o  d  D
> > n n k u  s  G  G
> > 2 d e n  t  r  r
> >
> > Y Y Y Y 16>16< 4 pass  global_oce_cs32  (e=0, w=48)
> > Y Y Y Y 16>16<16 pass  global_oce_cs32.sens
> > Start time:  Wed Sep 23 11:46:44 CEST 2020
> > End time:    Wed Sep 23 11:48:14 CEST 2020
>
> Note that my "mpirun" is actually "srun", because we use SLURM. So I have
> no issue here and I have no idea why this is different for you.
> Can I suggest that you start from scratch again: Get a new clone of MITgcm
> and verification_other and retry?
>
> Martin
>
>
>
> > On 23. Sep 2020, at 11:20, Dan Jones <dcjones.work at gmail.com> wrote:
> >
> > Hi Martin,
> >
> > Yes, apologies, I had commented out parts of the cost function for my
> > testing purposes. If I just run the unmodified case, I get:
> >
> > global fc =  0.543536615356758E+08
> >
> > But this error still persists:
> >
> > MDS_READ_FIELD: filename: adm_boxmean_theta.0000000000.data
> >
> > If I create a blank adm file for it to read, then the computation
> > continues but quickly runs into a different error. And in any case,
> > adding a blank file before running shouldn't be necessary.
> >
> > I ran the "testreport" on a couple of adjoint cases in the
> > "verification" directory. Both "tutorial_global_oce_optim" and
> > "tutorial_dic_adjoffline" passed. But when I try to run this set of
> > commands:
> >
> > cd verification_other
> >  ../verification/testreport -t global_oce_cs32/ -adm -j 4 -optfile=../tools/build_options/linux_ia64_cray_archer
> >
> > Then I get the following result:
> >
> > Y Y Y N .. .. .. N/O   global_oce_cs32/  (e=0, w=48)
> > Y Y Y N .. .. .. N/O   global_oce_cs32/.sens
> >
> > Can you remind me what the last "N" stands for again? What has failed?
> > This line of output seems suspicious:
> >
> > ../verification/testreport: line 845: 32072 Killed ./mitgcmuv_ad > output_adm.txt
> >
> > That line in "testreport" has to do with DIVA, according to the
> > comments. Unfortunately, "output_adm.txt" is blank.
> >
> > Are you able to get the adjoint part of the "input_ad.sens" experiment
> > to work, after the cost function has been calculated?
> >
> > Best wishes,
> > Dan
> >
> > On Tue, Sep 22, 2020 at 5:17 PM <mitgcm-support-request at mitgcm.org>
> wrote:
> >
> > Message: 1
> > Date: Tue, 22 Sep 2020 17:40:44 +0200
> > From: Martin Losch <Martin.Losch at awi.de>
> > To: MITgcm Support <mitgcm-support at mitgcm.org>
> > Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
> >         fails at runtime in adjoint mode
> > Message-ID: <5B5A14D1-A1DF-435B-B468-E0807AC4C437 at awi.de>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi Dan,
> >
> > I get very different output and results (e.g. global_fc= 0.543870?D+08).
> >
> > Can you try to run testreport on this?
> >
> > cd verification_other
> > ../verification/testreport -t global_oce_cs32 -adm $otheroptions
> > where $otheroptions are, e.g., -j 4, -devel, etc. If you want to use
> > your MPI job you'll need '-MPI 24' and maybe "-command 'mpirun -n 24
> > ./mitgcmuv'" to run on 24 CPUs. Check the help section of testreport
> > for details.
> >
> > The output should be "fairly" independent of the number of CPUs. I use
> > the default 24 tiles (but run sequentially on one CPU).
> >
> > Martin
> >
> >
> > > On 22. Sep 2020, at 16:52, Dan Jones <dcjones.work at gmail.com> wrote:
> > >
> > > Hi Martin,
> > >
> > > Thanks for your quick reply! I can't get the serial case to run on
> > > ARCHER, unfortunately. I think for now I'm stuck testing in parallel.
> > > When I run "grep m_boxmean_theta *.f", I get exactly the same results
> > > as you. I'm also using MITgcm/verification_other from GitHub.
> > >
> > > Here is a bit more of the output which will hopefully help. This is
> > > from a case where I tried to use the "m_horflux_vol" case instead of
> > > the "m_boxmean_theta" case. I'm using data.ecco from the input_ad.sens
> > > directory.
> > >
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  --> f_gencost =-0.348173207824978E+08 2
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 1
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 2
> > > (PID.TID 0000.0001)  --> f_genarr3d = 0.000000000000000E+00 3
> > > (PID.TID 0000.0001)  --> fc               =-0.348173207824978E+08
> > > (PID.TID 0000.0001)   early fc =  0.000000000000000E+00
> > > (PID.TID 0000.0001)   local fc =  0.000000000000000E+00
> > > (PID.TID 0000.0001)  global fc = -0.348173207824978E+08
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_diffkr.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_diffkr.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapredi.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapredi.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: xx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: opening global file: adxx_kapgm.0000000000.data
> > > (PID.TID 0000.0001)  MDS_WRITE_FIELD: it,rec,kS,kL,kH=       0     1  50   1  50 file=adxx_kapgm.0000000000
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: filename: adm_horflux_vol.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: File does not exist
> > >
> > > So it's after the cost function has been calculated, as the model is
> > > getting ready to perform the adjoint steps. It's able to read/write
> > > for the existing controls (kapgm, kapredi, diffkr). But it's
> > > apparently not creating an "ad" file for the general objective
> > > function term "horflux". That's why I was wondering if I should
> > > manually create a blank file first, as an ad-hoc fix. Any thoughts?
> > >
> > > Best wishes,
> > > Dan
> > >
> > > On Mon, Sep 21, 2020 at 8:37 PM <mitgcm-support-request at mitgcm.org>
> wrote:
> > >
> > > Message: 1
> > > Date: Mon, 21 Sep 2020 10:00:47 +0100
> > > From: Dan Jones <dcjones.work at gmail.com>
> > > To: mitgcm-support at mitgcm.org
> > > Subject: [MITgcm-support] verification_other case global_oce_cs32
> > >         fails at        runtime in adjoint mode
> > > Message-ID:
> > >         <CAPj3iHRxhUOCDT5m7H8uj8cg9dc=_
> oYVssQnvYhEA+_ALjeR6w at mail.gmail.com>
> > > Content-Type: text/plain; charset="utf-8"
> > >
> > > Hello.
> > >
> > > Apologies for the cross-posting - I've posted this as a GitHub issue,
> > > but I thought I should put it here as well.
> > >
> > > I am trying to build and test the global_oce_cs32 verification_other
> > > exercise using the code in the input_ad.sens directory. The forward
> > > case compiles and runs without error. The adjoint case (built using
> > > TAF) compiles without error, but at runtime I receive the following
> > > error in STDOUT:
> > >
> > > (PID.TID 0000.0001)  MDS_READVEC_LOC: open file: south30_maskT
> > > (PID.TID 0000.0001)  MDS_RD_REC_RL: iRec,Dim =         9          1
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: filename: adm_boxmean_theta.0000000000.data
> > > (PID.TID 0000.0001)  MDS_READ_FIELD: File does not exist
> > >
> > > and this error in STDERR:
> > >
> > > (PID.TID 0000.0001) *** ERROR ***  MDS_READ_FIELD: filename: adm_boxmean_theta.0000000000.data
> > > (PID.TID 0000.0001) *** ERROR ***  MDS_READ_FIELD: File does not exist
> > >
> > > My MITgcm source code is up-to-date with the master. I am running on
> > > archer.ac.uk <https://www.archer.ac.uk/> in parallel mode using 24
> > > cores.
> > >
> > > What should I try here? I haven't run into this error before using
> > > other adjoint setups, at least not that I can recall. Should I just
> > > create an empty "dummy" file to start with? Thanks in advance for any
> > > help/guidance.
> > >
> > > Best regards,
> > > Dan
> > >
> > >
> > > --------------------------------------------------------------
> > > Dr Dan Jones / British Antarctic Survey
> > > danjonesocean.com <http://www.danjonesocean.com> / @DanJonesOcean
> > > --------------------------------------------------------------
> > >
> > > ------------------------------
> > >
> > > Message: 2
> > > Date: Mon, 21 Sep 2020 14:56:23 +0200
> > > From: Martin Losch <Martin.Losch at awi.de>
> > > To: MITgcm Support <mitgcm-support at mitgcm.org>
> > > Subject: Re: [MITgcm-support] verification_other case global_oce_cs32
> > >         fails at runtime in adjoint mode
> > > Message-ID: <FF2DD1AB-462E-4089-90CA-89B9552DF7D8 at awi.de>
> > > Content-Type: text/plain; charset="utf-8"
> > >
> > > Hi Dan,
> > >
> > > I tried this on my linux box without MPI and I cannot reproduce your
> > > problem (I used MITgcm/verification_other.git and not the CVS
> > > MITgcm_contrib/verification_other, which appears to be out of date). I
> > > grepped the code for "m_boxmean_theta" and only found this:
> > >
> > > (base) bkli04l006::build (master)> grep m_boxmean_theta *.f
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code_ad.f:     $'m_boxmean_theta') then
> > > ad_input_code.f:            if (gencost_barfile(kgen)(1:15).EQ.'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ad_taf_output.f:     $'m_boxmean_theta') then
> > > ecco_check.f:     & (gencost_barfile(k)(1:15).EQ.'m_boxmean_theta').OR.
> > > ecco_phys.f:            if (gencost_barfile(kgen)(1:15).EQ.'m_boxmean_theta') then
> > >
> > > (and I made sure that this is really just m_boxmean_theta). Where in
> > > your code (which routine) does the model try to read
> > > adm_boxmean_theta?
> > >
> > > Martin
> > > > _______________________________________________
> > > > MITgcm-support mailing list
> > > > MITgcm-support at mitgcm.org
> > > > http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
> > > End of MITgcm-support Digest, Vol 207, Issue 10
> > > ***********************************************
> > End of MITgcm-support Digest, Vol 207, Issue 12
> > ***********************************************
> End of MITgcm-support Digest, Vol 207, Issue 15
> ***********************************************
>