[MITgcm-devel] [altMITgcm/MITgcm66h] Bugfix/scratch files (#11)

Jean-Michel Campin jmc at mit.edu
Tue Jul 25 12:17:50 EDT 2017


Hi Martin,

What you propose here is not the best solution but it will not break the
coupled setups that currently work.

An other possibility is to add an argument (e.g., procId ?) to S/R EESET_PARMS,
and use this one to set scratchFile1 & scratchFile2;
Then call EESET_PARMS( doReport, myProcId ) in eeboot.F
And in eeboot_minimal.F, in the #ifdef COMPONENT_MODULE block, 
     call EESET_PARMS( doReport, mpiMyWId )
should work (because MPI_BARRIER is called in eeboot_minimal.F after
 any of the calls to EESET_PARMS).
Still have to do something about the #ifdef ALLOW_OASIS block in eeboot_minimal.F
but was hoping you might have some suggestions there since you added this OASIS thing.

An other thing:
Are we 100% sure that closing a scratch unit file with status "delete" 
is completly standard on all platforms & compilers ? If not, we could
test just this independently (i.e., check-in and see how daily test run). 
The reason is that when someone chose to use USE_FORTRAN_SCRATCH_FILES,
(which is not going to be the default and therefore not tested) we need to be 
sure that the close instruction is OK.

Cheers,
Jean-Michel

On Tue, Jul 25, 2017 at 04:09:03PM +0200, Martin Losch wrote:
> This is all I can come up with:
>       IF ( .NOT.doReport ) THEN
> C     called from eeboot_minimal.F before myProcId is set, so we have to
> C     use scratch files and keep our fingers crossed
>        OPEN(UNIT=scrUnit1,STATUS='SCRATCH')
>        OPEN(UNIT=scrUnit2,STATUS='SCRATCH')
>       ELSE
> C     After opening regular files here, they are closed with STATUS='DELETE'
>        WRITE(scratchFile1,'(A,'//FMT_MYPROC_ID//')') 'scratch1.', myProcId
>        WRITE(scratchFile2,'(A,'//FMT_MYPROC_ID//')') 'scratch2.', myProcId
>        OPEN(UNIT=scrUnit1, FILE=scratchFile1, STATUS='UNKNOWN')
>        OPEN(UNIT=scrUnit2, FILE=scratchFile2, STATUS='UNKNOWN')
>       ENDIF
> But that kind of beats the purpose. What do you think?
> 
> Martin
> 
> > On 25. Jul 2017, at 15:52, Martin Losch <Martin.Losch at awi.de> wrote:
> > 
> > Hi Jean-Michel,
> > 
> > I was not aware of that (because have never run any coupled set up) and I have no idea how to address this. What do you suggest to do?
> > 
> > Martin
> > 
> >> On 25. Jul 2017, at 15:40, Jean-Michel Campin <jmc at mit.edu> wrote:
> >> 
> >> Hi Martin,
> >> 
> >> Just to clarify:
> >> when EESET_PARMS is called from eeboot_minimal.F, myProcId is not yet set
> >> (it's just equal to 0 for all processes).
> >> This feature is used for atm+oce coupled set-up.
> >> Now I agree that there is currently a problem if one want to use
> >> either TARGET_BGL or TARGET_CRAYXT in coupled set-up, but since it's
> >> not the default, it's less a problem.
> >> Now if the default is changed, the coupled set-up will no longer run
> >> (or worst, sometime it will and sometimes not depending on a race 
> >> between the different procs of the same component).
> >> In short, before changing the default, this issue need to be addressed.
> >> 
> >> Regarding point #2, I agree that closing both scrUnit1 & scrUnit2 
> >> units with "delete" is a better test than just changing one of them.
> >> 
> >> Cheers,
> >> Jean-Michel
> >> 
> >> On Tue, Jul 25, 2017 at 03:10:11PM +0200, Martin Losch wrote:
> >>> Hi Jean-Michel,
> >>> 
> >>> I am confused, I am afraid, and I didn???t/don???t understand your suggestions completely.
> >>> 
> >>> (1) issue #1, as you call it, is something that I do not understand and therefore don???t want to touch. You are refering to this line:
> >>>     IF ( .NOT.doReport )
> >>>    &     STOP 'ABNORMAL END: S/R EESET_PARMS: myProcId unset???
> >>> which is exclusive to the SINGLE_DISK_IO code. I have no idea what would happen, if we do this for all cases and would like to leave it to people who know better. Can we leave this aside and deal with it later?
> >>> 
> >>> (2) close one unit, i.e. just close, say scrUnit1 for a test? Fine with me, but what do we gain? I expect that it will only lead to scratchFile2.???? in the run directories. Do we want that? I am attaching, what I had planned to commit for your inspection and scrutiny ??? I tested it on 1CPU linux and Apple computers (with gfortran)
> >>> 
> >>> Martin
> >>> 
> >>> PS. if you are uncomfortable with this entire change, we can leave it as it is.
> >> 
> >> 
> >>> 
> >>>> On 25. Jul 2017, at 15:00, Jean-Michel Campin <jmc at mit.edu> wrote:
> >>>> 
> >>>> Hi Martin,
> >>>> 
> >>>> 1) What about issue #1
> >>>>>> I think this need to be addressed.
> >>>> 
> >>>> 2) Why not stating to just change one "close" statatement as suggested before ?
> >>>> 
> >>>> Jean-Michel
> >>>> 
> >>>> On Tue, Jul 25, 2017 at 09:18:09AM +0200, Martin Losch wrote:
> >>>>> Hi again,
> >>>>> 
> >>>>> if you agree, I will check in a new version of eeset_parms.F where I change the default to what is now done for TARGET_CRAYXT and close the ???scratchFile1/2??? with STATUS=???DELETE??? and we???ll see how this goes.
> >>>>> Everything else (open_copy_data_file with 60+ $pkg_readme.F, and format macro for I9.9 or similar) will follow, if the checks are OK next Monday. 
> >>>>> Does this sound like a plan?
> >>>>> 
> >>>>> Martin
> >>>>> 
> >>>>>> On 24. Jul 2017, at 18:15, Jean-Michel Campin <jmc at mit.edu> wrote:
> >>>>>> 
> >>>>>> Hi Martin,
> >>>>>> 
> >>>>>> I am not concerned about SINGLE_DISK_IO but, as written in the comment,
> >>>>>>>>> C Stop if called from eeboot_minimal.F before myProcId is set
> >>>>>> when EESET_PARMS is called from eeboot_minimal.F
> >>>>>> I think this need to be addressed.
> >>>>>> 
> >>>>>> Otherwise, we could try to change (in one place only) the close statement,
> >>>>>> to see how this goes. And eeset_parms.F would be a good candidate.
> >>>>>> 
> >>>>>> Cheers,
> >>>>>> Jean-Michel
> >>>>>> 
> >>>>>> On Mon, Jul 24, 2017 at 04:48:06PM +0200, Martin Losch wrote:
> >>>>>>> Hi Jean-Michel,
> >>>>>>> 
> >>>>>>> maybe better to move this to the devel list, because it is really something that we can now only test in the ???real repository???:
> >>>>>>> 
> >>>>>>> - Previously the STOP statement was exclusive to SINGLE_DISK_IO, I did not change that, see the CVS-repository. Do we have a single regular test anywhere, where we check this SINGLE_DISK_IO code?
> >>>>>>> - yes, it's exactly like the TARGET_BGL or TARGET_CRAYXT code, except that now, the associated close statements have a CLOSE(iUnit,STATUS='DELETE'), so that the files are deleted. This STATUS='DELETE' is default for files that are opened with STATUS='SCRATCH???. Now i just do it implicitly.
> >>>>>>> 
> >>>>>>> Maybe I can check-in all the close statements for now (they should not change anything, because for the ???SCRATCH??? status files, it???s the default anyway), have everything tested overnight and then do the other two changes in eeset_parms and open_copy_data_file
> >>>>>>> OR
> >>>>>>> (which may be easier), just do this for eeset_parms.F because that???s simple and requires only changing one file.
> >>>>>>> 
> >>>>>>> Martin
> >>>>>>> 
> >>>>>>>> On 24. Jul 2017, at 16:37, jm-c <notifications at github.com> wrote:
> >>>>>>>> 
> >>>>>>>> Hi,
> >>>>>>>> 
> >>>>>>>> It requires more tests. For instance, I read in "eeset_parms.F", line 137:
> >>>>>>>>> C Stop if called from eeboot_minimal.F before myProcId is set
> >>>>>>>> this will also apply to scratchFile1 & 2 settings.
> >>>>>>>> 
> >>>>>>>> Also, it seems that the new way is similar to what was done with
> >>>>>>>> TARGET_BGL or TARGET_CRAYXT defined (or may be I missed something).
> >>>>>>>> May be keeping the old way (i.e., OPEN(UNIT=scrUnit1,STATUS='SCRATCH'))
> >>>>>>>> but changing the default would allow -- in case some problems occur on/with
> >>>>>>>> some disk systems -- to continue to use what currently works.
> >>>>>>>> 
> >>>>>>>> Cheers,
> >>>>>>>> Jean-Michel
> >>>>>>>> 
> >>>>>>>> On Mon, Jul 24, 2017 at 10:41:53AM +0000, christophernhill wrote:
> >>>>>>>>> @mjlosch I think the change look good generally. I need to talk to @jm-c to see if there are catches. At present we don't have testreport pulling from github for different machines, but we should start that soon I think. That means its hard to check if this breaks something.
> >>>>>>>>> 
> >>>>>>>>> Best way to check this is to put in CVS fairly soon, lets see what @jm-c says first though. For a real commit though it would be good to use I9.9 for myProcId (we are running on 100,000+ cores these days). An alternative would be to use a funky internal write and macros for the format string e.g
> >>>>>>>>> in a header file (EEMACORS.h ?)
> >>>>>>>>> ```
> >>>>>>>>> #define FMT_MYPROC_ID I9.9
> >>>>>>>>> ```
> >>>>>>>>> and then
> >>>>>>>>> ```
> >>>>>>>>> WRITE(FMT, ......
> >>>>>>>>> ```
> >>>>>>>>> which is a bit annoying, but more flexible going forward. 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> -- 
> >>>>>>>>> You are receiving this because you were mentioned.
> >>>>>>>>> Reply to this email directly or view it on GitHub:
> >>>>>>>>> https://github.com/altMITgcm/MITgcm66h/pull/11#issuecomment-317385174
> >>>>>>>> ???
> >>>>>>>> You are receiving this because you were mentioned.
> >>>>>>>> Reply to this email directly, view it on GitHub, or mute the thread.
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> _______________________________________________
> >>>>>>> MITgcm-devel mailing list
> >>>>>>> MITgcm-devel at mitgcm.org
> >>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> MITgcm-devel mailing list
> >>>>>> MITgcm-devel at mitgcm.org
> >>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> MITgcm-devel mailing list
> >>>>> MITgcm-devel at mitgcm.org
> >>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>>> 
> >>>> _______________________________________________
> >>>> MITgcm-devel mailing list
> >>>> MITgcm-devel at mitgcm.org
> >>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>> 
> >> 
> >>> _______________________________________________
> >>> MITgcm-devel mailing list
> >>> MITgcm-devel at mitgcm.org
> >>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >> 
> >> 
> >> _______________________________________________
> >> MITgcm-devel mailing list
> >> MITgcm-devel at mitgcm.org
> >> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> > 
> > 
> > _______________________________________________
> > MITgcm-devel mailing list
> > MITgcm-devel at mitgcm.org
> > http://mitgcm.org/mailman/listinfo/mitgcm-devel
> 
> 
> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel



More information about the MITgcm-devel mailing list