[MITgcm-support] mpi run with cpu more than 9999
Jean-Michel Campin
jmc at mit.edu
Tue Mar 10 16:45:48 EDT 2020
Hi Daquan,
From the code that you listed below, it seems that you are using an older
version of MITgcm (older than Aug 10, 2017).
It might be useful to switch to a more recent version to run on a large
number of procs.
Cheers,
Jean-Michel
On Tue, Mar 10, 2020 at 10:47:04PM +0300, Daquan Guo wrote:
> Thanks very much Martin and Jean-Michel for your suggestions,
>
> As an update: simply changing the format from I4.4 to I5.5 where the files
> STDERR*, STDOUT* and scratch* are written, in the few related files (listed
> below), got the model to run. I have not tried #define SINGLE_DISK_IO (in
> CPP_EEOPTIONS.h) yet, but it looks like a smarter solution, so I will give
> it a try.
>
> eeboot_minimal.F: WRITE(myProcessStr,'(I5.5)') myProcId
> eeboot_minimal.F: WRITE(fNam,'(A,A)') 'STDERR.', myProcessStr(1:5)
> eeboot_minimal.F: WRITE(fNam,'(A,A)') 'STDOUT.', myProcessStr(1:5)
>
> eeset_parms.F: WRITE(scratchFile1,'(A,I5.5)') 'scratch1.', myProcId
> eeset_parms.F: WRITE(scratchFile2,'(A,I5.5)') 'scratch2.', myProcId
>
> open_copy_data_file.F: WRITE(scratchFile1,'(A,I5.5)') 'scratch1.', myProcId
> open_copy_data_file.F: WRITE(scratchFile2,'(A,I5.5)') 'scratch2.', myProcId
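
For reference, the effect of the format change listed above can be reproduced
in isolation. The following is a minimal standalone sketch, not MITgcm code:
the program name and the hard-coded rank 16649 (the highest rank of a
16650-process run, assuming ranks start at 0) are illustrative assumptions.
It shows that an I4.4 edit descriptor fills its field with asterisks once the
value needs five digits, while I5.5 does not.

```fortran
program proc_id_width
  implicit none
  integer :: myProcId
  character(len=4) :: str4
  character(len=5) :: str5

  ! Assumed example rank: the last of 16650 processes, counting from 0
  myProcId = 16649

  ! I4.4 has a field width of 4; a 5-digit value does not fit,
  ! so the whole field is filled with asterisks.
  write(str4,'(I4.4)') myProcId

  ! I5.5 widens the field to 5 and zero-pads to at least 5 digits.
  write(str5,'(I5.5)') myProcId

  print '(A,A)', 'STDOUT.', str4
  print '(A,A)', 'STDOUT.', str5

  ! Small ranks remain zero-padded with the wider format.
  write(str5,'(I5.5)') 42
  print '(A,A)', 'STDOUT.', str5
end program proc_id_width
```

Built with a standard Fortran compiler, this should print STDOUT.****,
STDOUT.16649 and STDOUT.00042, which matches the 'scratch.****' symptom
reported at the start of the thread.
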
>
> _____________________________
> Daquan Guo
> Post-doctoral Fellow
> Physical Sciences and Engineering
> King Abdullah University of Science and Technology (KAUST)
> Bldg 1, Lv 4, 4700 KAUST, Thuwal 23955-6900, Jeddah, Saudi Arabia
> Mobile: +966 541048507
>
>
> On Tue, Mar 10, 2020 at 10:26 PM Jean-Michel Campin <jmc at mit.edu> wrote:
>
> > Hi Daquan,
> >
> > Regarding STDOUT & STDERR files, you are right, this needs to be fixed.
> > Until now, the only time MITgcm has been run using more than 10000 procs
> > was with #define SINGLE_DISK_IO (in CPP_EEOPTIONS.h).
> > You might want to give it a try?
> >
> > But regarding "scratch" files, the ones that are used to copy any parameter
> > file (eedata, data and all data.*) should have nine digits for the proc
> > number (FMT_PROC_ID = 'I9.9'), so they should be OK.
> > Maybe your scratch file problem is coming from a different place?
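
A minimal sketch of why a nine-digit format is safe at this process count,
assuming the format is kept in one named constant and spliced into each WRITE.
The constant name fmtProcId and the program itself are hypothetical stand-ins
for illustration; how MITgcm actually uses FMT_PROC_ID may differ.

```fortran
program scratch_names
  implicit none
  ! Hypothetical stand-in for the FMT_PROC_ID setting quoted above
  character(len=*), parameter :: fmtProcId = 'I9.9'
  character(len=32) :: scratchFile1
  integer :: myProcId

  ! Assumed example rank: the last of 16650 processes, counting from 0
  myProcId = 16649

  ! Build the format '(A,I9.9)' from the shared constant, so every
  ! file name uses the same field width for the process number.
  write(scratchFile1,'(A,'//fmtProcId//')') 'scratch1.', myProcId

  print '(A)', trim(scratchFile1)
end program scratch_names
```

With a nine-digit field the rank 16649 is written as scratch1.000016649, so
no asterisks appear for any process count below 10^9.
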
> >
> > Cheers,
> > Jean-Michel
> >
> > On Tue, Mar 10, 2020 at 04:59:52PM +0100, Martin Losch wrote:
> > > Hi Daquan,
> > >
> > > I have no experience with so many processors. I assume that there will
> > > be plenty of problems with order 1e4 files open (depending on your file
> > > system), but you can fix the "*****" problem by changing the
> > > definitions in eesupp/src/eeboot_minimal.F:
> > > Look for "USE_PDAF" to see how the names of STDERR and STDOUT are
> > > changed to have longer numbers and do something similar for the default
> > > case.
> > >
> > > Alternatively you can define SINGLE_DISK_IO, but then only process 0
> > > (0000) will write a STDOUT/STDERR pair.
> > >
> > > Martin
> > >
> > > > On 10. Mar 2020, at 12:52, Daquan Guo <Daquan.Guo at kaust.edu.sa> wrote:
> > > >
> > > > Dear mitgcm community and developers,
> > > >
> > > > I am running a case with 16650 cpus and facing a problem.
> > > > It seems the files scratch.*, STDERR.* and STDOUT.* cannot be written
> > > > properly once the process number exceeds 9999; instead a single file
> > > > named 'scratch.****' is generated, which cannot be read or processed,
> > > > and then the model crashes.
> > > > I am wondering if anyone has experience with this and knows how to
> > > > fix it?
> > > > Thanks in advance.
> > > >
> > > > Best,
> > > > Daquan
> > > >
> > > > This message and its contents, including attachments are intended
> > > > solely for the original recipient. If you are not the intended
> > > > recipient or have received this message in error, please notify me
> > > > immediately and delete this message from your computer system. Any
> > > > unauthorized use or distribution is prohibited. Please consider the
> > > > environment before printing this email.
> > > > _______________________________________________
> > > > MITgcm-support mailing list
> > > > MITgcm-support at mitgcm.org
> > > > http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support