[MITgcm-support] mpi run with cpu more than 9999

Jean-Michel Campin jmc at mit.edu
Tue Mar 10 16:45:48 EDT 2020


Hi Daquan,

From the code that you listed below, it seems that you are using an older
version of MITgcm (older than Aug 10, 2017).
It might be useful to switch to a more recent version to run on a large
number of procs.

Cheers,
Jean-Michel
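
For reference, the '****' in the file names is standard Fortran behaviour:
when an integer needs more characters than the edit descriptor's field width,
the field is filled with asterisks. The following is only a minimal sketch,
not MITgcm code; the program name FMTDEMO and the variables str4 and str5 are
made up for illustration, and 16649 is assumed as the highest process id in a
16650-process run.

```fortran
! Minimal sketch (not MITgcm source): shows why a 4-digit format
! yields '****' once the process id exceeds 9999.
! FMTDEMO, str4 and str5 are hypothetical names for illustration.
      PROGRAM FMTDEMO
      IMPLICIT NONE
      INTEGER myProcId
      CHARACTER*(4) str4
      CHARACTER*(5) str5
! Assumed highest rank in a 16650-process run (ranks start at 0)
      myProcId = 16649
! Field width 4 is too narrow for 5 digits: filled with asterisks
      WRITE(str4,'(I4.4)') myProcId
! Field width 5 fits, with zero padding for smaller ids
      WRITE(str5,'(I5.5)') myProcId
! prints STDOUT.****
      PRINT '(A,A)', 'STDOUT.', str4
! prints STDOUT.16649
      PRINT '(A,A)', 'STDOUT.', str5
      END PROGRAM FMTDEMO
```

Since every process id above 9999 maps to the same '****' suffix, all of
those processes would end up with the same file name.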

On Tue, Mar 10, 2020 at 10:47:04PM +0300, Daquan Guo wrote:
> Thanks very much Martin and Jean-Michel for your suggestions,
> 
> To update: by simply changing the format from I4.4 to I5.5 for the writing
> of the STDERR*, STDOUT* and scratch* files in the few related files (listed
> below), the model managed to run. I have not tried #define SINGLE_DISK_IO
> (in CPP_EEOPTIONS.h) yet, but it looks like a smarter solution, so I will
> give it a try.
> 
> eeboot_minimal.F:       WRITE(myProcessStr,'(I5.5)') myProcId
> eeboot_minimal.F:         WRITE(fNam,'(A,A)') 'STDERR.', myProcessStr(1:5)
> eeboot_minimal.F:         WRITE(fNam,'(A,A)') 'STDOUT.', myProcessStr(1:5)
> 
> eeset_parms.F:      WRITE(scratchFile1,'(A,I5.5)') 'scratch1.', myProcId
> eeset_parms.F:      WRITE(scratchFile2,'(A,I5.5)') 'scratch2.', myProcId
> 
> open_copy_data_file.F:      WRITE(scratchFile1,'(A,I5.5)') 'scratch1.', myProcId
> open_copy_data_file.F:      WRITE(scratchFile2,'(A,I5.5)') 'scratch2.', myProcId
> 
> _____________________________
> Daquan Guo
> Post-doctoral Fellow
> Physical Sciences and Engineering
> King Abdullah University of Science and Technology (KAUST)
> Bldg 1, Lv 4, 4700 KAUST, Thuwal 23955-6900, Jeddah, Saudi Arabia
> Mobile: +966 541048507
> 
> 
> On Tue, Mar 10, 2020 at 10:26 PM Jean-Michel Campin <jmc at mit.edu> wrote:
> 
> > Hi Daquan,
> >
> > Regarding STDOUT & STDERR files, you are right, this needs to be fixed.
> > Until now, the only time MITgcm has been run using more than 10000 procs
> > was with #define SINGLE_DISK_IO (in CPP_EEOPTIONS.h).
> > You might want to give it a try?
> >
> > But regarding "scratch" files, the ones that are used to copy any parameter
> > file (eedata, data and all data.*) should have nine digits
> > (FMT_PROC_ID = 'I9.9') for the proc number, so they should be OK.
> > Maybe your scratch file problem is coming from a different place?
> >
> > Cheers,
> > Jean-Michel
> >
> > On Tue, Mar 10, 2020 at 04:59:52PM +0100, Martin Losch wrote:
> > > Hi Daquan,
> > >
> > > I have no experience with so many processors. I assume that there will
> > > be plenty of problems with order 1e4 files open (depending on your file
> > > system), but you can fix the "*****" problem by changing the
> > > definitions in eesupp/src/eeboot_minimal.F:
> > > Look for "USE_PDAF" to see how the names of STDERR and STDOUT are
> > > changed to have longer numbers and do something similar for the default
> > > case.
> > >
> > > Alternatively you can define SINGLE_DISK_IO, but then only process 0
> > > (0000) will write a STDOUT/STDERR pair.
> > >
> > > Martin
> > >
> > > > On 10. Mar 2020, at 12:52, Daquan Guo <Daquan.Guo at kaust.edu.sa> wrote:
> > > >
> > > > Dear mitgcm community and developers,
> > > >
> > > > I am running a case with 16650 CPUs and am facing a problem.
> > > > It seems the files scratch.*, STDERR.* and STDOUT.* cannot be written
> > > > correctly if the process number exceeds 9999; instead, a single file
> > > > named 'scratch.****' is generated, which cannot be read or processed,
> > > > and the model then crashes.
> > > > I am wondering if anyone has experience with this and knows how to fix
> > > > it?
> > > > Thanks in advance.
> > > >
> > > > Best,
> > > > Daquan
> > > >
> > > > This message and its contents, including attachments are intended
> > > > solely for the original recipient. If you are not the intended
> > > > recipient or have received this message in error, please notify me
> > > > immediately and delete this message from your computer system. Any
> > > > unauthorized use or distribution is prohibited. Please consider the
> > > > environment before printing this email.
> > > > _______________________________________________
> > > > MITgcm-support mailing list
> > > > MITgcm-support at mitgcm.org
> > > > http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support