[MITgcm-support] error while writing pickup files with Cray compilers

Martin Losch Martin.Losch at awi.de
Wed Apr 12 04:25:03 EDT 2017


Hi Laura,

I don’t have an answer, but I can share my experience with Cray compilers on two different Cray computers (at the end of the email I have two suggestions that may help):
The first is at ECMWF. I think it is pretty similar to the ARCHER system (XC30 or XC40 with similar CPUs). There is an optfile for that machine: linux_ia64_cray_cca
The other is our “own” Cray CS400 “ollie”, which has similar CPUs but a different network (or whatever the actual difference is). That system is pretty unique (I guess it’s the first time Cray delivered something like that), but also quite buggy. The optfile is linux_ia64_cray_ollie
Apart from the usual problems with HPC clusters, the Cray compiler was never really a problem on either of these systems, and it is usually a little faster than the Intel compiler, where available (a few percent, maybe). I never tried gfortran. I glanced through the report that you linked, and I am a little surprised by the problems described there (assuming that ARCHER does not have too many peculiarities).

A lot of the performance depends on the details of how you submit the job. For example, in contrast to what is described in the report, hybrid jobs with MPI and multithreading (OpenMP) are, in my experience, usually a little faster (insignificantly so, so I would not necessarily recommend going through the trouble of setting them up). They do require fiddling with the details of the CPU binding (options or environment variables), which is done in “mpirun”, “srun”, “aprun” or whatever is used on the system to start the jobs (and cannot be done within MITgcm). For example, I get the best results on our computer (ollie) with
export OMP_PROC_BIND=close
# for some reason --distribution=block:block is faster
srun --cpu_bind=cores --distribution=block:block ./mitgcmuv
I am not sure to what extent that can be carried over to other computers.

By comparing the opt files I noticed that for ARCHER you have 
DEFINES='[…] -D_BYTESWAPIO […]'
but in the other two optfiles (both of which are from the same author, myself, so their similarity is not surprising), this CPP flag is not set; instead I use the compiler flag “-h byteswapio”.
The difference is that -D_BYTESWAPIO enables Fortran code for byte swapping (from the generic IEEE little-endian on a Linux machine to the MITgcm standard IEEE big-endian byte ordering), while the compiler flag does this reordering somehow internally. I would always prefer the compiler flag, because it usually uses optimized code and the other option can be slower. Here’s a comment line from the code:
"C Created: 05/05/99 adcroft at mit.edu (This is an unfortunate hack!!)"
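
To illustrate what the byte swapping amounts to, here is a minimal Python sketch. It is not MITgcm code, and the helper name swap_real4 is made up for this example. It reverses the bytes of each 4-byte word, which is what converts a single-precision value between little-endian and big-endian representations:

```python
import struct


def swap_real4(raw: bytes) -> bytes:
    """Reverse the byte order of every 4-byte word in raw (hypothetical helper)."""
    if len(raw) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


little = struct.pack("<f", 1.0)   # little-endian, as on a typical Linux/x86 machine
big = struct.pack(">f", 1.0)      # big-endian, the MITgcm standard byte order

print(little.hex())               # 0000803f
print(big.hex())                  # 3f800000
print(swap_real4(little) == big)  # True
```

Whether the swap happens in Fortran code (-D_BYTESWAPIO) or inside the compiler's I/O library (-h byteswapio), the bytes on disk should end up the same; only the place where the reordering is done differs.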

Since the pickup seems to be the only file that you write as MDS output, the problem may be related to this.

The other possible issue that I see (from your “data” file) is that this
# Save a single pickup file (no tiled)
 globalFiles=.TRUE.,
# useSingleCPUIO = .TRUE.,

does not always work (depending on the file system). Instead try
# globalFiles=.TRUE.,
 useSingleCPUIO = .TRUE.,
On many systems this is even faster (at least on the Cray systems that I have access to).

I hope that helps a little.
Martin


> On 11 Apr 2017, at 23:58, Jody Klymak <jklymak at uvic.ca> wrote:
> 
> Hopefully people who really understand the compiler issues will pipe up.  I assume you are running w/ MPI - maybe the parallel writing of the mds files is failing somehow? But if you have a compiler issue:
> 
> - Did you try running w/ the optimizations turned off?  (edit the `linux_ia64_cray_archer` file)
> - in `data` you should set `debugLevel=5` or something large like that and see if there are clues in the output.
> 
> Good luck!  Jody
> 
> 
>> On 11 Apr 2017, at  13:39 PM, Laura Cimoli <laura.cimoli at physics.ox.ac.uk> wrote:
>> 
>> Hi Jody,
>> 
>> yes, it does start writing the pickup file. 
>> I also ran a few other tests (of course much shorter than 100 y!), and I always got the same error. Also, the configuration works with the GNU compiler, but if the Cray compiler is really 10x faster it would be nice to use it!
>> 
>> I wonder if the pickup file is overwritten or if it is appending to the file...? Maybe it is doing something funny when trying to append to it?
>> 
>> Thanks,
>> Laura
>> 
>>  
>> From: Jody Klymak [jklymak at uvic.ca]
>> Sent: 11 April 2017 21:27
>> To: mitgcm-support at mitgcm.org
>> Subject: Re: [MITgcm-support] error while writing pickup files with Cray compilers
>> 
>> Hi Laura,
>> 
>> Are you sure the mitgcm can write to the directory it is trying to write to?  Does it *start* to write the pickup file?  
>> 
>> These are just dumb questions.  Maybe it truly is a compiler issue, but it seems more likely it is a configuration issue.   Obviously, for testing I’d suggest writing a pickup file well before 100 y has passed.
>> 
>> Good luck, 
>> 
>> Jody
>> 
>> 
>>> On 11 Apr 2017, at  11:21 AM, Laura Cimoli <laura.cimoli at physics.ox.ac.uk> wrote:
>>> 
>>> Hello Jody,
>>> 
>>> sorry I forgot to mention that all my other outputs are in netcdf format, and they look fine.
>>> The data file is attached.
>>> 
>>> Thanks, 
>>> Laura
>>> 
>>> From: Jody Klymak [jklymak at uvic.ca]
>>> Sent: 11 April 2017 19:01
>>> To: mitgcm-support at mitgcm.org
>>> Subject: Re: [MITgcm-support] error while writing pickup files with Cray compilers
>>> 
>>> Are you able to write any mds files?  i.e. did the T.000000000000.data file write?  Can you supply your `data` file?
>>> 
>>> Cheers,   Jody
>>> 
>>> 
>>>> On 11 Apr 2017, at  10:54 AM, Laura Cimoli <laura.cimoli at physics.ox.ac.uk> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> this question is relevant mainly for ARCHER users, but of course any help is appreciated!
>>>> 
>>>> I have recently tried to use the Cray compilers instead of the GNU ones, since the model should run much faster according to what is stated here. I have to admit I have not read that report in detail, but I hope that there are no particular constraints on the use of Cray compilers on ARCHER.
>>>> 
>>>> I used the linux_ia64_cray_archer optfile, as indicated in the report.
>>>> 
>>>> At first glance, the model compiles without any odd warnings and seems to run without any problem, but it crashes when writing the pickup file. This is the message I got (the whole error file is attached):
>>>> 
>>>> lib-5058 : UNRECOVERABLE library error
>>>> A read system call read less data than expected.
>>>> 
>>>> Encountered during a direct access unformatted WRITE to unit 9
>>>> Fortran unit 9 is connected to a direct unformatted unblocked file:
>>>> "pickup.0001752000.data"
>>>> 
>>>> _pmiu_daemon(SIGCHLD): [NID 02940] [c7-1c0s15n0] [Tue Apr 11 08:49:37 2017] PE RANK 69 exit signal Aborted
>>>> [NID 02940] 2017-04-11 08:49:37 Apid 26123498: initiated application termination
>>>> 
>>>> 
>>>> I am writing the permanent pickup file, and I don't have any temporary pickup file.
>>>> 
>>>> The only weird warning I have noticed in the genmake.log file (attached) is below, but I don't know whether it is related to the problem reported above:
>>>> 
>>>> running: check_HAVE_SIGREG() 
>>>> cc -c genmake_tc_1.c 
>>>> CC-513 craycc: WARNING File = genmake_tc_1.c, Line = 22
>>>> A value of type "void *" cannot be assigned to an entity of type
>>>> "void (*)(int, siginfo_t *, void *)".
>>>> s.sa_sigaction = (void *)killhandler;
>>>> ^
>>>> Total warnings detected in genmake_tc_1.c: 1
>>>> program hello
>>>> integer anint
>>>> common /iv/ anint
>>>> external sigreg
>>>> call sigreg(anint)
>>>> end
>>>> ftn -o genmake_tc genmake_tc_2.f genmake_tc_1.o
>>>> --> set HAVE_SIGREG='t'
>>>> 
>>>> 
>>>> Does anyone know why the Cray compilers return this error while writing the output binary file?
>>>> 
>>>> Many thanks,
>>>> Laura
>>>> <genmake.log><output_000.e4441213>_______________________________________________
>>>> MITgcm-support mailing list
>>>> MITgcm-support at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>> 
>>> --
>>> Jody Klymak    
>>> http://web.uvic.ca/~jklymak/
>>> 
>>> <data>
>> 
> 



