[MITgcm-devel] netcdf on sx8
Jens-Olaf Beismann
jbeismann at hpce.nec.com
Tue Dec 2 07:48:29 EST 2008
Hi Martin,
I just tried your test program on an SX-8 in Duesseldorf - no problems.
Could you check your user limits on one of your SX nodes? Here's what
ulimit -a tells me on our system:
/home/jbeismann 133: ulimit -a
time(seconds) unlimited
sfsfile(blocks) 8589934592
memory(kbytes) 134217728
data(kbytes) 134217728
stack(kbytes) 134217728
coredump(blocks) 0
sfsspace(blocks) unlimited
nofiles(descriptors) 256
ncpurestm(number) 1
cpurestm(seconds) unlimited
taskuse(number) 128
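The line most relevant to a "too many files" suspicion is `nofiles(descriptors) 256`, the per-process limit on open file descriptors. As a minimal sketch (POSIX-only, using Python's stdlib `resource` module; not part of the MITgcm or NetCDF code), the same limit can be queried programmatically:

```python
import resource

# Programmatic analogue of the "nofiles(descriptors)" line in `ulimit -a`:
# the soft/hard per-process limits on open file descriptors (POSIX).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("nofiles(descriptors)  soft=%d  hard=%d" % (soft, hard))
```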
Cheers,
Jens-Olaf
> I'm still sending this to the devel list, because someone else may have
> an idea of what to do:
>
> for the laymen: the mnc package opens one file for each tile and output
> stream, so if only dumpFreq>0 is set and the monitor output is not
> directed to netcdf, then we have for the lab_sea experiment (2 tiles)
> the following output files:
>> grid.t001.nc
>> phiHyd.0000000000.t001.nc
>> phiHydLow.0000000000.t001.nc
>> sice.0000000000.t001.nc
>> state.0000000000.t001.nc
> and
>> grid.t002.nc
>> phiHyd.0000000000.t002.nc
>> phiHydLow.0000000000.t002.nc
>> sice.0000000000.t002.nc
>> state.0000000000.t002.nc
> When the diagnostics pkg is turned on (as in the lab_sea experiment),
> we get a pair of files for each of the output streams (there are
> currently 19) opened there. These files are never all open at the same
> time; as far as I can see, there is always only one file open at a
> time, besides the STDOUT and STDERR files, which are not NetCDF files.
>
> When I reduce the number of files (by editing data.diagnostics) so that
> the total number of files created is 30, the model gives no error.
> I wrote a little test program that creates 100 files:
>> program nctest
>>
>> implicit none
>> include 'netcdf.inc'
>> integer n,m,fid,ierr
>> character*(56) fname
>>
>> m=100
>> write(*,*) 'input number of files = ', m
>> C read(*,*) m
>> do n=1,m
>> write(fname,'(A,I5.5,A)') 'foo',n,'.nc'
>> write(*,*) fname
>> ierr = nf_create(fname, NF_CLOBBER, fid)
>> if ( ierr .NE. NF_NOERR ) THEN
>> print *, '==='
>> print *, nf_strerror(ierr)
>> print *, '==='
>> else
>> print *, '=== ierr = ', ierr
>> endif
>> enddo
>> stop 'NORMAL END'
>> end
> compiled with
>> sxf90 -I/sx8/user2/awisoft/sx8/netcdf-4.0/dw/include
>> -L/sx8/user2/awisoft/sx8/netcdf-4.0/dw/lib -o nctest nctest.F -lnetcdf
> this test produces the following output:
>> input number of files = 100
>> foo00001.nc
>> === ierr = 0
>> foo00002.nc
>> === ierr = 0
>> foo00003.nc
> [ ...]
>> foo00063.nc
>> === ierr = 0
>> foo00064.nc
>> ===
>> Not enough space
>>
>> ===
>> foo00065.nc
>> ===
>> Not enough space
>>
> [...] until foo00100.nc
> which makes it pretty clear: the number of NetCDF files that can be
> created with the current netcdf build on this platform is limited (why
> the limit seems to be 30 or 31 in one case and 63 in another beats me).
>
> I can run the same program on the head-node (which is
>> Linux sx8 2.6.5-7.283-default #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64
>> ia64 ia64 GNU/Linux
> with
>> ifort -I/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/include
>> -L/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/lib -o nctest nctest.F
>> -lnetcdf
> without any problems.
>
> Do we try to fix this, or do we change data.diagnostics in lab_sea so
> that there are fewer netcdf files? I opt for the former, but how?
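One possible direction for the "fix" option (a sketch only, not the mnc implementation): keep at most one file open at a time by opening, writing, and closing each dataset per access. With plain stdlib files standing in for NetCDF datasets, even a very low descriptor limit then no longer bites:

```python
import os, resource, shutil, tempfile

def write_all_closing_each(limit, n=100):
    """Open -> write -> close each file in turn, so only one descriptor
    beyond the standard streams is in use at any moment."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    d = tempfile.mkdtemp()
    try:
        for i in range(n):
            with open(os.path.join(d, "foo%05d.nc" % i), "w") as f:
                f.write("record %d\n" % i)  # stands in for the NetCDF writes
        return n  # all n files were written without hitting the limit
    finally:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        shutil.rmtree(d)

print(write_all_closing_each(32))
```

In the Fortran test above, the corresponding change would be calling `nf_close(fid)` inside the loop after each `nf_create`.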
>
> Martin
>
> PS for Jens-Olaf and Kerstin, you can find the files in
> /home/sx8/mlosch/netcdftest
> On 1 Dec 2008, at 20:46, Jens-Olaf Beismann wrote:
>
>> Hello Martin,
>>
>> once more, slowly, for the laymen: how many files are opened there,
>> and is this number then multiplied by the number of tiles in your
>> decomposition?
>>
>> On the SX you cannot have more than 100 files open at the same time -
>> that limit applies to the number of units for Fortran I/O. I don't
>> think I have encountered this limit with NetCDF (C) yet, but it is
>> worth keeping in mind.
>>
>> Best regards,
>>
>> Jens-Olaf
>>
>> PS to Kerstin: the documentation booklets are in Hamburg, waiting to
>> be forwarded.
>>
>>> I have now figured out what the difference between lab_sea and the
>>> other experiments with netcdf is: in lab_sea the diagnostics package
>>> writes 14 netcdf files. When I reduce this number to 6, the model
>>> finishes without errors, leaving me with 30 files in the end:
>>> 2*(6 diagnostics + regular output + tave output). Redirecting the
>>> monitor output to netcdf opens additional files, and the model stops
>>> again. So apparently on our sx8 we can have only 30 netcdf files
>>> simultaneously. That's really odd, and I wonder if there's something
>>> one can do about this when the netcdf libraries are compiled (that's
>>> why there's a cc to Kerstin Fieg, who created the netcdf libraries).
>>> Martin
>>> On 27 Nov 2008, at 17:20, Martin Losch wrote:
>>>> Following up on my own previous observation:
>>>> the error for lab_sea has not gone away, and I still don't know
>>>> exactly what the problem is. But apparently, when mitgcmuv is trying
>>>> to create the file for the second tile, the netcdf library routine
>>>> NF_CREATE returns an error code (12) that translates into "Not
>>>> enough space". I still have no idea why this error should arise; I
>>>> have about 380 GB of disk space available. The exact calling
>>>> statement is also completely independent of the size of the problem:
>>>> err = NF_CREATE(fname, NF_CLOBBER, fid). The only input is fname,
>>>> which is a character string of length 500 (MNC_MAX_PATH).
>>>>
>>>> When I comment out the stop statement in mnc_handle_err, the model
>>>> finishes with many error messages from the mnc-package (mostly
>>>> invalid id) and produces a corrupted netcdf file for each of the
>>>> variables that are saved after the initial problem occurs.
>>>>
>>>> All of this happens for 2 tiles (1 tile is OK, obviously, because no
>>>> second file is opened), regardless of whether I run on 1 or 2 CPUs
>>>> (nSx=2 or nPx=2).
>>>>
>>>> To me this looks very much like a non-local problem with memory
>>>> array boundaries, but I have no clue why and where this should
>>>> happen. I have tried an array bound check with -eC, but that seemed
>>>> to be OK. Something really fishy ...
>>>>
>>>> Any comments are welcome,
>>>>
>>>> Martin
>>>>
>>>> cc to Jens-Olaf, although he cannot reply to this list.
>>>>
>>>> Oh yes, happy thanksgiving ...
>>>>
>>>> On 30 Jun 2008, at 10:28, Martin Losch wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I found a funny error with netcdf in my SX8 routine test: in
>>>>> lab_sea/run
>>>>> I get this
>>>>> > cat STDERR.*
>>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening
>>>>> 'phiHydLow.0000000000.t002.nc'
>>>>> > cat STDOUT.0001
>>>>> NetCDF ERROR:
>>>>> ===
>>>>> Not enough space
>>>>> ===
>>>>> MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>>>
>>>>> and in ideal_2D_oce
>>>>> > cat STDERR.*
>>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening
>>>>> 'flxDiag.0000036000.t004.nc'
>>>>> > tail STDOUT.0001
>>>>> NetCDF ERROR:
>>>>> ===
>>>>> Not enough space
>>>>> ===
>>>>> MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>>>
>>>>> phiHydLow is not part of the diagnostics output, and flxDiag.* is
>>>>> only the 4th output stream in data.diagnostics. By lucky accident I
>>>>> found that the second error occurs when the model calls
>>>>>> C Update the record dimension by writing the iteration number
>>>>>> CALL MNC_CW_SET_UDIM(diag_mnc_bn, -1, myThid)
>>>>>> CALL MNC_CW_RL_W_S('D',diag_mnc_bn,0,0,'T',myTime,myThid)
>>>>>> <=======
>>>>>> CALL MNC_CW_SET_UDIM(diag_mnc_bn, 0, myThid)
>>>>>> CALL MNC_CW_I_W_S('I',diag_mnc_bn,0,0,'iter',myIter,myThid)
>>>>>>
>>>>> from diagnostics_out.F
>>>>>
>>>>> "Not enough space" cannot refer to disk space, as I am well below
>>>>> my file-number and disk-space quotas.
>>>>>
>>>>> Any idea what could be going on? The other examples with netcdf
>>>>> seem to be doing fine (and in our "production" runs we generally
>>>>> don't have problems with MITgcm+netcdf ...)
>>>>>
>>>>> Martin
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>
>>
>>
>> --
>> Dr. Jens-Olaf Beismann Benchmarking Analyst
>> High Performance Computing NEC Deutschland GmbH
>> Tel: +49 431 2372063 (office) +49 160 1835289 (mobile)
>> http://www.nec.de/home/ Fax: +49 431 2372170
>> ---
>> NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
>> Geschaeftsfuehrer: Yuya Momose
>> Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743
>
>