[MITgcm-devel] netcdf on sx8

Jens-Olaf Beismann jbeismann at hpce.nec.com
Tue Dec 2 07:48:29 EST 2008


Hi Martin,

I just tried your test program on an SX-8 in Duesseldorf - no problems. 
Could you check your user limits on one of your SX nodes? Here's what 
ulimit -a tells me on our system:

/home/jbeismann 133: ulimit -a
time(seconds)        unlimited
sfsfile(blocks)      8589934592
memory(kbytes)       134217728
data(kbytes)         134217728
stack(kbytes)        134217728
coredump(blocks)     0
sfsspace(blocks)     unlimited
nofiles(descriptors) 256
ncpurestm(number)    1
cpurestm(seconds)    unlimited
taskuse(number)      128
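
The nofiles(descriptors) row above is the per-process limit on open file descriptors. As a rough cross-check, the same limit can be read programmatically on a POSIX system where Python is available. That is an assumption and may not hold on the SX nodes themselves. The helper name nofile_limits is made up for this sketch.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    # RLIMIT_NOFILE is the POSIX resource behind the open-descriptor limit.
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    soft, hard = nofile_limits()
    # resource.RLIM_INFINITY marks an unlimited value.
    print("soft limit:", soft, "hard limit:", hard)
```

Comparing the soft limit between a node where the test passes and one where it fails is a quick way to rule the descriptor limit in or out.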

Cheers,

Jens-Olaf

> I am still sending this to the devel list, because someone else may 
> have an idea what to do:
> 
> for the layman: the mnc-pkg opens 1 file for each tile and output 
> stream, so if only dumpFreq>0 and monitor output is not directed to 
> netcdf, then we have for the lab_sea experiment (2tiles) the following 
> output files:
>> grid.t001.nc
>> phiHyd.0000000000.t001.nc
>> phiHydLow.0000000000.t001.nc
>> sice.0000000000.t001.nc
>> state.0000000000.t001.nc
> and
>> grid.t002.nc
>> phiHyd.0000000000.t002.nc
>> phiHydLow.0000000000.t002.nc
>> sice.0000000000.t002.nc
>> state.0000000000.t002.nc
> When the diagnostics pkg is turned on (as in the lab_sea experiment) 
> then we get a pair for each of the output streams (there are currently 
> 19) opened there. These files are NOT opened at the same time ever, as 
> far as I can see there is always only one file open at a time, besides 
> the STDOUT and STDERR files, which are not NetCDF files.
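
The file counts described above follow from one file per tile and output stream. A minimal sketch of that arithmetic, using only the numbers quoted in this message (2 tiles, the 5 regular outputs listed, 19 diagnostics streams); the function name mnc_file_count is hypothetical and not part of MITgcm:

```python
def mnc_file_count(n_tiles, n_streams):
    """Files written when every output stream gets one file per tile."""
    return n_tiles * n_streams

n_tiles = 2                      # lab_sea decomposition described above
regular = ["grid", "phiHyd", "phiHydLow", "sice", "state"]
n_diag_streams = 19              # diagnostics output streams quoted above

regular_files = mnc_file_count(n_tiles, len(regular))
diag_files = mnc_file_count(n_tiles, n_diag_streams)
# Prints the regular, diagnostics, and combined file counts.
print(regular_files, diag_files, regular_files + diag_files)
```

This counts files created over the run, not files open at any one moment.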
> 
> When I reduce the number of files (by editing data.diagnostics) so that 
> the total number of files created is 30, the model gives no error.
> I wrote a little test program that creates 100 files:
>>       program nctest
>>
>>       implicit none
>>       include 'netcdf.inc'
>>       integer n,m,fid,ierr
>>       character*(56) fname
>>
>>       m=100
>>       write(*,*) 'input number of files = ', m
>> C      read(*,*) m
>>       do n=1,m
>>        write(fname,'(A,I5.5,A)') 'foo',n,'.nc'
>>        write(*,*) (fname)
>>        ierr = nf_create(fname,NF_CLOBBER,fid)
>>        if ( ierr .NE. NF_NOERR ) THEN
>>         print *, '==='
>>         print *, nf_strerror(ierr)
>>         print *, '==='
>>        else
>>         print *, '=== ierr = ', ierr
>>        endif
>>       enddo
>>       stop 'NORMAL END'
>>       end
> compiled with
>> sxf90 -I/sx8/user2/awisoft/sx8/netcdf-4.0/dw/include 
>> -L/sx8/user2/awisoft/sx8/netcdf-4.0/dw/lib -o nctest nctest.F -lnetcdf
> this test produces the following output:
>>  input number of files =   100
>>  foo00001.nc
>>  === ierr =   0
>>  foo00002.nc
>>  === ierr =   0
>>  foo00003.nc
> [ ...]
>>  foo00063.nc
>>  === ierr =   0
>>  foo00064.nc
>>  ===
>>  Not enough space
>>
>>  ===
>>  foo00065.nc
>>  ===
>>  Not enough space
>>
> [...] until foo00100.nc
> which makes it pretty clear: The number of NetCDF files that can be 
> created with the current netcdf on this platform is limited (why it 
> seems to be 30 or 31 in one case and 63 in another one beats me).
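
Note that the Fortran test above never calls nf_close, so every file it creates stays open until the program ends. To separate a general open-file limit from a NetCDF-specific one, the same experiment can be repeated with plain files. The sketch below assumes a system with Python; it uses ordinary files rather than NetCDF, and the name count_openable is made up for illustration.

```python
import os
import tempfile

def count_openable(directory, max_files):
    """Open up to max_files new files without closing any; return how many succeeded."""
    handles = []
    try:
        for n in range(1, max_files + 1):
            path = os.path.join(directory, "foo%05d.dat" % n)
            try:
                handles.append(open(path, "wb"))
            except OSError as exc:
                # Typically EMFILE once the descriptor limit is reached.
                print("open failed at file", n, ":", exc)
                break
    finally:
        # Release every descriptor before returning.
        for handle in handles:
            handle.close()
    return len(handles)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("files held open at once:", count_openable(tmp, 100))
```

If plain files also stop near the same count, the limit is probably in the system or runtime rather than in the NetCDF build.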
> 
> I can run the same program on the head-node (which is
>> Linux sx8 2.6.5-7.283-default #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64 
>> ia64 ia64 GNU/Linux
> with
>> ifort -I/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/include 
>> -L/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/lib -o nctest nctest.F 
>> -lnetcdf
> without any problems.
> 
> Do we try to fix this, or do we change the data.diagnostics in lab_sea, 
> so that there are fewer netcdf files? I opt for the former, but how?
> 
> Martin
> 
> PS for Jens-Olaf and Kerstin, you can find the files in 
> /home/sx8/mlosch/netcdftest
> On 1 Dec 2008, at 20:46, Jens-Olaf Beismann wrote:
> 
>> Hello Martin,
>>
>> once more, slowly, for the layman please: how many files are opened 
>> there, and is that number then multiplied by the number of tiles in 
>> your decomposition?
>>
>> On the SX you cannot open more than 100 files at the same time; that 
>> applies to the number of units in Fortran I/O. I don't think I have 
>> run into this limit with NetCDF (C) yet, but it is worth keeping in 
>> mind.
>>
>> Best regards,
>>
>> Jens-Olaf
>>
>> PS to Kerstin: the documentation booklets are in Hamburg, waiting to 
>> be forwarded.
>>
>>> I have now figured out what the difference between lab_sea and the 
>>> other experiments with netcdf is: In lab_sea the diagnostics package 
>>> writes 14 netcdf files. When I reduce this number to 6, then the model 
>>> finishes without errors, leaving me 30 files in the end: 
>>> 2*(6diagnostics+regular output+tave-output). Redirecting the monitor 
>>> output to netcdf opens additional files and the model stops again. So 
>>> apparently on our sx8, we can have only 30 netcdf files simultaneously.
>>> That's really odd, and I wonder if there's something that one can do 
>>> about this at the compilation time of the netcdf libraries (that's 
>>> why there's a cc to Kerstin Fieg, who created the netcdf libraries).
>>> Martin
>>> On 27 Nov 2008, at 17:20, Martin Losch wrote:
>>>> Following up on my own previous observation:
>>>> the error for lab_sea has not gone away, and I still don't know 
>>>> exactly what the problem is. But apparently, when mitgcmuv is trying 
>>>> to create the file for the second tile, the netcdf library routine 
>>>> NF_CREATE returns an error code (12) that translates into "Not 
>>>> enough space". I still have no idea why this error should arise. I 
>>>> have about 380GB of disk space available. The exact calling 
>>>> statement is also completely independent of the size of the problem: 
>>>> err = NF_CREATE(fname, NF_CLOBBER, fid). The only input is fname, 
>>>> which is a character string of length 500 (MNC_MAX_PATH).
>>>>
>>>> When I comment out the stop statement in mnc_handle_err, the model 
>>>> finishes with many error messages from the mnc-package (mostly 
>>>> invalid id) and produces a corrupted netcdf file for each of the 
>>>> variables that are saved after the initial problem occurs.
>>>>
>>>> All of this happens for 2 tiles (1 tile is OK obviously, because no 
>>>> second file is opened), regardless of doing this on 1 or 2 CPUs (nSx=2 
>>>> or nPx=2).
>>>>
>>>> To me this looks very much like a non-local problem with memory 
>>>> array boundaries, but I have no clue why and where this should 
>>>> happen. I have tried an array bound check with -eC, but that seemed 
>>>> to be OK. Something really fishy ...
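
A note on the error code 12 mentioned above: NetCDF status codes of its own are negative, and positive values are passed through as operating-system error numbers. On common Unix systems, error number 12 is ENOMEM, whose message reads "Not enough space" on System V-derived systems and "Cannot allocate memory" on Linux. Assuming the SX numbering matches, the message would point at a failed memory allocation rather than at disk space. The helper below is a hypothetical sketch for looking up a code on whatever machine it is run on.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, local message) for a system error number."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)

if __name__ == "__main__":
    # The message text is platform dependent; the symbolic name is the portable part.
    print(describe_errno(12))
```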
>>>>
>>>> Any comments are welcome,
>>>>
>>>> Martin
>>>>
>>>> cc to Jens-Olaf, although he cannot reply to this list.
>>>>
>>>> Oh yes, happy thanksgiving ...
>>>>
>>>> On 30 Jun 2008, at 10:28, Martin Losch wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I found a funny error with netcdf in my SX8 routine test: in 
>>>>> lab_sea/run
>>>>> I get this
>>>>> > cat STDERR.*
>>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 
>>>>> 'phiHydLow.0000000000.t002.nc'
>>>>> > cat STDOUT.0001
>>>>>  NetCDF ERROR:
>>>>>  ===
>>>>>  Not enough space
>>>>>  ===
>>>>>  MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>>>
>>>>> and in ideal_2D_oce
>>>>> > cat STDERR.*
>>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 
>>>>> 'flxDiag.0000036000.t004.nc'
>>>>> > tail STDOUT.0001
>>>>>  NetCDF ERROR:
>>>>>  ===
>>>>>  Not enough space
>>>>>  ===
>>>>>  MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>>>
>>>>> phiHydLow is not part of the diagnostics output and flxDiag.* is 
>>>>> only the 4th output stream in data.diagnostics? By lucky accident I 
>>>>> found that the second error occurs when the model calls
>>>>>> C       Update the record dimension by writing the iteration number
>>>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, -1, myThid)
>>>>>>         CALL MNC_CW_RL_W_S('D',diag_mnc_bn,0,0,'T',myTime,myThid)  
>>>>>> <=======
>>>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, 0, myThid)
>>>>>>         CALL MNC_CW_I_W_S('I',diag_mnc_bn,0,0,'iter',myIter,myThid)
>>>>>>
>>>>> from diagnostics_out.F
>>>>>
>>>>> "Not enough space" cannot refer to disk space, as I am well below 
>>>>> my file number and disk space quotas.
>>>>>
>>>>> Any idea what could be going on? The other examples with netcdf 
>>>>> seem to be doing fine (and in our "production" runs we generally 
>>>>> don't have problems with MITgcm+netcdf ...)
>>>>>
>>>>> Martin
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>
>>
>>
>> -- 
>> Dr. Jens-Olaf Beismann         Benchmarking Analyst
>> High Performance Computing     NEC Deutschland GmbH
>> Tel: +49 431 2372063 (office)       +49 160 1835289 (mobile)
>> http://www.nec.de/home/        Fax: +49 431 2372170
>> ---
>> NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
>> Geschaeftsfuehrer: Yuya Momose
>> Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743
> 
> 


-- 
Dr. Jens-Olaf Beismann         Benchmarking Analyst
High Performance Computing     NEC Deutschland GmbH
Tel: +49 431 2372063 (office)       +49 160 1835289 (mobile)
http://www.nec.de/home/        Fax: +49 431 2372170
---
NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
Geschaeftsfuehrer: Yuya Momose
Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743


