[MITgcm-devel] netcdf on sx8

Martin Losch Martin.Losch at awi.de
Tue Dec 2 03:06:57 EST 2008


Hi there,

I am still sending this to the devel list, because someone else may have an
idea what to do:

For the layman: the mnc package opens one file per tile and output
stream. So if only dumpFreq>0 and the monitor output is not directed to
netcdf, the lab_sea experiment (2 tiles) produces the following
output files:
> grid.t001.nc
> phiHyd.0000000000.t001.nc
> phiHydLow.0000000000.t001.nc
> sice.0000000000.t001.nc
> state.0000000000.t001.nc
and
> grid.t002.nc
> phiHyd.0000000000.t002.nc
> phiHydLow.0000000000.t002.nc
> sice.0000000000.t002.nc
> state.0000000000.t002.nc
When the diagnostics pkg is turned on (as in the lab_sea experiment),
we get another pair of files for each of the output streams opened there
(currently 19). These files are NEVER open at the same time: as far as I
can see there is only ever one file open at a time, apart from the STDOUT
and STDERR files, which are not NetCDF files.
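
To make the bookkeeping concrete, here is a minimal sketch that prints the
file names from the listing above for 2 tiles and counts them. It assumes
the name pattern can simply be read off that listing (stream name, 10-digit
iteration number, 3-digit tile number); it is not taken from the mnc source,
and the program name mncfiles and the arrays stream and slen are made up
for illustration.

```fortran
      program mncfiles
C     Sketch only: one file per tile and output stream.
C     Stream names are copied from the listing in this mail; the
C     name pattern is inferred from it, not from the mnc package.
      implicit none
      integer nTiles, nStreams
      parameter ( nTiles = 2, nStreams = 5 )
      character*(9) stream(nStreams)
      integer slen(nStreams)
      integer it, is, nFiles
      data stream / 'grid', 'phiHyd', 'phiHydLow', 'sice', 'state' /
      data slen   / 4, 6, 9, 4, 5 /

      nFiles = 0
      do it = 1, nTiles
       do is = 1, nStreams
        if ( is .eq. 1 ) then
C     the grid file carries no iteration number
         write(*,'(2A,I3.3,A)')
     &        stream(is)(1:slen(is)), '.t', it, '.nc'
        else
C     all other streams: name.<iteration>.t<tile>.nc
         write(*,'(2A,I10.10,A,I3.3,A)')
     &        stream(is)(1:slen(is)), '.', 0, '.t', it, '.nc'
        endif
        nFiles = nFiles + 1
       enddo
      enddo
      write(*,*) 'total number of files = ', nFiles
      end
```

With nTiles = 2 and the five streams above this counts 10 files; every
additional output stream adds one more file per tile.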

When I reduce the number of files (by editing data.diagnostics) so  
that the total number of files created is 30, the model gives no error.
I wrote a little test program that creates 100 files:
>       program nctest
>
>       implicit none
>       include 'netcdf.inc'
>       integer n,m,fid,ierr
>       character*(56) fname
>
>       m=100
>       write(*,*) 'input number of files = ', m
> C      read(*,*) m
>       do n=1,m
>        write(fname,'(A,I5.5,A)') 'foo',n,'.nc'
>        write(*,*) (fname)
>        ierr = nf_create(fname,NF_CLOBBER,fid)
>        if ( ierr .NE. NF_NOERR ) THEN
>         print *, '==='
>         print *, nf_strerror(ierr)
>         print *, '==='
>        else
>         print *, '=== ierr = ', ierr
>        endif
>       enddo
>       stop 'NORMAL END'
>       end
compiled with
> sxf90 -I/sx8/user2/awisoft/sx8/netcdf-4.0/dw/include -L/sx8/user2/awisoft/sx8/netcdf-4.0/dw/lib -o nctest nctest.F -lnetcdf
this test produces the following output:
>  input number of files =   100
>  foo00001.nc
>  === ierr =   0
>  foo00002.nc
>  === ierr =   0
>  foo00003.nc
[ ...]
>  foo00063.nc
>  === ierr =   0
>  foo00064.nc
>  ===
>  Not enough space
>
>  ===
>  foo00065.nc
>  ===
>  Not enough space
>
[...] until foo00100.nc
which makes it pretty clear: The number of NetCDF files that can be  
created with the current netcdf on this platform is limited (why it  
seems to be 30 or 31 in one case and 63 in another one beats me).
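
Note that the test program above never closes the files it creates, so it
may be probing a limit on simultaneously open NetCDF files rather than on
the total number of files created. A minimal sketch of a variant that closes
each file straight after creating it, assuming the same netcdf.inc and link
flags as above (ncclose is just a name picked here, and this has not been
run on the SX8):

```fortran
      program ncclose
C     Variant of nctest: close each file right after creating it.
C     Assumption: if the limit is on simultaneously open files,
C     this version should get past file 64; if it still fails
C     there, the limit is on something else.
      implicit none
      include 'netcdf.inc'
      integer n,m,fid,ierr
      character*(56) fname

      m=100
      do n=1,m
       write(fname,'(A,I5.5,A)') 'foo',n,'.nc'
C     NF_CLOBBER is the integer constant from netcdf.inc
       ierr = nf_create(fname,NF_CLOBBER,fid)
       if ( ierr .NE. NF_NOERR ) then
        print *, 'create failed for ', fname(1:11)
        print *, nf_strerror(ierr)
       else
C     release the netcdf id (and the underlying file) before
C     the next nf_create
        ierr = nf_close(fid)
        if ( ierr .NE. NF_NOERR ) print *, nf_strerror(ierr)
       endif
      enddo
      stop 'NORMAL END'
      end
```

If this version runs through to foo00100.nc, that would point to a limit on
open files; it would still not explain why the model stops at about 30 files.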

I can run the same program on the head node, which is
> Linux sx8 2.6.5-7.283-default #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64 ia64 ia64 GNU/Linux
compiled with
> ifort -I/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/include -L/sx8/user2/awisoft/tx7/netcdf/netcdf-4.0/lib -o nctest nctest.F -lnetcdf
without any problems.

Do we try to fix this, or do we change data.diagnostics in
lab_sea so that there are fewer netcdf files? I opt for the former,
but how?

Martin

PS for Jens-Olaf and Kerstin: you can find the files in
/home/sx8/mlosch/netcdftest

On 1 Dec 2008, at 20:46, Jens-Olaf Beismann wrote:

> Hello Martin,
>
> once more slowly for the laymen, please: how many files are opened
> there, and is that number then multiplied by the number of tiles in
> your decomposition?
>
> On the SX you cannot have more than 100 files open at the same time;
> this applies to the number of units in Fortran I/O. I don't think I
> have run into this limit with NetCDF (C) yet, but it is worth
> keeping in mind.
>
> Best regards,
>
> Jens-Olaf
>
> PS to Kerstin: the documentation booklets are in Hamburg, waiting to
> be forwarded.
>
>> I have now figured out what the difference between lab_sea and the
>> other experiments with netcdf is: in lab_sea the diagnostics
>> package writes 14 netcdf files. When I reduce this number to 6,
>> the model finishes without errors, leaving me 30 files in the
>> end: 2*(6 diagnostics + regular output + tave-output). Redirecting
>> the monitor output to netcdf opens additional files and the model
>> stops again. So apparently on our sx8, we can have only 30 netcdf
>> files simultaneously.
>> That's really odd, and I wonder if there's something that one can  
>> do about this at the compilation time of the netcdf libraries  
>> (that's why there's a cc to Kerstin Fieg, who created the netcdf  
>> libraries).
>> Martin
>> On 27 Nov 2008, at 17:20, Martin Losch wrote:
>>> Following up on my own previous observation:
>>> the error for lab_sea has not gone away, and I still don't know  
>>> exactly what the problem is. But apparently, when mitgcmuv is  
>>> trying to create the file for the second tile, the netcdf library  
>>> routine NF_CREATE returns an error code (12) that translates into  
>>> "Not enough space". I still have no idea why this error should  
>>> arise. I have about 380GB of disk space available. The exact
>>> calling statement is also completely independent of the size of
>>> the problem: err = NF_CREATE(fname, NF_CLOBBER, fid). The only
>>> input is fname, which is a character string of length 500 (MNC_MAX_PATH).
>>>
>>> When I comment out the stop statement in mnc_handle_err, the  
>>> model finishes with many error messages from the mnc-package  
>>> (mostly invalid id) and produces a corrupted netcdf file for each  
>>> of the variables that are saved after the initial problem occurs.
>>>
>>> All of this happens for 2 tiles (1 tile is obviously OK, because
>>> no second file is opened), regardless of whether I run on 1 or
>>> 2 CPUs (nSx=2 or nPx=2).
>>>
>>> To me this looks very much like a non-local problem with memory  
>>> array boundaries, but I have no clue why and where this should  
>>> happen. I have tried an array bound check with -eC, but that  
>>> seemed to be OK. Something really fishy ...
>>>
>>> Any comments are welcome,
>>>
>>> Martin
>>>
>>> cc to Jens-Olaf, although he cannot reply to this list.
>>>
>>> Oh yes, happy thanksgiving ...
>>>
>>> On 30 Jun 2008, at 10:28, Martin Losch wrote:
>>>
>>>> Hi all,
>>>>
>>>> I found a funny error with netcdf in my SX8 routine test: in  
>>>> lab_sea/run
>>>> I get this
>>>> > cat STDERR.*
>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:les
>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>> > cat STDOUT.0001
>>>>  NetCDF ERROR:
>>>>  ===
>>>>  Not enough space
>>>>  ===
>>>>  MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>>
>>>> and in ideal_2D_oce
>>>> > cat STDERR.*
>>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>> > tail STDOUT.0001
>>>>  NetCDF ERROR:
>>>>  ===
>>>>  Not enough space
>>>>  ===
>>>>  MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>>
>>>> phiHydLow is not part of the diagnostics output and flxDiag.* is
>>>> only the 4th output stream in data.diagnostics? By lucky
>>>> accident I found that the second error occurs when the model calls
>>>>> C       Update the record dimension by writing the iteration number
>>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, -1, myThid)
>>>>>         CALL MNC_CW_RL_W_S('D',diag_mnc_bn,0,0,'T',myTime,myThid)  <=======
>>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, 0, myThid)
>>>>>         CALL MNC_CW_I_W_S('I',diag_mnc_bn,0,0,'iter',myIter,myThid)
>>>>>
>>>> from diagnostics_out.F
>>>>
>>>> "Not enough space" cannot refer to disk space, as I am well
>>>> below my file-number and disk-space quotas.
>>>>
>>>> Any idea what could be going on? The other examples with netcdf
>>>> seem to be doing fine (and in our "production" runs we
>>>> generally don't have problems with MITgcm+netcdf ...)
>>>>
>>>> Martin
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>
>
>
> -- 
> Dr. Jens-Olaf Beismann         Benchmarking Analyst
> High Performance Computing     NEC Deutschland GmbH
> Tel: +49 431 2372063 (office)       +49 160 1835289 (mobile)
> http://www.nec.de/home/        Fax: +49 431 2372170
> ---
> NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
> Geschaeftsfuehrer: Yuya Momose
> Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743



