[MITgcm-devel] netcdf on sx8

Jens-Olaf Beismann jbeismann at hpce.nec.com
Mon Dec 1 14:46:36 EST 2008


Hello Martin,

once more, slowly, for the laypeople please: how many files are being 
opened there, and is that number then multiplied by the number of 
tiles in your decomposition?

On the SX you cannot have more than 100 files open at the same time; 
this applies to the number of units in Fortran I/O. I don't think I 
have run into this limit with NetCDF (C) yet, but it is worth keeping 
in mind.
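
A quick way to check whether a per-process limit on open files is in play is to open files until the system refuses. The following is a minimal sketch in Python, not something taken from the MITgcm or NetCDF sources: the function name count_openable_files and the cap max_tries are made up for illustration, and it only exercises plain OS-level opens, so it says nothing about Fortran unit numbers or about limits inside the NetCDF library itself.

```python
import os
import tempfile


def count_openable_files(max_tries=256):
    """Open temporary files until the OS refuses; return how many succeeded.

    max_tries is an arbitrary cap chosen for this sketch so that the probe
    stops even when no limit is reached.
    """
    handles = []
    with tempfile.TemporaryDirectory() as tmpdir:
        try:
            for i in range(max_tries):
                path = os.path.join(tmpdir, f"probe_{i:05d}.tmp")
                handles.append(open(path, "wb"))
        except OSError as exc:
            # Typically EMFILE ("too many open files") or a similar error.
            print(f"open failed after {len(handles)} files: {exc}")
        finally:
            n_opened = len(handles)
            # Close everything before the temporary directory is removed.
            for handle in handles:
                handle.close()
    return n_opened


if __name__ == "__main__":
    print("files opened simultaneously:", count_openable_files())
```

If the count stops well short of max_tries, for example near 30 or 100, that would suggest the limit sits in the operating system or the shell settings rather than in the model.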

Best regards,

Jens-Olaf

PS to Kerstin: the documentation booklets are in Hamburg, waiting for 
onward transport.

> I have now figured out what the difference between lab_sea and the other 
> experiments with netcdf is: in lab_sea the diagnostics package writes 14 
> netcdf files. When I reduce this number to 6, the model finishes 
> without errors, leaving me 30 files in the end: 2*(6 diagnostics + regular 
> output + tave output). Redirecting the monitor output to netcdf opens 
> additional files and the model stops again. So apparently on our sx8, we 
> can have only 30 netcdf files simultaneously.
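
The file count above can be written out as a small calculation. This is only a sketch under an assumption: the 9 files per tile attributed to regular and tave output are inferred from the totals quoted above (30 files for 2 tiles with 6 diagnostics files per tile), not stated explicitly, and the function name netcdf_file_count is hypothetical.

```python
def netcdf_file_count(n_tiles, n_diag_files, n_other_files):
    """Files open at once if every tile writes its own copy of each file."""
    return n_tiles * (n_diag_files + n_other_files)


n_tiles = 2
# Inferred, not stated in the message: 30 files / 2 tiles - 6 diagnostics = 9.
n_other_files = 30 // n_tiles - 6

print(netcdf_file_count(n_tiles, 6, n_other_files))   # 30, the run that finishes
print(netcdf_file_count(n_tiles, 14, n_other_files))  # 46, the original lab_sea setup
```

Under this reading the number of files does scale with the number of tiles, so a fixed ceiling would be reached sooner with a finer decomposition.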
> 
> That's really odd, and I wonder if there's something that one can do 
> about this at the compilation time of the netcdf libraries (that's why 
> there's a cc to Kerstin Fieg, who created the netcdf libraries).
> 
> Martin
> 
> On 27 Nov 2008, at 17:20, Martin Losch wrote:
> 
>> Following up on my own previous observation:
>> the error for lab_sea has not gone away, and I still don't know 
>> exactly what the problem is. But apparently, when mitgcmuv is trying 
>> to create the file for the second tile, the netcdf library routine 
>> NF_CREATE returns an error code (12) that translates into "Not enough 
>> space". I still have no idea why this error should arise. I have about 
>> 380GB of disk space available. The exact calling statement is also 
>> completely independent of the size of the problem: err = 
>> NF_CREATE(fname, NF_CLOBBER, fid). The only input is fname, which is a 
>> character string of length 500 (MNC_MAX_PATH).
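
A note on the error code: NetCDF passes positive status values straight through from the operating system, so 12 is most likely a system errno rather than a NetCDF-specific code. On common Unix systems errno 12 is ENOMEM, and some System V style systems word its message as "Not enough space", which would point at memory or another system resource rather than disk space. The sketch below decodes the number on whatever machine it runs on; describe_errno is a made-up helper name, and both the numbering and the message text can differ between platforms, so the result on the SX-8 itself is an assumption until checked there.

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and the local message text for an errno value."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


name, message = describe_errno(12)
print(name, "-", message)
```

On Linux this reports ENOMEM with the message "Cannot allocate memory"; the wording on other systems may differ.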
>>
>> When I comment out the stop statement in mnc_handle_err, the model 
>> finishes with many error messages from the mnc-package (mostly invalid 
>> id) and produces a corrupted netcdf file for each of the variables 
>> that are saved after the initial problem occurs.
>>
>> All of this happens for 2 tiles (1 tile is obviously OK, because no 
>> second file is opened), regardless of whether it runs on 1 or 2 CPUs 
>> (nSx=2 or nPx=2).
>>
>> To me this looks very much like a non-local problem with memory array 
>> boundaries, but I have no clue why and where this should happen. I 
>> have tried an array bound check with -eC, but that seemed to be OK. 
>> Something really fishy ...
>>
>> Any comments are welcome,
>>
>> Martin
>>
>> cc to Jens-Olaf, although he cannot reply to this list.
>>
>> Oh yes, happy thanksgiving ...
>>
>> On 30 Jun 2008, at 10:28, Martin Losch wrote:
>>
>>> Hi all,
>>>
>>> I found a funny error with netcdf in my SX8 routine test: in lab_sea/run
>>> I get this
>>> > cat STDERR.*
>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:les
>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 
>>> 'phiHydLow.0000000000.t002.nc'
>>> > cat STDOUT.0001
>>>  NetCDF ERROR:
>>>  ===
>>>  Not enough space
>>>  ===
>>>  MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>
>>> and in ideal_2D_oce
>>> > cat STDERR.*
>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 
>>> 'flxDiag.0000036000.t004.nc'
>>> > tail STDOUT.0001
>>>  NetCDF ERROR:
>>>  ===
>>>  Not enough space
>>>  ===
>>>  MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>
>>> phiHydLow is not part of the diagnostics output, and flxDiag.* is only 
>>> the 4th output stream in data.diagnostics? By lucky accident I found 
>>> that the second error occurs when the model calls
>>>> C       Update the record dimension by writing the iteration number
>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, -1, myThid)
>>>>         CALL MNC_CW_RL_W_S('D',diag_mnc_bn,0,0,'T',myTime,myThid)  <=======
>>>>         CALL MNC_CW_SET_UDIM(diag_mnc_bn, 0, myThid)
>>>>         CALL MNC_CW_I_W_S('I',diag_mnc_bn,0,0,'iter',myIter,myThid)
>>>>
>>> from diagnostics_out.F
>>>
>>> "Not enough space" cannot refer to disk space, as I am well below my 
>>> file number and disk-space quotas.
>>>
>>> Any idea what could be going on? The other examples with netcdf seem 
>>> to be doing fine (and in our "production" runs we generally don't 
>>> have problems with MITgcm+netcdf ...)
>>>
>>> Martin
>>> _______________________________________________
>>> MITgcm-devel mailing list
>>> MITgcm-devel at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>
> 
> 


-- 
Dr. Jens-Olaf Beismann         Benchmarking Analyst
High Performance Computing     NEC Deutschland GmbH
Tel: +49 431 2372063 (office)       +49 160 1835289 (mobile)
http://www.nec.de/home/        Fax: +49 431 2372170
---
NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
Geschaeftsfuehrer: Yuya Momose
Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743


