[MITgcm-devel] netcdf on sx8
Jens-Olaf Beismann
jbeismann at hpce.nec.com
Mon Dec 1 14:46:36 EST 2008
Hello Martin,
once more slowly for the laymen, please: how many files are being opened
there, and is that number then also multiplied by the number of tiles in
your decomposition?
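To make the question concrete, here is a small sketch of the arithmetic in Python. It assumes that every tile writes its own copy of every netCDF file, as the 2*(...) in Martin's message below suggests; the function names are made up for illustration. The only numbers used are Martin's: 2 tiles, 6 diagnostics files, 30 files in total, and the original 14 diagnostics files. That the non-diagnostics files stay the same when going from 6 to 14 diagnostics files is an assumption.

```python
def files_open(n_tiles, n_diag_files, n_other_per_tile):
    """Number of netCDF files in use if every tile writes its own set."""
    return n_tiles * (n_diag_files + n_other_per_tile)


def other_files_per_tile(total, n_tiles, n_diag_files):
    """Back out the non-diagnostics files per tile from an observed total."""
    per_tile, remainder = divmod(total, n_tiles)
    if remainder:
        raise ValueError("total is not a multiple of the tile count")
    return per_tile - n_diag_files


if __name__ == "__main__":
    # Martin's working case: 30 files, 2 tiles, 6 diagnostics files.
    other = other_files_per_tile(30, 2, 6)
    print(other)                     # 9 regular + tave files per tile
    # The failing case with 14 diagnostics files (other files assumed unchanged).
    print(files_open(2, 14, other))  # 46
```

If that assumption holds, the failing setup would need 46 files, well above the 30 that worked but still below 100.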
On the SX you cannot open more than 100 files at the same time; that
applies to the number of units in Fortran I/O. I don't think I have run
into this limit with NetCDF (C) yet, but it is worth keeping in the back
of one's mind.
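For anyone who wants to check which limit is actually in force for a process, here is a minimal sketch in Python. It assumes a POSIX-like system where Python's resource module is available, which I have not verified on the SX itself; the helper names open_file_limit and probe_open_files are made up for illustration. It reports the per-process file descriptor limit, which need not coincide with the Fortran unit limit or with whatever limit the netCDF library runs into.

```python
import resource
import tempfile


def open_file_limit():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def probe_open_files(cap):
    """Try to hold `cap` temporary files open at once; return how many succeeded."""
    handles = []
    try:
        for _ in range(cap):
            try:
                handles.append(tempfile.TemporaryFile())
            except OSError:
                break  # the limit was reached before `cap` files were open
        return len(handles)
    finally:
        # Always release the descriptors again, even after a failure.
        for handle in handles:
            handle.close()


if __name__ == "__main__":
    soft, hard = open_file_limit()
    print("soft limit:", soft, "hard limit:", hard)
    print("files held open at once:", probe_open_files(64))
```

If probe_open_files returns fewer files than requested, the process hit a limit before the cap; comparing that count with the soft limit shows whether the descriptor limit is the one being reached.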
Best regards,
Jens-Olaf
PS to Kerstin: the documentation booklets are in Hamburg, waiting to be
forwarded.
> I have now figured out what the difference between lab_sea and the other
> experiments with netcdf is: in lab_sea the diagnostics package writes 14
> netcdf files. When I reduce this number to 6, the model finishes
> without errors, leaving me 30 files in the end: 2*(6 diagnostics + regular
> output + tave output). Redirecting the monitor output to netcdf opens
> additional files and the model stops again. So apparently on our sx8, we
> can have only 30 netcdf files open simultaneously.
>
> That's really odd, and I wonder if there's something that one can do
> about this when compiling the netcdf libraries (that's why
> there's a cc to Kerstin Fieg, who built the netcdf libraries).
>
> Martin
>
> On 27 Nov 2008, at 17:20, Martin Losch wrote:
>
>> Following up on my own previous observation:
>> the error for lab_sea has not gone away, and I still don't know
>> exactly what the problem is. But apparently, when mitgcmuv is trying
>> to create the file for the second tile, the netcdf library routine
>> NF_CREATE returns an error code (12) that translates into "Not enough
>> space". I still have no idea why this error should arise. I have about
>> 380GB of disk space available. The exact calling statement is also
>> completely independent of the size of the problem: err =
>> NF_CREATE(fname, NF_CLOBBER, fid). The only input is fname, which is a
>> character string of length 500 (MNC_MAX_PATH).
>>
>> When I comment out the stop statement in mnc_handle_err, the model
>> finishes with many error messages from the mnc-package (mostly invalid
>> id) and produces a corrupted netcdf file for each of the variables
>> that are saved after the initial problem occurs.
>>
>> All of this happens for 2 tiles (1 tile is OK, obviously, because no
>> second file is opened), regardless of whether I run on 1 or 2 CPUs
>> (nSx=2 or nPx=2).
>>
>> To me this looks very much like a non-local problem with memory array
>> boundaries, but I have no clue why and where this should happen. I
>> have tried an array bound check with -eC, but that seemed to be OK.
>> Something really fishy ...
>>
>> Any comments are welcome,
>>
>> Martin
>>
>> cc to Jens-Olaf, although he cannot reply to this list.
>>
>> Oh yes, happy thanksgiving ...
>>
>> On 30 Jun 2008, at 10:28, Martin Losch wrote:
>>
>>> Hi all,
>>>
>>> I found a funny error with netcdf in my SX8 routine test: in lab_sea/run
>>> I get this
>>> > cat STDERR.*
>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:les
>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening
>>> 'phiHydLow.0000000000.t002.nc'
>>> > cat STDOUT.0001
>>> NetCDF ERROR:
>>> ===
>>> Not enough space
>>> ===
>>> MNC ERROR: opening 'phiHydLow.0000000000.t002.nc'
>>>
>>> and in ideal_2D_oce
>>> > cat STDERR.*
>>> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
>>> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening
>>> 'flxDiag.0000036000.t004.nc'
>>> > tail STDOUT.0001
>>> NetCDF ERROR:
>>> ===
>>> Not enough space
>>> ===
>>> MNC ERROR: opening 'flxDiag.0000036000.t004.nc'
>>>
>>> phiHydLow is not part of the diagnostics output and flxDiag.* is only
>>> the 4th output stream in data.diagnostics? By lucky accident I found
>>> that the second error occurs when the model calls
>>>> C Update the record dimension by writing the iteration number
>>>> CALL MNC_CW_SET_UDIM(diag_mnc_bn, -1, myThid)
>>>> CALL MNC_CW_RL_W_S('D',diag_mnc_bn,0,0,'T',myTime,myThid)
>>>> <=======
>>>> CALL MNC_CW_SET_UDIM(diag_mnc_bn, 0, myThid)
>>>> CALL MNC_CW_I_W_S('I',diag_mnc_bn,0,0,'iter',myIter,myThid)
>>>>
>>> from diagnostics_out.F
>>>
>>> "Not enough space" cannot refer to disk space, as I am well below my
>>> file number and disk space quotas.
>>>
>>> Any idea what could be going on? The other examples with netcdf seem
>>> to be doing fine (and in our "production" runs we generally don't
>>> have problems with MITgcm+netcdf ...)
>>>
>>> Martin
>>> _______________________________________________
>>> MITgcm-devel mailing list
>>> MITgcm-devel at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>
>
>
--
Dr. Jens-Olaf Beismann Benchmarking Analyst
High Performance Computing NEC Deutschland GmbH
Tel: +49 431 2372063 (office) +49 160 1835289 (mobile)
http://www.nec.de/home/ Fax: +49 431 2372170
---
NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf
Geschaeftsfuehrer: Yuya Momose
Handelsregister Duesseldorf, HRB 57941, VAT ID DE129424743