[MITgcm-support] Data formats and archiving hints

Jean-Michel Campin jmc at ocean.mit.edu
Mon Aug 3 09:16:50 EDT 2009


Hi,

Just a little thing to add to Christopher's answer:
the way to get the content of the meta file in matlab:
>> listOfIters=[10 20];
>> [myArr,its,M]=rdmds('dynDiag',listOfIters)
and then:
>> eval(M)
Regarding the time, as Christopher wrote, it's in the meta file
but it's currently ignored by rdmds. I guess we could change it
if this is found to be useful.
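
Until rdmds returns the time itself, the model time can be recovered from the iteration numbers by hand. Here is a minimal sketch in Python (the same arithmetic is one line in matlab), assuming a constant time step. The names iterations_to_time, delta_t, iter0 and start_time are only illustrative, and the value 1200 s is a made-up example, not taken from any particular run:

```python
def iterations_to_time(iterations, delta_t, iter0=0, start_time=0.0):
    """Return the model time in seconds for each iteration number.

    Assumes a constant time step delta_t (seconds) and that iteration
    iter0 corresponds to start_time. Both defaults are assumptions.
    """
    return [start_time + (it - iter0) * delta_t for it in iterations]


list_of_iters = [10, 20]
delta_t = 1200.0  # assumed time step in seconds; use the value from your own "data" file
model_time = iterations_to_time(list_of_iters, delta_t)
print(model_time)
```

If the time step changed during the run, or the run was restarted with a different starting iteration, this simple product no longer holds and the time stored in the meta file is the safer source.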

Also, one could add a one-line experiment identification
in "data", 5th namelist (see e.g.
 verification/tutorial_global_oce_biogeo/input/data):
> the_run_name=   'Tutorial Biogeo',
and this gets written to the meta file of pickup & diagnostics output:
>  simulation = { 'Tutorial Biogeo' };
(probably also in the mnc file? If not, it should not be hard to add it).
This can help you to identify which output corresponds to which run.
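
As a rough illustration of how that line could be used to sort output by run, here is a small Python sketch that pulls the run name out of the text of a meta file. It assumes the line looks exactly like the example above. The function name run_name_from_meta and the sample string are hypothetical, and a real meta file contains other entries that this sketch simply ignores:

```python
import re

# Matches a line of the form:  simulation = { 'Tutorial Biogeo' };
SIMULATION_RE = re.compile(r"simulation\s*=\s*\{\s*'([^']*)'\s*\}")


def run_name_from_meta(meta_text):
    """Return the run name found in meta_text, or None if there is none."""
    match = SIMULATION_RE.search(meta_text)
    return match.group(1) if match else None


# Sample text copied from the example line above, not from a real file.
sample = " simulation = { 'Tutorial Biogeo' };\n"
print(run_name_from_meta(sample))
```

Applied to the text of each meta file in a directory, this gives a quick table of which output belongs to which run.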

And the last thing: the current mnc implementation does not work
for multi-threaded simulations. This needs to be fixed (one day).

Jean-Michel

On Mon, Aug 03, 2009 at 10:50:41AM +0200, Martin Losch wrote:
> Jody,
>
> there is a lot of personal taste involved in output decisions. I like to
> use NetCDF output (only from the diagnostics-pkg) together with the  
> gluemnc-script for medium sized to large problems (I guess, your 32 CPU 
> would fall into that category), if the topology is simple (lat/lon or
> cartesian grids, mainly because gluemnc does not work for cubed sphere 
> grids). The main reason: It's compact and you can quickly look at results 
> with handy tools such as ncview 
> (http://meteora.ucsd.edu/~pierce/ncview_home_page.html). Generally it's 
> much easier to share output with NetCDF (because it's a standard format).
> However, there are some small limitations to the specific NetCDF output 
> of the MITgcm, owing to generality considerations and limited  
> programming man-power: The "missing_value" attribute is not implemented 
> properly, sometimes annoying, and also other attributes and 
> conventions/units are often non-standard e.g. coordinate variables on 
> irregular grids, or vertical coordinates in diagnostics output, which 
> make it sometimes less convenient to use standard tools for plotting 
> (e.g. Ferret). For this reason it's sometimes necessary to modify the 
> netCDF output with the help of the nco's (netcdf operators, 
> http://nco.sourceforge.net, another very useful set of utilities) or other
> tools. If you are using only matlab, all of this is not very relevant. 
> With netcdf (there is rdmnc.m) you can load single variables and single 
> snapshot times from a monolithic file in the same way you can do it with 
> rdmds.
>
> On really large integrations (order 500x500 surface nodes and more), I  
> do not use NetCDF, because you run quickly into a netcdf-file size  
> limitation of 2GB (the MITgcm netcdf interface can handle that by  
> opening new files, once you reach this limit, but it defeats some of the
> purpose of netcdf). This limitation has been lifted with more recent  
> versions of netcdf (3.6 I think), but only to 4GB, as far as I know.  
> When you deal with really large files (order 1GB for 1 3D-field) netcdf 
> becomes pretty useless as far as I am concerned.
>
> On some platforms, there is a little bit of overhead for NetCDF output,
> but I have no idea how big it actually is (some utilities do not 
> vectorize and use up to 3% of the total run time in some of my runs on an
> SX8 vector computer). My experience is that MDS output is far more
> robust: it never fails, whereas NetCDF sometimes requires a little 
> fiddling with library paths, etc, e.g., when I move to a new machine, it 
> takes about ten minutes to set up a build_options file and compile, but 
> with netcdf it has taken me as long as an hour (Ed will say, that's 
> because I am incompetent (o;).
>
> I do not recommend using the netcdf-pickup files (although they  
> generally work fine, but you cannot change the tiling in the middle of a 
> run without extra work, etc.).
>
> Your other problem: I am afraid, that there is no automatism. You'll  
> need to document your runs yourself, painfully boring as it is (I create 
> hand-written tables with notes, comments, parameter values on it, that 
> end up in folders, that I can never find when I need them).
>
> Martin
>
>
> On Aug 3, 2009, at 4:45 AM, Ryan Abernathey wrote:
>
>> Hi Jody,
>>
>> You may have gathered from my recent posts to this list that I have  
>> been wrestling with the same question. I started out several years ago 
>> using MDS files but have switched to NetCDF for my latest project. I 
>> have concluded that NetCDF is much better for several reasons:
>>
>> 1) NetCDF files are not stored in memory by MATLAB. With MDS files, I 
>> often ran up against MATLAB's memory limitations when dealing with  
>> large 64-bit 3D data files. This is not an issue using NetCDF, as the 
>> data is read directly from the filesystem only when it is needed. Also, 
>> there is no "load time" when instantiating a NetCDF file in MATLAB--it 
>> happens instantly. This is far superior to how MDS files are handled, 
>> and consequently there is no limit on the size of NetCDF files.
>>
>> 2) Grid and coordinate information is embedded in the NetCDF files,  
>> along with units, descriptions, and time information (i.e. metadata). 
>> This means that you don't need to keep referencing the manual to figure 
>> out the precise spatial coordinates for each of your diagnostics. Very 
>> useful.
>>
>> 3) The output from all timesteps is condensed into one file. Combined 
>> with the ability to output different diagnostics into the same file 
>> (using the diagnostics package), this means you can potentially store 
>> all of the output you wish to analyze from a particular run in one 
>> single file. I suspect this would solve all your organizational 
>> problems.
>>
>> However, there is one major disadvantage, especially for large runs.
>>
>> * The globalFiles or useSingleCpuIO options do not work with NetCDF  
>> output. Each tile writes its own file. So when your run is done you  
>> have to use a script called gluemnc (available as a MATLAB or shell  
>> script) to join together the different tiles into one global netCDF  
>> file. (Your post gave the impression you aren't currently using this  
>> option, so this extra step probably won't seem like a big deal  
>> anyway.)
>>
>> Overall I would definitely recommend switching to netCDF. The
>> long-term benefits will outweigh the temporary pain.
>>
>> Hope this helps!
>>
>> -Ryan
>>
>> p.s. Many people apparently prefer to keep using MDS pickup files, but 
>> that is a different thread...
>>
>>
>> On Aug 2, 2009, at 2:48 PM, Klymak Jody wrote:
>>
>>>
>>> Hi all,
>>>
>>> As an amateur numerical modeller using the MITgcm I thought I'd ask  
>>> for folks' data format and archiving ideas/advice.
>>>
>>> I do my analysis in Matlab, and am unlikely to change that.  I've  
>>> been writing the bare binary files (mds?) and reading those in fine  
>>> with the matlab rdmds.m function.  It works very well, and I  
>>> appreciate the effort that went into it.
>>>
>>> However, as I get to larger simulations (ahem, larger for me means  
>>> 16 or 32 tiles instead of 4 or 8), I start to wonder about the  
>>> thousands of tile files on my machine, and if that is really the  
>>> most efficient way for me to be storing my data.  So:
>>>
>>> Is there an inherent advantage to switching to netcdf?
>>>
>>> To be honest I'm not sure what files are produced from the netcdf  
>>> output - it looks like they are per-tile, and monolithic in that one 
>>> file contains the whole run for that tile?  If correct,  how fast are 
>>> they to read in matlab?  I'm running a simulation that will reach 
>>> 3Gb/tile.
>>>
>>> Is there more meta information?  I am always flummoxed that there is
>>> no "time" in the MDS meta files, so I have to figure out what dt was 
>>> for my run and multiply by iteration number.
>>>
>>> Parallel discussion:  How do folks organize and keep track of their  
>>> model runs?  I have a large number now, and quite frankly I forget  
>>> which ones are trash, and which ones I am using for my latest paper.  
>>> Sure, I have to be more organized, but rather than reinvent the wheel,
>>> I'd love to hear how folks who have been doing this for a while keep 
>>> track.  Being lazy, automagic methods are always appreciated...
>>>
>>> Thanks for any thoughts folks feel like sharing...
>>>
>>> Cheers,  Jody
>>> _______________________________________________
>>> MITgcm-support mailing list
>>> MITgcm-support at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-support


