[MITgcm-support] Data formats and archiving hints
Martin Losch
Martin.Losch at awi.de
Mon Aug 3 04:50:41 EDT 2009
Jody,

there is a lot of personal taste involved in output decisions. I like
to use NetCDF output (only from the diagnostics pkg) together with the
gluemnc script for medium-sized to large problems (I guess your 32-CPU
run would fall into that category), provided the topology is simple
(lat/lon or Cartesian grids, mainly because gluemnc does not work for
cubed-sphere grids). The main reason: it's compact, and you can quickly
look at results with handy tools such as ncview
(http://meteora.ucsd.edu/~pierce/ncview_home_page.html). Generally it
is also much easier to share NetCDF output, because it's a standard
format.

However, there are some small limitations to the MITgcm's particular
NetCDF output, owing to generality considerations and limited
programming man-power: the "missing_value" attribute is not implemented
properly (sometimes annoying), and other attributes and
conventions/units are often non-standard, e.g. coordinate variables on
irregular grids or vertical coordinates in diagnostics output, which
sometimes makes it less convenient to use standard tools for plotting
(e.g. Ferret). For this reason it is sometimes necessary to modify the
NetCDF output with the help of the NCOs (NetCDF Operators,
http://nco.sourceforge.net, another very useful set of utilities) or
other tools. If you are using only MATLAB, none of this matters much:
with NetCDF (there is rdmnc.m) you can load single variables and single
snapshot times from a monolithic file in the same way you can with
rdmds.

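As a minimal sketch (the file pattern, variable name and iteration
number are only placeholders for whatever your run writes), the two
calls look very similar:

  % MDS: read the field T at iteration 72000
  T = rdmds('T', 72000);
  % NetCDF: read only the variable Temp at iteration 72000 from the
  % state file(s); rdmnc returns a structure that also carries the
  % coordinate variables
  S = rdmnc('state.*.nc', 'Temp', 72000);
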
For really large integrations (order 500x500 surface nodes and more),
I do not use NetCDF, because you quickly run into the NetCDF file-size
limitation of 2 GB (the MITgcm NetCDF interface can handle this by
opening a new file once you reach the limit, but that defeats some of
the purpose of NetCDF). This limitation has been relaxed in more recent
versions of NetCDF (3.6, I think), but only to 4 GB as far as I know.
When you deal with really large files (order 1 GB for a single 3D
field), NetCDF becomes pretty useless as far as I am concerned.

On some platforms there is a little bit of overhead for NetCDF output,
but I have no idea how big it actually is (some utilities do not
vectorize and use up to 3% of the total run time in some of my runs on
an SX-8 vector computer). My experience is that MDS output is far more
robust: it never fails, whereas NetCDF sometimes requires a little
fiddling with library paths, etc. For example, when I move to a new
machine it takes about ten minutes to set up a build_options file and
compile, but with NetCDF it has taken me as long as an hour (Ed will
say that's because I am incompetent (o;).

I do not recommend using the NetCDF pickup files (they generally work
fine, but you cannot change the tiling in the middle of a run without
extra work, etc.).

As for your other problem: I am afraid there is no automatic solution.
You'll need to document your runs yourself, painfully boring as that is
(I create hand-written tables with notes, comments and parameter values
that end up in folders I can never find when I need them).

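If you do want to script at least part of that bookkeeping, a little
sketch like the following (the directory pattern and the single
parameter picked out here are only an illustration) pulls one value out
of each run's "data" namelist, which at least gives you a starting
point for such a table:

  % list run directories and extract deltaT from each run's "data" namelist
  runs = dir('run_*');
  runs = runs([runs.isdir]);            % keep directories only
  for n = 1:length(runs)
    txt = fileread(fullfile(runs(n).name, 'data'));
    tok = regexp(txt, 'deltaT\s*=\s*([0-9.eEdD+-]+)', 'tokens', 'once');
    if isempty(tok), tok = {'?'}; end   % parameter not found
    fprintf('%-20s deltaT = %s\n', runs(n).name, tok{1});
  end
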
Martin
On Aug 3, 2009, at 4:45 AM, Ryan Abernathey wrote:
> Hi Jody,
>
> You may have gathered from my recent posts to this list that I have
> been wrestling with the same question. I started out several years
> ago using MDS files but have switched to NetCDF for my latest
> project. I have concluded that NetCDF is much better for several
> reasons:
>
> 1) NetCDF files are not loaded wholesale into memory by MATLAB. With
> MDS files, I often ran up against MATLAB's memory limitations when
> dealing with large 64-bit 3D data files. This is not an issue with
> NetCDF, as the data are read directly from the filesystem only when
> they are needed. Also, there is no "load time" when opening a NetCDF
> file in MATLAB -- it happens instantly. This is far superior to how
> MDS files are handled, and it means the size of the files you can
> analyze is no longer limited by MATLAB's memory. (There is a small
> reading example right after this list.)
>
> 2) Grid and coordinate information is embedded in the NetCDF files,
> along with units, descriptions, and time information (i.e.
> metadata). This means that you don't need to keep referencing the
> manual to figure out the precise spatial coordinates for each of
> your diagnostics. Very useful.
>
> 3) The output from all timesteps is condensed into one file.
> Combined with the ability to output different diagnostics into the
> same file (using the diagnostics package), this means you can
> potentially store all of the output you wish to analyze from a
> particular run in one single file. I suspect this would solve all
> your organizational problems.
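>
> To illustrate point 1 (the file and variable names below are only
> placeholders): with netCDF read support in MATLAB (ncread in newer
> releases, or toolboxes such as snctools/mexnc in older ones) you can
> pull a single time level out of a large 4-D variable without touching
> the rest of the file:
>
>   % read only time level 10 of Temp (dimensions X,Y,Z,T as MATLAB
>   % reports them), leaving the rest of the file on disk
>   temp = ncread('state.glob.nc', 'Temp', [1 1 1 10], [Inf Inf Inf 1]);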
>
> However, there is one major disadvantage, especially for large runs.
>
> * The globalFiles or useSingleCpuIO options do not work with NetCDF
> output. Each tile writes its own file. So when your run is done you
> have to use a script called gluemnc (available as a MATLAB or shell
> script) to join together the different tiles into one global netCDF
> file. (Your post gave the impression you aren't currently using this
> option, so this extra step probably won't seem like a big deal
> anyway.)
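>
> If you only want to peek at the per-tile output without gluing it
> first, the rdmnc.m utility in utils/matlab can, as far as I know,
> assemble the tiles while it reads; the file pattern and variable name
> here are just examples:
>
>   % read Temp from all tiles of the state files and reassemble the domain
>   S = rdmnc('state.*.nc', 'Temp');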
>
> Overall I would definitely recommend switching to netCDF. The long-
> term benefits will outweigh the temporary pain.
>
> Hope this helps!
>
> -Ryan
>
> p.s. Many people apparently prefer to keep using MDS pickup files,
> but that is a different thread...
>
>
> On Aug 2, 2009, at 2:48 PM, Klymak Jody wrote:
>
>>
>> Hi all,
>>
>> As an amateur numerical modeller using the MITgcm I thought I'd ask
>> for folks' data format and archiving ideas/advice.
>>
>> I do my analysis in Matlab, and am unlikely to change that. I've
>> been writing the bare binary files (mds?) and reading those in fine
>> with the matlab rdmds.m function. It works very well, and I
>> appreciate the effort that went into it.
>>
>> However, as I get to larger simulations (ahem, larger for me means
>> 16 or 32 tiles instead of 4 or 8), I start to wonder about the
>> thousands of tile files on my machine, and if that is really the
>> most efficient way for me to be storing my data. So:
>>
>> Is there an inherent advantage to switching to netcdf?
>>
>> To be honest I'm not sure what files are produced from the netcdf
>> output - it looks like they are per-tile, and monolithic in that
>> one file contains the whole run for that tile? If correct, how
>> fast are they to read in matlab? I'm running a simulation that
>> will reach 3 GB/tile.
>>
>> Is there more meta information? I am always flummoxed that there is
>> no "time" in the MDS meta files, so I have to figure out what dt was
>> for my run and multiply by iteration number.
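>>
>> The arithmetic itself is easy enough; what I do now looks roughly
>> like this (deltaT being whatever value I set in the "data" namelist):
>>
>>   deltaT = 60;                 % time step from this run's "data" namelist
>>   [T, its] = rdmds('T', NaN);  % NaN reads all available iterations
>>   time_in_seconds = its * deltaT;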
>>
>> Parallel discussion: How do folks organize and keep track of their
>> model runs? I have a large number now, and quite frankly I forget
>> which ones are trash and which ones I am using for my latest paper.
>> Sure, I have to be more organized, but rather than reinvent the
>> wheel, I'd love to hear how folks who have been doing this for a
>> while keep track. Being lazy, automagic methods are always
>> appreciated...
>>
>> Thanks for any thoughts folks feel like sharing...
>>
>> Cheers, Jody
>> _______________________________________________
>> MITgcm-support mailing list
>> MITgcm-support at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support