[MITgcm-support] Data formats and archiving hints
Jean-Michel Campin
jmc at ocean.mit.edu
Mon Aug 3 09:16:50 EDT 2009
Hi,
Just a little thing to add to Christopher's answer:
here is how to get the content of the meta file in Matlab:
>> listOfIters=[10 20];
>> [myArr,its,M]=rdmds('dynDiag',listOfIters)
and then:
>> eval(M)
Regarding the time: as Christopher wrote, it is in the meta file,
but it is currently ignored by rdmds. We could change that
if it turns out to be useful.
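As an illustration, here is a minimal Python sketch of pulling a field
such as timeStepNumber out of a meta file's text yourself (the field
names and layout below are typical examples only; they vary between
MITgcm versions and output types, and this is not part of rdmds):

```python
import re

def read_meta_field(meta_text, field):
    """Extract the numeric values of one field (e.g. 'timeStepNumber')
    from the text of an MITgcm .meta file. Returns a list of floats,
    or None if the field is absent."""
    m = re.search(field + r"\s*=\s*\[([^\]]*)\]", meta_text)
    if m is None:
        return None
    return [float(v) for v in m.group(1).replace(",", " ").split()]

# Example .meta content (abridged; the exact layout varies):
meta = """
 nDims = [   2 ];
 dimList = [ 90, 1, 90, 40, 1, 40 ];
 dataprec = [ 'float32' ];
 nrecords = [   1 ];
 timeStepNumber = [     10 ];
"""
print(read_meta_field(meta, "timeStepNumber"))  # -> [10.0]
```

From the time-step number you would still have to multiply by your run's
deltaT to get the model time, as discussed below.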
Also, one could add a 1-line experiment identification
(in "data", 5th namelist; see e.g.:
verification/tutorial_global_oce_biogeo/input/data
> the_run_name= 'Tutorial Biogeo',
)
and this gets written in the meta file of pickup & diagnostics output:
> simulation = { 'Tutorial Biogeo' };
(probably also in the mnc file? If not, it should not be hard to add.)
This can help you identify which output corresponds to which run.
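Spelled out, the addition could look something like this sketch of the
5th namelist (PARM05) in "data" (the bathyFile entry is just an
illustrative placeholder for whatever is already in your file):

```
 &PARM05
 bathyFile='bathy.bin',
 the_run_name='Tutorial Biogeo',
 &
```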
And one last thing: the current mnc implementation does not work
for multi-threaded simulations. This needs to be fixed (one day).
Jean-Michel
On Mon, Aug 03, 2009 at 10:50:41AM +0200, Martin Losch wrote:
> Jody,
>
> there is a lot of personal taste involved in output decisions. I like to
> use NetCDF output (only from the diagnostics-pkg) together with the
> gluemnc-script for medium-sized to large problems (I guess your 32 CPUs
> would fall into that category), if the topology is simple (lat/lon or
> cartesian grids, mainly because gluemnc does not work for cubed-sphere
> grids). The main reason: It's compact and you can quickly look at results
> with handy tools such as ncview
> (http://meteora.ucsd.edu/~pierce/ncview_home_page.html). Generally it's
> much easier to share output with NetCDF (because it's a standard format).
> However, there are some small limitations to the specific NetCDF output
> of the MITgcm, owing to generality considerations and limited
> programming man-power: the "missing_value" attribute is not implemented
> properly (sometimes annoying), and other attributes and
> conventions/units are often non-standard, e.g. coordinate variables on
> irregular grids, or vertical coordinates in diagnostics output, which
> makes it sometimes less convenient to use standard tools for plotting
> (e.g. Ferret). For this reason it's sometimes necessary to modify the
> netCDF output with the help of the nco's (netcdf operators,
> http://nco.sourceforge.net, another very useful set of utilities) or other
> tools. If you are using only matlab, all of this is not very relevant.
> With netcdf (there is rdmnc.m) you can load single variables and single
> snapshot times from a monolithic file in the same way you can do it with
> rdmds.
>
> On really large integrations (order 500x500 surface nodes and more), I
> do not use NetCDF, because you quickly run into a netcdf file-size
> limitation of 2GB (the MITgcm netcdf interface can handle that by
> opening new files once you reach this limit, but that defeats some of
> the purpose of netcdf). This limitation has been lifted in more recent
> versions of netcdf (3.6, I think), but only to 4GB, as far as I know.
> When you deal with really large files (order 1GB for 1 3D-field) netcdf
> becomes pretty useless as far as I am concerned.
>
> On some platforms, there is a little bit of overhead for NetCDF output,
> but I have no idea how big it actually is (some utilities do not
> vectorize and use up to 3% of the total run time in some of my runs on
> an SX8 vector computer). My experience is that MDS output is far more
> robust: it never fails, whereas NetCDF sometimes requires a little
> fiddling with library paths, etc. For example, when I move to a new
> machine, it takes about ten minutes to set up a build_options file and
> compile, but with netcdf it has taken me as long as an hour (Ed will
> say that's because I am incompetent (o;).
>
> I do not recommend using the netcdf-pickup files (they generally work
> fine, but you cannot change the tiling in the middle of a run without
> extra work, etc.).
>
> Your other problem: I am afraid there is no automatic solution. You'll
> need to document your runs yourself, painfully boring as it is (I create
> hand-written tables with notes, comments, and parameter values, which
> end up in folders that I can never find when I need them).
>
> Martin
>
>
> On Aug 3, 2009, at 4:45 AM, Ryan Abernathey wrote:
>
>> Hi Jody,
>>
>> You may have gathered from my recent posts to this list that I have
>> been wrestling with the same question. I started out several years ago
>> using MDS files but have switched to NetCDF for my latest project. I
>> have concluded that NetCDF is much better for several reasons:
>>
>> 1) NetCDF files are not stored in memory by MATLAB. With MDS files, I
>> often ran up against MATLAB's memory limitations when dealing with
>> large 64-bit 3D data files. This is not an issue using NetCDF, as the
>> data is read directly from the filesystem only when it is needed. Also,
>> there is no "load time" when instantiating a NetCDF file in MATLAB--it
>> happens instantly. This is far superior to how MDS files are handled,
>> and it means MATLAB's memory no longer limits the size of the NetCDF
>> files you can work with.
>>
>> 2) Grid and coordinate information is embedded in the NetCDF files,
>> along with units, descriptions, and time information (i.e. metadata).
>> This means that you don't need to keep referencing the manual to figure
>> out the precise spatial coordinates for each of your diagnostics. Very
>> useful.
>>
>> 3) The output from all timesteps is condensed into one file. Combined
>> with the ability to output different diagnostics into the same file
>> (using the diagnostics package), this means you can potentially store
>> all of the output you wish to analyze from a particular run in one
>> single file. I suspect this would solve all your organizational
>> problems.
>>
>> However, there is one major disadvantage, especially for large runs.
>>
>> * The globalFiles or useSingleCpuIO options do not work with NetCDF
>> output. Each tile writes its own file. So when your run is done you
>> have to use a script called gluemnc (available as a MATLAB or shell
>> script) to join together the different tiles into one global netCDF
>> file. (Your post gave the impression you aren't currently using this
>> option, so this extra step probably won't seem like a big deal
>> anyway.)
>>
>> Overall I would definitely recommend switching to netCDF. The long-
>> term benefits will outweigh the temporary pain.
>>
>> Hope this helps!
>>
>> -Ryan
>>
>> p.s. Many people apparently prefer to keep using MDS pickup files, but
>> that is a different thread...
>>
>>
>> On Aug 2, 2009, at 2:48 PM, Klymak Jody wrote:
>>
>>>
>>> Hi all,
>>>
>>> As an amateur numerical modeller using the MITgcm I thought I'd ask
>>> for folks' data format and archiving ideas/advice.
>>>
>>> I do my analysis in Matlab, and am unlikely to change that. I've
>>> been writing the bare binary files (mds?) and reading those in fine
>>> with the matlab rdmds.m function. It works very well, and I
>>> appreciate the effort that went into it.
>>>
>>> However, as I get to larger simulations (ahem, larger for me means
>>> 16 or 32 tiles instead of 4 or 8), I start to wonder about the
>>> thousands of tile files on my machine, and if that is really the
>>> most efficient way for me to be storing my data. So:
>>>
>>> Is there an inherent advantage to switching to netcdf?
>>>
>>> To be honest I'm not sure what files are produced from the netcdf
>>> output - it looks like they are per-tile, and monolithic in that one
>>> file contains the whole run for that tile? If correct, how fast are
>>> they to read in matlab? I'm running a simulation that will reach
>>> 3 GB/tile.
>>>
>>> Is there more meta information? I am always flummoxed that there is
>>> no "time" in the MDS meta files, so I have to figure out what dt was
>>> for my run and multiply by iteration number.
>>>
>>> Parallel discussion: How do folks organize and keep track of their
>>> model runs? I have a large number now, and quite frankly I forget
>>> which ones are trash, and which ones I am using for my latest paper.
>>> Sure, I have to be more organized, but rather than reinvent the wheel,
>>> I'd love to hear how folks who have been doing this for a while keep
>>> track. Being lazy, automagic methods are always appreciated...
>>>
>>> Thanks for any thoughts folks feel like sharing...
>>>
>>> Cheers, Jody
>>> _______________________________________________
>>> MITgcm-support mailing list
>>> MITgcm-support at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-support