[MITgcm-devel] Different Flavors of MITgcm netCDF

Fri Jan 19 14:07:04 EST 2018

Dear MITgcm Colleagues,

Below is a long meditation on the status of netCDF within the MITgcm
ecosystem. I welcome your feedback and thoughts on this topic.

Preamble

First, let me acknowledge and thank everyone who is working to distribute
model data in a user-friendly format. This is important and noble work,
usually done without much incentive in terms of recognition or
publications. My intent here is not to criticize any of this selfless
effort. My goal is to ensure that, going forward, the data we distribute
has the highest possible value to the community.

My interest in this topic evolved from my goal to build useful tools for
post-processing analysis of GCM output. The Python package Xarray
<http://xarray.pydata.org/en/latest/> is
increasingly being adopted for this purpose, due to its ease of use,
powerful set of computational methods, ability to read all common data
formats, and scalability to distributed systems. (We now have an NSF award
to support development of these tools--the name of this project is Pangeo
<https://pangeo-data.github.io/>.) I in particular have been working on a
tool called xgcm <http://xgcm.readthedocs.io/en/latest/>, which is designed
to intelligently handle operations on finite-volume grids, providing
grid-aware operations such as difference, interpolation, and cumsum. Xgcm
consumes and produces xarray DataArrays. Because the goal of xgcm is to
work with *any* model output, I have become acutely aware of the
differences between different data formats. Some choices made when
producing data products can make it very easy for xarray and xgcm to
process the data. Other choices can make it very hard.

There are two sets of metadata conventions that are relevant to this
discussion:

   - CF Conventions
   <http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html>,
   which apply to nearly all climate data
   - Comodo conventions <http://pycomodo.forge.imag.fr/norm.html>, a less
   well-known but highly useful standard developed by the French NEMO group
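
To make the Comodo idea concrete, here is a minimal sketch of how per-coordinate attributes can tell a tool where each dimension sits on the grid. It assumes the `axis` and `c_grid_axis_shift` attributes, with a missing shift meaning cell center and a negative shift meaning the left face. The coordinate names (XC, XG, Z, Zl) follow xmitgcm's naming; the dictionary contents and the helper `group_by_axis` are hypothetical and only for illustration.

```python
# Hypothetical coordinate metadata in the style of the Comodo conventions.
# The attribute values below are illustrative assumptions, not a dump of
# any real file.
COORD_ATTRS = {
    "XC": {"axis": "X"},
    "XG": {"axis": "X", "c_grid_axis_shift": -0.5},
    "Z": {"axis": "Z"},
    "Zl": {"axis": "Z", "c_grid_axis_shift": -0.5},
}


def grid_position(attrs):
    """Classify a coordinate as center, left, or right from its shift."""
    shift = attrs.get("c_grid_axis_shift")
    if shift is None:
        return "center"
    return "left" if shift < 0 else "right"


def group_by_axis(coord_attrs):
    """Group coordinate names by axis and by position along that axis."""
    axes = {}
    for name, attrs in coord_attrs.items():
        axes.setdefault(attrs["axis"], {})[grid_position(attrs)] = name
    return axes


axes = group_by_axis(COORD_ATTRS)
```

With metadata like this, a grid-aware tool does not have to guess from dimension names alone which variables are staggered relative to each other.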

Problem Statement

There is a proliferation of different flavors of netCDF files generated
from MITgcm simulations. Here is a partial list, based on what I am aware of:

   - The native MNC package
   <http://mitgcm.org/public/r2_manual/latest/online_documents/node272.html>
   netCDF output
   - MITgcm MDS files as read by xmitgcm
   <http://xmitgcm.readthedocs.io/en/latest/>, which turns them into xarray
   Datasets, essentially equivalent to netCDF (xmitgcm can read data with or
   without ancillary grid variables)
   - The ECCOv4 products which were distributed using a custom netCDF
   format produced offline
   - New SOSE products <http://sose.ucsd.edu/bsose_solution_Iter105.html>
   - Jody Klymak's new NF90 <https://github.com/altMITgcm/MITgcm/pull/16>
   package with parallel-netcdf support

Not all of these flavors are optimally compatible with xarray and xgcm. I
took the time to load the variables U, V, W, Theta, and Eta in all of these
different formats to examine the conventions used for the dimension names.
This is summarized in the table below:

     | mnc package      | xmitgcm             | xmitgcm (no grid)  | ECCOv4                   | SOSE                             | nf90io
-----+------------------+---------------------+--------------------+--------------------------+----------------------------------+------------------------
U    | U(T, Z, Y, Xp1)  | U(time, Z, YC, XG)  | U(time, k, j, i_g) | UVELMASS(i1, i2, i3, i4) | Uvel(iTIME, iDEPTH, iLAT, iLON)  | UVEL(record, k, j, i_g)
V    | V(T, Z, Yp1, X)  | V(time, Z, YG, XC)  | V(time, k, j_g, i) | VVELMASS(i1, i2, i3, i4) | Vvel(iTIME, iDEPTH, iLAT, iLON)  | VVEL(record, k, j_g, i)
W    | W(T, Zl, Y, X)   | W(time, Zl, YC, XC) | W(time, k_l, j, i) | WVELMASS(i1, i2, i3, i4) | Wvel(iTIME, iDEPTH, iLAT, iLON)  | WVEL(record, k_l, j, i)
Temp | Temp(T, Z, Y, X) | T(time, Z, YC, XC)  | T(time, k, j, i)   | THETA(i1, i2, i3, i4)    | Theta(iTIME, iDEPTH, iLAT, iLON) |
Eta  | Eta(T, Y, X)     | Eta(time, YC, XC)   | Eta(time, j, i)    | ETAN(i1, i2, i3)         | SSH(iTIME, iLAT, iLON)           | ETAN(record, j, i)
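
As a small illustration of why the naming matters, the sketch below translates mnc dimension names into the xmitgcm ones with a plain dictionary. The mapping is read off the mnc and xmitgcm entries for U, V, and W above and is an assumption on my part, not something either package ships. The helper name `translate` is hypothetical.

```python
# Assumed one-to-one mapping between mnc and xmitgcm dimension names,
# inferred from the comparison of U, V, and W. Z and Zl are unchanged.
MNC_TO_XMITGCM = {"T": "time", "X": "XC", "Xp1": "XG", "Y": "YC", "Yp1": "YG"}


def translate(dims, mapping=MNC_TO_XMITGCM):
    """Rename each dimension, leaving unknown names untouched."""
    return tuple(mapping.get(d, d) for d in dims)


u_dims = translate(("T", "Z", "Y", "Xp1"))   # mnc U
v_dims = translate(("T", "Z", "Yp1", "X"))   # mnc V
w_dims = translate(("T", "Zl", "Y", "X"))    # mnc W

# In the SOSE flavor, U and V carry identical dimension names, so no
# name-only mapping can recover their different grid positions.
SOSE_U = ("iTIME", "iDEPTH", "iLAT", "iLON")
SOSE_V = ("iTIME", "iDEPTH", "iLAT", "iLON")
sose_is_ambiguous = SOSE_U == SOSE_V
```

A rename like this only works when each cell position already has its own dimension name; for the flavors that reuse one name for every position, the position has to come from knowledge outside the file.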

In terms of processing with xarray and xgcm, there are a couple of
important things to note:

   1. Some flavors use unique dimensions for different points relative to
   the model grid (e.g., in mnc output, Z for variables at the cell
   vertical center, Zl for variables at the cell vertical face), while
   others don't make this distinction
   2. Some flavors keep the dimension names consistent across different
   files, while others use the same dimension name to indicate very different
   things in different files. For example, in ECCOv4, i2 is the vertical
   dimension for THETA but i2 is a horizontal dimension for ETAN.

Both of these issues can cause complications for downstream packages that
wish to analyze the data. For example, if files use the same name to
refer to different dimensions (issue 2), they can't easily be merged into a
single xarray dataset. Or, on the flip side, if two variables that are
actually at different spatial locations (e.g. UVEL and VVEL) have the same
dimensions (e.g. in SOSE), this gives the mistaken impression that they can
be multiplied or added directly. (In reality they need to be interpolated
from one position to the next; that's what xgcm is for.)
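
Both failure modes can be reproduced with a few lines of xarray on synthetic arrays. This is only a sketch: the array sizes are arbitrary, the data are placeholders, and the variables are stand-ins built by hand rather than read from any of the actual products.

```python
import numpy as np
import xarray as xr

ny, nx = 3, 4
data = np.ones((ny, nx))

# Issue 1, SOSE-style: U and V share dimension names even though they
# live at different points of the grid cell.
u_shared = xr.DataArray(data, dims=("iLAT", "iLON"), name="Uvel")
v_shared = xr.DataArray(data, dims=("iLAT", "iLON"), name="Vvel")
naive = u_shared * v_shared   # multiplies point by point without complaint

# xmitgcm-style: each cell position has its own dimension name.
u = xr.DataArray(data, dims=("YC", "XG"), name="U")
v = xr.DataArray(data, dims=("YG", "XC"), name="V")
outer = u * v                 # broadcasts over all four dimensions

# Issue 2: "i2" is depth (size 5) in one file and a horizontal axis
# (size ny) in the other, so the two datasets cannot be combined.
theta = xr.Dataset({"THETA": (("i1", "i2", "i3", "i4"), np.zeros((1, 5, ny, nx)))})
etan = xr.Dataset({"ETAN": (("i1", "i2", "i3"), np.zeros((1, ny, nx)))})
try:
    xr.merge([theta, etan])
    merge_error = None
except ValueError as err:
    merge_error = err
```

The four-dimensional `outer` product is xarray's way of saying that the two fields are not co-located, which is a far more useful signal than a silently wrong two-dimensional result.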

Below I summarize the status of these issues in the different flavors:

                                                   | mnc package | xmitgcm | xmitgcm (no grid) | ECCOv4 | SOSE | nf90io
---------------------------------------------------+-------------+---------+-------------------+--------+------+-------
dimension order (time, depth, lat, lon)            | ✅          | ✅      | ✅                | ✅     | ✅   | ✅
consistent dimension names across files            | ✅          | ✅      | ✅                | ❌     | ✅   | ✅
unique dimension name for different cell positions | ✅          | ✅      | ✅                | ❌     | ❌   | ✅
comodo attributes in metadata                      | ❌          | ✅      | ✅                | ❌     | ❌   | ✅

Where to go from here?

Going forward, I think it is important for the MITgcm community to
standardize the netCDF files we are putting out in the world. Proliferating
these different formats places an extra burden on users to write
specialized postprocessing code, making our products harder to use.
Products that are easy to use will be adopted more widely.

There are two basic scenarios that I hope we can support:

   1. Outputting netCDF online
   2. Generating netCDF offline (from MDS output)

Ideally, the thing the end user sees will look the same regardless of how
it was produced.

I am very optimistic about Jody's nf90io package. The reason why netCDF is
not widely used now as an output format is that the MNC package doesn't
work in singleCpuIO mode. (Yes there are scripts to glue them together, but
they are hard to use and don't scale well to very large simulations.)
Ideally, in the future we will move away from MDS and toward generating
netCDF online.

However, scenario 2 (generating netCDF offline from MDS output) will likely
remain a necessity for quite a while. We should have a uniform way of doing
this...

xmitgcm to the rescue?

Now comes the very opinionated part. I believe that we already have a
universal tool for generating netCDF from MDS output. It is the xmitgcm
Python package <http://xmitgcm.readthedocs.io/en/latest/>. This package can
read MITgcm MDS data into xarray data structures and then write it to
netCDF. The amount of code required to do this is about two lines, so
expertise in Python is not required. It produces netCDF files that are (in
my subjective opinion) ideally formatted for postprocessing. Also, it
looks like Jody adopted many of the conventions from xmitgcm for the nf90io
package, meaning that netCDF files generated through these two different
paths will look very similar.
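
For reference, the conversion looks roughly like the sketch below. `open_mdsdataset` is xmitgcm's reader and `to_netcdf` is the standard xarray method; the wrapper function `mds_to_netcdf` and its argument names are hypothetical, added only so the two lines have somewhere to live. Depending on the run, `open_mdsdataset` may need extra keyword arguments that are not shown here.

```python
def mds_to_netcdf(data_dir, out_path):
    """Read MITgcm MDS output from data_dir and write it to a netCDF file."""
    # Imported inside the function so this sketch can be loaded even
    # where xmitgcm is not installed.
    from xmitgcm import open_mdsdataset

    ds = open_mdsdataset(data_dir)   # MDS files -> xarray Dataset
    ds.to_netcdf(out_path)           # xarray Dataset -> netCDF file
    return ds
```

A call would look like `mds_to_netcdf("./run", "output.nc")`, where both paths are placeholders for your own run directory and output file.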

xmitgcm is a work in progress. The choices made in terms of variable names
and dimensions were intended to mimic the MNC package as much as possible.
But ultimately I made some changes that I thought would result in more
useful and readable data. For those of you who are interested in producing
netCDF files from MDS output, *I welcome you to try out xmitgcm and see if
it suits your needs!* If you have ideas for how it can be improved, we
encourage you to engage with us on the GitHub issues page
<https://github.com/xgcm/xmitgcm/issues>.

If you managed to get this far, you must really care about netCDF! I
welcome your thoughts and feedback.

Cheers,
Ryan

p.s. This comparison, and the code that goes with it, can be found as an
ipython notebook here:
https://gist.github.com/rabernat/bf4c886c0555aa9a6e7820d9398d9bbf