
The Python module xarray has great support for loading/mapping netCDF files, even lazily with dask.

The data source I have to work with consists of thousands of HDF5 files, with lots of groups, datasets, and attributes - all created with h5py.

The question is: How can I load (or, even better, lazily map with dask) HDF5 data (datasets, metadata, ...) into an xarray dataset structure?

Does anybody have experience with that, or has anybody come across a similar issue? Thank you!

  • It is one of the basic functions - you should read the doc, try it and report back here if you have any problem. As it stands, this is not really a SO-like question, you may see negative votes. Commented Feb 11, 2019 at 15:58
  • @mdurant thank you for your comment. I will try to formulate my question clearer. Commented Feb 12, 2019 at 13:25
  • I am not familiar with the xarray module. However, h5py accesses HDF5 data as numpy arrays, so you simply need to access an HDF5 dataset as an array and manipulate the data into an xarray dataset format. Commented Feb 12, 2019 at 14:31

1 Answer


One possible solution to this is to open the hdf5-file using netCDF4 in diskless non-persistence mode:

import netCDF4

ncf = netCDF4.Dataset(hdf5file, diskless=True, persist=False)

Now you can inspect the file contents including groups.

After that you can make use of xarray.backends.NetCDF4DataStore to open the wanted hdf5-groups (xarray can only get hold of one hdf5-group at a time):

import xarray

nch = ncf.groups.get('hdf5-name')
xds = xarray.open_dataset(xarray.backends.NetCDF4DataStore(nch))

This will give you a dataset xds with all attributes and variables (datasets) of the group hdf5-name. Note that you will not get access to sub-groups; you would need to open sub-groups via the same mechanism. If you want to apply dask, you would need to add the chunks keyword with suitable values.

There is no (real) automatic decoding of the data as there would be for NetCDF files. If you have an integer-compressed 2D variable (dataset) var with attributes gain and offset, you can add the NetCDF-specific attributes scale_factor and add_offset to the variable:

var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
ds = xarray.decode_cf(xds)

This will decode your variable using the netCDF CF-decoding mechanisms.

Additionally, you could give the extracted dimensions useful names (by default you will get something like phony_dim_0, phony_dim_1, ..., phony_dim_N) and assign new (as in the example) or existing variables/coordinates to those dimensions, to get as much of the xarray machinery as possible:

var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
dims = var.dims
xds['var'] = var.rename({dims[0]: 'x', dims[1]: 'y'})
xds = xds.assign({'x': (['x'], xvals, xattrs)})
xds = xds.assign({'y': (['y'], yvals, yattrs)})
ds = xarray.decode_cf(xds)



2 Comments

This looks like a good approach, except that I keep getting AttributeError: 'NoneType' object has no attribute 'dimensions' when trying to open my hdf5 file. Was it written in a way that's not compatible with netcdf?
@TomCho Which versions of xarray, hdf5, libnetcdf and netCDF4 are you using? And what is the code and the error message?
