
I am currently trying to open multiple netCDF files. They all have the same main dimension (which is just the number of rows) and several variables: time, platform_code, and others.

Here is the code I use to concatenate all the data:

import xarray

ds_disk_merged = xarray.open_mfdataset([path1, path2, path3, path4], concat_dim="row", combine="nested")

When I access the row coordinate, everything is fine: I get my numpy array, concatenated as expected:

In [5]: ds_disk_merged.row.data
Out[5]: array([     0,      1,      2, ..., 968041, 968042, 968043])

But when I access one of my variables, I do not get a numpy array:

In [6]: ds_disk_merged.time.data
Out[6]: dask.array<concatenate, shape=(968044,), dtype=datetime64[ns], chunksize=(253158,), chunktype=numpy.ndarray>

Do you know how to get all the variables' data concatenated, following the same process as my rows?

For information, the number of rows per file (path by path) is as follows:

In [7]: all_nc_files_number_of_rows 
Out[7]: [249499, 232995, 232392, 253158]

1 Answer


What you have is a dask.array: a chunked, scheduled (but not yet in-memory) collection of multiple numpy arrays. In addition to providing a labeled indexing interface to arrays, xarray can work with multiple backends, which form the computational engine underlying the array operations on the .data attribute. When you use xr.open_mfdataset, the variables will always be backed by chunked dask arrays. See the xarray docs on Parallel Computing with Dask for more info.
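For example, you can inspect the dask backing before loading anything. A minimal sketch, reusing the path1 ... path4 variables from your question (the chunk sizes shown are the per-file row counts you reported):

import xarray

# Each variable is backed by a lazy dask array; nothing is read from disk yet.
ds = xarray.open_mfdataset([path1, path2, path3, path4], concat_dim="row", combine="nested")
print(type(ds.time.data))    # <class 'dask.array.core.Array'>
print(ds.time.data.chunks)   # one chunk per input file: ((249499, 232995, 232392, 253158),)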

You can just convert to numpy with ds_disk_merged = ds_disk_merged.compute(). Note that the work of reading the netCDF data will not happen until you trigger a computation like this; until then, dask only schedules the operation. Because of this, read errors, memory bottlenecks, and other issues may only surface when the job executes, rather than at the line of code that is actually causing the problem. See the dask docs on lazy execution for an intro to this concept.
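To make the lazy behavior concrete, here is a short sketch of the difference between the scheduled and the computed array:

# Still lazy: this only builds a task graph, no file I/O happens here.
lazy_time = ds_disk_merged.time

# .compute() triggers the scheduled reads and returns an in-memory object.
time_in_memory = lazy_time.compute()
print(type(time_in_memory.data))   # <class 'numpy.ndarray'>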

For starters, check the size of the array with ds_disk_merged[variable_name].data.nbytes and make sure it fits comfortably in memory before calling compute().
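For example (a minimal sketch; "time" stands in for any variable name, and the 2 GB threshold is purely illustrative):

nbytes = ds_disk_merged["time"].data.nbytes   # computed from shape/dtype, no data is read
print(f"time: {nbytes / 1e9:.2f} GB")

# Only load into memory if it fits comfortably in RAM.
if nbytes < 2e9:   # illustrative 2 GB threshold
    ds_disk_merged = ds_disk_merged.compute()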


1 Comment

Thanks a lot for answering, it is very nice! Everything works. I have to admit that I did not completely understand how dask arrays work. Now it is much clearer! Have a nice day :)
