
I am currently trying to open multiple netCDF files. They all have the same main dimension (which is just the number of rows) and several variables: time, platform_code, and others.

Here is the code I use to concatenate all the data:

import xarray

ds_disk_merged = xarray.open_mfdataset([path1, path2, path3, path4], concat_dim="row", combine="nested")

When I access the row coordinate, everything is fine: I get my numpy array, concatenated as expected:

In [5]: ds_disk_merged.row.data
Out[5]: array([     0,      1,      2, ..., 968041, 968042, 968043])

But when I access one of my variables, I do not get a numpy array:

In [6]: ds_disk_merged.time.data
Out[6]: dask.array<concatenate, shape=(968044,), dtype=datetime64[ns], chunksize=(253158,), chunktype=numpy.ndarray>

Do you know how to get all the variables' data concatenated, following the same process as my rows?

For information, the number of rows per file (path by path) is as follows:

In [7]: all_nc_files_number_of_rows 
Out[7]: [249499, 232995, 232392, 253158]

1 Answer


What you have is a dask.array: a chunked, scheduled (but not yet in-memory) collection of multiple numpy arrays. In addition to providing a labeled indexing interface to arrays, xarray can work with multiple backends, which form the computational engine underlying the array operations on the .data attribute. When you use xr.open_mfdataset, the variables will always be backed by chunked dask arrays. See the xarray docs on Parallel Computing with Dask for more info.
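For example, you can inspect the dask backing before loading anything. A minimal sketch, reusing the path1 ... path4 variables from your question (the chunk sizes shown are the per-file row counts you reported):

import xarray

# Each variable is backed by a lazy dask array; nothing is read from disk yet.
ds = xarray.open_mfdataset([path1, path2, path3, path4], concat_dim="row", combine="nested")
print(type(ds.time.data))    # <class 'dask.array.core.Array'>
print(ds.time.data.chunks)   # one chunk per input file: ((249499, 232995, 232392, 253158),)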

You can just convert to numpy with ds_disk_merged = ds_disk_merged.compute(). Note that the work of reading the netCDF data will not happen until you trigger a computation like this; until then, dask only schedules the operation. Because of this, read errors, memory bottlenecks, and other issues may only surface when the job executes, rather than at the line of code that is actually causing the problem. See the dask docs on lazy execution for an intro to this concept.
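To make the lazy behavior concrete, here is a short sketch of the difference between the scheduled and the computed array:

# Still lazy: this only builds a task graph, no file I/O happens here.
lazy_time = ds_disk_merged.time

# .compute() triggers the scheduled reads and returns an in-memory object.
time_in_memory = lazy_time.compute()
print(type(time_in_memory.data))   # <class 'numpy.ndarray'>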

For starters, check the size of the array with ds_disk_merged[variable_name].data.nbytes and make sure it fits comfortably in memory before calling compute().
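For example (a minimal sketch; "time" stands in for any variable name, and the 2 GB threshold is purely illustrative):

nbytes = ds_disk_merged["time"].data.nbytes   # computed from shape/dtype, no data is read
print(f"time: {nbytes / 1e9:.2f} GB")

# Only load into memory if it fits comfortably in RAM.
if nbytes < 2e9:   # illustrative 2 GB threshold
    ds_disk_merged = ds_disk_merged.compute()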


1 Comment

Thanks a lot for answering, it is very nice! Everything works. I have to admit that I did not completely understand how dask arrays work. Now it is much clearer! Have a nice day :)
