
I am trying to write a dask array to a netCDF file and I am getting a memory error, which I find a little strange because the dask array isn't very big, about 0.04 GB. Its dimensions are as follows:

    <xarray.Dataset>
    Dimensions:    (latitude: 2000, longitude: 5143)
    Coordinates:
      * latitude   (latitude) float64 -29.98 -29.93 -29.88 -29.83 -29.78 -29.73 ...
      * longitude  (longitude) float64 -180.0 -179.9 -179.9 -179.8 -179.8 -179.7 ...
    Data variables:
        Tmax       (latitude, longitude) float32 dask.array<shape=(2000, 5143), chunksize=(2000, 5143)>

I also tried rechunking, but that didn't help either. Please let me know if you have any tips. Thank you!

Here is how I am generating the dask array to be written to netcdf.

    import numpy as np
    import xarray as xr

    # Lazily open all input files, concatenating along a new 'Month' dimension
    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True)
    MONTH_MEAN = DATA.mean(dim='Month')

    # Difference between the mean of the last 17 years and the mean of the first 17 years
    DIFF_MEAN = (MONTH_MEAN.isel(time=np.arange(17, 34)).mean(dim='time') -
                 MONTH_MEAN.isel(time=np.arange(17)).mean(dim='time'))

    OUTFILE = OUTFILE_template.format(CHIRTS_INDIR, DATA_LIST[c])

    DIFF_MEAN.to_netcdf(OUTFILE)

The dimensions of DATA, the original dataset containing data from all the input files, are:

    <xarray.Dataset>
    Dimensions:    (Month: 12, latitude: 2000, longitude: 5143, time: 34)
    Coordinates:
      * latitude   (latitude) float64 -29.98 -29.93 -29.88 -29.83 -29.78 -29.73 ...
      * longitude  (longitude) float64 -180.0 -179.9 -179.9 -179.8 -179.8 -179.7 ...
      * time       (time) datetime64[ns] 1983-12-31 1984-12-31 1985-12-31 ...
    Dimensions without coordinates: Month
    Data variables:
        Tmax       (Month, time, latitude, longitude) float32 dask.array<shape=(12, 34, 2000, 5143), chunksize=(1, 34, 2000, 5143)>
  • Can you add details on how you make this dask array? It's quite likely that you are running out of memory for some intermediate operation rather than calculating the final dask array. Commented May 22, 2018 at 16:40
  • I just added more details on the operations I am doing to create that dask array. You make a good point. I am reading some large datasets, calculating annual averages over the first 17 years and then over the last 17 years, and finally taking the difference between the two averages, which results in the dask array I'd like to write to netCDF. Commented May 22, 2018 at 20:26

1 Answer


Each array chunk in your dataset contains 34*2000*5143*4/1e9 = 1.4 GB of data. This is rather large for working with dask arrays. As a rule of thumb, you want to be able to store something like 5-10 array chunks in memory at once, per CPU core. Smaller chunks (~100 MB) would probably speed up your computation and also reduce memory requirements. See here for more guidelines on chunk sizes.
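If you want to sanity-check the per-chunk footprint yourself, you can inspect the dask array backing the variable. A quick sketch, assuming the DATA dataset and Tmax variable from the question:

    import numpy as np

    chunk_shape = DATA['Tmax'].data.chunksize   # (1, 34, 2000, 5143) for the dataset above
    itemsize = DATA['Tmax'].dtype.itemsize      # 4 bytes for float32
    print(np.prod(chunk_shape) * itemsize / 1e9, 'GB per chunk')   # ~1.4 GB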

To adjust chunk sizes with xarray/dask, use the chunks= argument in open_mfdataset, e.g.,

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True,
                             chunks={'latitude': 1000, 'longitude': 1029})
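Alternatively, if you'd rather keep the original open_mfdataset call, you can rechunk after opening with Dataset.chunk(). A minimal sketch along the same lines (the chunk sizes here are only illustrative):

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True)
    # Split each (1, 34, 2000, 5143) chunk into (1, 34, 1000, 1029) pieces,
    # i.e. roughly 0.14 GB per chunk instead of ~1.4 GB
    DATA = DATA.chunk({'latitude': 1000, 'longitude': 1029})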

1 Comment

Great. Thanks very much! I actually just experimented with chunks={'time': 1}, and that alone helped a lot. I'll check now how latitude and longitude chunks can help reduce the computation time.
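For reference, the variant mentioned in this comment would look roughly like the following (just a sketch, reusing the arguments from the question):

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True,
                             chunks={'time': 1})
    # Each chunk is then (1, 1, 2000, 5143) float32 values, i.e. about 41 MB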
