
I am trying to write a dask array to a netCDF file and I am getting a memory error, which I find a little strange because the dask array isn't very big, about 0.04 GB. Its dimensions are as follows:

    <xarray.Dataset>
    Dimensions:    (latitude: 2000, longitude: 5143)
    Coordinates:
      * latitude   (latitude) float64 -29.98 -29.93 -29.88 -29.83 -29.78 -29.73 ...
      * longitude  (longitude) float64 -180.0 -179.9 -179.9 -179.8 -179.8 -179.7 ...
    Data variables:
        Tmax       (latitude, longitude) float32 dask.array<shape=(2000, 5143), chunksize=(2000, 5143)>

I also tried rechunking, but that didn't help either. Please let me know if you have any tips. Thank you!

Here is how I am generating the dask array to be written to netcdf.

    import numpy as np
    import xarray as xr

    # Lazily open all input files, concatenating along a new 'Month' dimension
    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True)
    MONTH_MEAN = DATA.mean(dim='Month')

    # Difference between the mean of the last 17 years and the mean of the first 17 years
    DIFF_MEAN = (MONTH_MEAN.isel(time=np.arange(17, 34)).mean(dim='time') -
                 MONTH_MEAN.isel(time=np.arange(17)).mean(dim='time'))

    OUTFILE = OUTFILE_template.format(CHIRTS_INDIR, DATA_LIST[c])

    DIFF_MEAN.to_netcdf(OUTFILE)

The dimensions of DATA, the original dataset containing data from all the input files, are:

    <xarray.Dataset>
    Dimensions:    (Month: 12, latitude: 2000, longitude: 5143, time: 34)
    Coordinates:
      * latitude   (latitude) float64 -29.98 -29.93 -29.88 -29.83 -29.78 -29.73 ...
      * longitude  (longitude) float64 -180.0 -179.9 -179.9 -179.8 -179.8 -179.7 ...
      * time       (time) datetime64[ns] 1983-12-31 1984-12-31 1985-12-31 ...
    Dimensions without coordinates: Month
    Data variables:
        Tmax       (Month, time, latitude, longitude) float32 dask.array<shape=(12, 34, 2000, 5143), chunksize=(1, 34, 2000, 5143)>
  • Can you add details on how you make this dask array? It's quite likely that you are running out of memory for some intermediate operation rather than calculating the final dask array. Commented May 22, 2018 at 16:40
  • I just added more details on the operations I am doing to create that dask array. You make a good point. I am reading some large datasets, calculating annual averages over the first 17 years and then over the last 17 years, and finally taking the difference between the two averages, which results in the dask array I'd like to write to netCDF. Commented May 22, 2018 at 20:26

1 Answer


Each array chunk in your dataset contains 34*2000*5143*4/1e9 = 1.4 GB of data. This is rather large for working with dask arrays. As a rule of thumb, you want to be able to store something like 5-10 array chunks in memory at once, per CPU core. Smaller chunks (~100 MB) would probably speed up your computation and also reduce memory requirements. See here for more guidelines on chunk sizes.
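If you want to sanity-check the per-chunk footprint yourself, you can inspect the dask array backing the variable. A quick sketch, assuming the DATA dataset and Tmax variable from the question:

    import numpy as np

    chunk_shape = DATA['Tmax'].data.chunksize   # (1, 34, 2000, 5143) for the dataset above
    itemsize = DATA['Tmax'].dtype.itemsize      # 4 bytes for float32
    print(np.prod(chunk_shape) * itemsize / 1e9, 'GB per chunk')   # ~1.4 GB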

To adjust chunk sizes with xarray/dask, use the chunks= argument in open_mfdataset, e.g.,

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True,
                             chunks={'latitude': 1000, 'longitude': 1029})
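Alternatively, if you'd rather keep the original open_mfdataset call, you can rechunk after opening with Dataset.chunk(). A minimal sketch along the same lines (the chunk sizes here are only illustrative):

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True)
    # Split each (1, 34, 2000, 5143) chunk into (1, 34, 1000, 1029) pieces,
    # i.e. roughly 0.14 GB per chunk instead of ~1.4 GB
    DATA = DATA.chunk({'latitude': 1000, 'longitude': 1029})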

1 Comment

Great. Thanks very much! I actually just experimented with chunks={'time': 1}, and that alone helped a lot. I'll check now how latitude and longitude chunks can help reduce the computation time.
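For reference, the variant mentioned in this comment would look roughly like the following (just a sketch, reusing the arguments from the question):

    DATA = xr.open_mfdataset(INFILE, concat_dim='Month', autoclose=True,
                             chunks={'time': 1})
    # Each chunk is then (1, 1, 2000, 5143) float32 values, i.e. about 41 MB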
