I am trying to do a simple calculation on a very large netcdf file, and am struggling to speed it up -- probably because I program primarily in julia and R. I think xarray/dask are the best approach for this issue, but I'm struggling to get it to work.
The netcdf file has three dimensions: lat (4320), lon (8640), and time (792). It is a global dataset of precipitation. I want to calculate the yearly mean of precipitation (the time dimension is in month units, so every 12) at a set of 129,966 lat/lon point locations.
Here is what I have as a rough skeleton:
import xarray as xr
import netCDF4 as nc
import numpy as np
import pandas as pd
from dask import delayed, compute
from dask.distributed import Client
import dask
from tqdm import tqdm
num_workers = 10 # Adjust this to the number of cores you want to use
threads_per_worker = 1 # Adjust as needed
client = Client(n_workers = num_workers, threads_per_worker = threads_per_worker, dashboard_address=':8787')
## this is the netcdf file
ds = xr.open_mfdataset(f'../data/terraclim/TerraClimate_{var}_*.nc', chunks='auto')
lon = ds['lon'].values
lat = ds['lat'].values
# Initialize arrays for indices
lonindx = np.empty(coords.shape[0], dtype=int)
latindx = np.empty(coords.shape[0], dtype=int)
# Loop through each row of coords, a pandas dataframe of lat and lon coordinates to extract data from
for i in tqdm(range(coords.shape[0]), desc="Processing"):
lonindx[i] = np.argmin(np.abs(lon - coords.at[i, 'LON']))
latindx[i] = np.argmin(np.abs(lat - coords.at[i, 'LAT']))
## indexes the netcdf file (ds) by lat and lon and returns an array of year-means
def compute_means(i):
vals = ds[var][0:791, latindx[i], lonindx[i]].values
means = [np.mean(year) for year in np.array_split(vals, 66)]
return means
results = []
for i in range(coords.shape[0]):
res = dask.delayed(compute_means)(i)
results.append(res)
results = dask.compute(*results)
When I run the above script, however, it crashes due to memory limitations. I think I may be missing a more appropriate way to use xarray's internal compatability with dask, but I am totally lost. Any help would be greatly appreciated! As would solutions in julia or R.