How to count the sum of at least two consecutive days with the same values, from an xarray dataset in Python

Question

I have an xarray dataset with 3 dimensions (time, lon, lat) and a heat variable. The heat variable takes the value 1 or 0 (1 = more than 30 °C, 0 = less than 30 °C), and I want to calculate the number of events with at least three consecutive days with value 1 (the time step is daily).

I want to create an xarray dataset with a single time value (one day), where each pixel contains the number of events with at least three consecutive days with value 1.

time 155 = all July days for years 2026-2030

xarray.DataArray 'heat' time: 155, rlat: 6, rlon: 4

Do you have some ideas on how to do this?

Here you can find the NetCDF data: EC7_heatjulZA26_30.nc

import numpy as np
import xarray as xr

tm = 'C:/Path_to_NC/EC7_heatjulZA26_30.nc'
dmax = xr.open_dataset(tm, decode_coords="all")

da_max = dmax['tasmax']
da_max

xarray.DataArray 'tasmax' (time: 155, rlat: 6, rlon: 4)

I summed the data (the 1s and 0s) over time into one 6 x 4 raster, but I want to count only the 1s that follow each other on at least 3 consecutive days.

HW = np.sum(da_max, axis=0)
HW

array([[24, 20, 14, 11],
       [26, 21, 15, 11],
       [23, 23, 15, 12],
       [14, 15, 14, 11],
       [12, 12, 14, 11],
       [14, 12, 15, 13]])

Could you provide some original dataset and the result. Hard to answer the question with code.
– Kaison L, Commented Jun 29, 2023 at 12:31
It is very difficult to answer your question without seeing any of your data nor any of the solution you have written which produces your problem. Please edit your question to show a minimal reproducible set consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See Minimal Reproducible Example for details on how to best help us help you.
– itprorh66, Commented Jun 29, 2023 at 13:51
Could you somehow use differentiation along time axis and original values? Basically, if the np.diff is 0 while the original values are one, there were consecutive time moments with larger than 30 degree values...
– msi_gerva, Commented Jun 29, 2023 at 13:57
Are there four axes, and a fifth value? Or three axes, with heat the fourth dimension and being the (0-1) value? Having heat as an axis and as a value is awkward to me.
– 9769953, Commented Jun 30, 2023 at 8:04
Ok, I added the file for download, the NC has 3 dimensions (time, rlat, rlon ) and 1 variable (heat).
– Martin, Commented Jun 30, 2023 at 10:11

ClimateUnboxed · Accepted Answer · 2025-05-21 14:13:03Z

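A simple approach is to take a running sum of the binary field with xarray's rolling sum, using a window size equal to your event length N, flag the windows whose sum equals N with a logical expression, and then add those up. Note that this counts every N-day window that is entirely 1, so a spell of N+1 consecutive days contributes two. You can do it in a couple of lines: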

event_len=3 # this is N
ds_sum = ds.rolling(time=event_len, center=False).sum()
events = (ds_sum == event_len).astype(int).sum(dim="time")
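
To make the rolling-sum logic concrete, here is a minimal sketch on a single made-up pixel (the values and dates are assumptions for illustration, not taken from the OP's file). It shows that the expression counts fully hot 3-day windows: the 4-day spell yields two windows and the 3-day spell yields one.

```python
import pandas as pd
import xarray as xr

# Hypothetical one-pixel series: a 4-day hot spell, a gap, then a 3-day spell.
values = [0, 1, 1, 1, 1, 0, 1, 1, 1, 0]
time = pd.date_range("2026-07-01", periods=len(values), freq="D")
heat = xr.DataArray(values, coords={"time": time}, dims="time")

event_len = 3  # this is N

# Rolling sum over N days; the first N - 1 entries are NaN (incomplete window).
run_sum = heat.rolling(time=event_len, center=False).sum()

# True on each day that closes a window made up entirely of ones.
full_window = run_sum == event_len

n_windows = int(full_window.sum(dim="time"))
print(n_windows)  # 3 windows, from 2 separate spells
```

If distinct spells are what you need rather than windows, the flagged days have to be reduced to one per spell before summing.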

If this logic is unclear, I have posted a YouTube video on my channel that explains it further.

However, as pointed out in the comment by @Patrick, while this works fine for continuous data, it has a minor error if your data only covers specific months, e.g. July: you may get a "heatwave" that consists of the last day of July in year X and the first two days of July in year X+1. To avoid this, we effectively want to turn the 3D [lon, lat, time] array into a 4D [lon, lat, year, day] array and apply the rolling function along the "day" dimension for each year separately. Luckily, xarray's groupby is ideal for this, and we then just use apply like this:

events_per_year = ds['data'].groupby(ds.time.dt.year).apply(
    lambda x: (x.rolling(time=event_len, center=False).sum() == event_len).astype(int).sum(dim="time")
)
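
The year-boundary problem can be reproduced with a small sketch. The series below is made up (one pixel, two Julys) so that the only run of three 1s straddles 31 July 2026 and 1 to 2 July 2027. The helper name `count_full_windows` is hypothetical, and `map` is used in place of `apply`, which xarray keeps as a backward-compatible name for the same operation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical one-pixel series covering July 2026 and July 2027 only.
time = pd.date_range("2026-07-01", "2027-07-31", freq="D")
time = time[time.month == 7]
values = np.zeros(len(time), dtype=int)
values[30] = 1      # 2026-07-31
values[31:33] = 1   # 2027-07-01 and 2027-07-02
heat = xr.DataArray(values, coords={"time": time}, dims="time")

event_len = 3

def count_full_windows(x):
    # Number of event_len-day windows made up entirely of ones.
    full = x.rolling(time=event_len, center=False).sum() == event_len
    return full.sum(dim="time")

# Rolling over the whole axis treats 31 July and the next 1 July as neighbours.
naive = int(count_full_windows(heat))

# Rolling inside each year keeps the two Julys apart.
per_year = heat.groupby(heat.time.dt.year).map(count_full_windows)

print(naive)            # 1 (spurious cross-year window)
print(per_year.values)  # [0 0]
```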

I made some dummy July data for 4 years to illustrate this on a small 3x3 grid (so this stays reproducible after the OP's NetCDF link disappears):

import xarray as xr
import numpy as np
import pandas as pd

# make up some dummy July data for a few years on a small grid:
time = pd.date_range(start='2010-07-01', end='2013-07-31', freq='D')
time = time[time.month == 7]
lon = np.arange(0, 3, 1)
lat = np.arange(0, 3, 1)
data = np.random.randint(0, 2, size=(len(time), len(lat), len(lon)))
ds = xr.Dataset(
    {"data": (("time", "lat", "lon"), data)},
    coords={"time": time, "lon": lon, "lat": lat},)

### now we define events of this length:

event_len=3

# this function wraps up 3 steps:
# 1. Calculate the rolling sum for 'event_len' days for each year
# 2. Check where this rolling sum equals 'event_len' (3 days here, as per the OP's definition)
# 3. Sum the true/false 1/0 in time dim
events_per_year = ds['data'].groupby(ds.time.dt.year).apply(
    lambda x: (x.rolling(time=event_len, center=False).sum() == event_len).astype(int).sum(dim="time")
)

print(events_per_year)

This gives output like the following (the input is random and unseeded, so the exact counts will differ between runs):

<xarray.DataArray (year: 4, lat: 3, lon: 3)>
array([[[3, 2, 2],
        [0, 6, 8],
        [3, 5, 4]],

       [[8, 2, 8],
        [4, 5, 6],
        [7, 4, 4]],

       [[9, 3, 1],
        [6, 4, 3],
        [2, 2, 7]],

       [[1, 3, 3],
        [2, 6, 8],
        [3, 2, 0]]])
Coordinates:
  * lat      (lat) int64 0 1 2
  * lon      (lon) int64 0 1 2
  * year     (year) int64 2010 2011 2012 2013
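
The counts above are numbers of fully hot 3-day windows per year. If "number of events" is instead read as the number of distinct spells of at least three days (an assumption about the OP's intent), one possible sketch is to keep only the first flagged day of each spell before summing. The function name `count_events` and the test series are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

def count_events(x, event_len=3):
    # True on each day that closes a window of event_len consecutive ones.
    full = x.rolling(time=event_len, center=False).sum() == event_len
    # Keep only the first such day of each spell, so a long spell counts once.
    first = full & ~full.shift(time=1, fill_value=False)
    return first.sum(dim="time")

# Hypothetical one-pixel series covering July 2026 and July 2027.
time = pd.date_range("2026-07-01", "2027-07-31", freq="D")
time = time[time.month == 7]
values = np.zeros(len(time), dtype=int)
values[0:5] = 1     # 1-5 July 2026: one 5-day spell
values[30:33] = 1   # 31 July 2026 to 2 July 2027: straddles the gap
values[41:44] = 1   # 11-13 July 2027: one 3-day spell
heat = xr.DataArray(values, coords={"time": time}, dims="time")

# Group by year so that spells cannot cross from one July into the next.
events = heat.groupby(heat.time.dt.year).map(count_events)
print(events.values)  # [1 1]
```

With the plain window count, the 5-day spell in 2026 would contribute 3 instead of 1.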

2 Comments

Neat solution, but it does not account for the discontinuity on the "time" axis in the netCDF file of the OP: 2026-07-31 is followed by 2027-07-01.
@Patrick Good point, but nothing that a little bit of groupby/apply magic can't handle. :-) Code updated to account for this, thanks for the input!

msi_gerva · Accepted Answer · 2023-07-03 09:36:03Z


I can propose the following solution:

#!/usr/bin/env ipython
# ---------------------
import numpy as np
import xarray as xr
from pylab import pcolormesh,show,colorbar,plot,title,legend,subplot,savefig
# -------------------
fin = 'EC7_heatjulZA26_30.nc' # filename in...

dfin = xr.open_dataset(fin) # let us use xarray to read data ...
vin = dfin['tasmax'].values # get only the values ...
ntime,ny,nx = np.shape(vin) # get the dimensions...
# --------------------------------------
dvin = np.diff(vin,axis=0) # let us take the diff along time axis...
# --------------------------------------
counts = np.sum(vin,axis=0) # let us count days with temp over threshold...
pcolormesh(counts);colorbar();title('Number of days over threshold');show() # let us view the map...
# --------------------------------------
dvin[dvin<0] = 0.e0; # let us remove the -1 when the temperature has dropped below the threshold...
icounts = np.sum(dvin,axis=0)
pcolormesh(icounts);colorbar();title('Number of instances over threshold (simple)');savefig('simple.png',bbox_inches='tight');show() # let us view the map...
# let us check:
plot(vin[:,1,0]);title('Number of instances found '+str(icounts[1,0]));show() # if you check, the number of instances is too high -- 9 instead of 6
# ---------------------------------------
# let us calculate correct one:
ntime,ny,nx = np.shape(vin) # get the dimensions...
dvin_org = np.diff(vin,axis=0); # the diff...

dvin = np.concatenate((np.zeros((1,ny,nx)),dvin_org),axis=0); # make the diff and original data same size...
dvin_n = np.concatenate((np.zeros((2,ny,nx)),dvin_org[:-1,:,:]),axis=0); # shift the diff +1 index in time
# ------------------------------------------------------------------------------------------------------
dvin[np.where((dvin_n==1)&(vin==1))] = 1.0 # in the original diff, add one at the locations where the derivative does not change -- these are the time moments we actually wanted...
# -------------------------------------------------------
icounts = np.sum(dvin,axis=0) # calculate the correct number of instances over threshold
pcolormesh(icounts);colorbar();title('Number of instances over threshold (complex)');savefig('corrected.png',bbox_inches='tight');show() # let us view the map...
# let us check:
plot(vin[:,2,2]);title('Number of instances found '+str(icounts[2,2]));show()

So, the simple calculation gives a map like this:

[figure: simple.png]

where the number of occurrences goes up to 9 in one grid cell. But this is an overestimate, as the time series looks like this (the 9 is taken from the simple solution):

[figure: time series at one grid cell]

The problem is that we also counted a couple of events where the temperature went over the threshold for only one day. Therefore, I made an additional check using the shifted derivative; see the code above.
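
A tiny made-up series illustrates why counting positive steps of the diff overestimates: every rise from 0 to 1 is counted, including a single hot day. This is only a sketch of the idea on one pixel, with a plain "is the next day also 1" check standing in for the shifted-derivative test.

```python
import numpy as np

# Hypothetical one-pixel series: a single hot day, then a 3-day spell.
vin = np.array([0, 1, 0, 0, 1, 1, 1, 0])

d = np.diff(vin)             # +1 where a spell starts, -1 where it ends
rises = int(np.sum(d == 1))  # counts every start, including the 1-day spike

# Index of the first hot day of each spell (diff output is shifted by one).
starts = np.flatnonzero(d == 1) + 1

# Keep only the starts that are followed by another hot day (2+ day spells).
long_starts = [t for t in starts if t + 1 < len(vin) and vin[t + 1] == 1]

print(rises)             # 2
print(len(long_starts))  # 1
```

Note that a spell already under way on the first time step produces no rise in the diff, so it is not counted by either variant.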

In any case, the final solution then looks like this:

[figure: corrected.png]

Hope this helps!


4 Comments

Martin Over a year ago

Thank you, it is working. I just do not understand two steps: 1. dvin[dvin<0] = 0.e0; why do you delete the values lower than 0 when the data consist of 1 and 0? 2. If I want not just 2 but 3 consecutive days with value 1, do I need to add dvin_n3 = np.concatenate((np.zeros((3,ny,nx)),dvin_org[:-2,:,:]),axis=0); so that the final sum becomes dvin[np.where((dvin_n==1)&(dvin==1)&(dvin_n3==1))] ..... ?

Martin Over a year ago

Is there a way to specify the window of three consecutive days with value 1 using np.where?

msi_gerva Over a year ago

Glad that it works... I deleted the -1 in the simple solution as the derivative will go negative as soon as the temperature drops below the threshold. Regarding 3 days, I think you can somehow use the derivative with shift +2 index, but not sure yet how to implement... :)

Patrick May 15 at 8:18

This solution does not account for the discontinuity on the "time" axis in the netCDF file of the OP: 2026-07-31 is followed by 2027-07-01
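
For the two open points in these comments (a window of three or more days, and the jump from 31 July to the next 1 July), one possible NumPy-only sketch follows. It is not the answer's diff method: it combines shifted slices of the array instead. The function name `count_spells` and its `block_len` parameter are hypothetical, and it assumes every year contributes the same number of time steps (31 for July-only data).

```python
import numpy as np

def count_spells(vin, n_days=3, block_len=None):
    """Count runs of at least n_days consecutive ones along axis 0.

    block_len (hypothetical): if given, the time axis is cut into blocks of
    that many steps and runs may not cross a block boundary.
    """
    vin = np.asarray(vin).astype(bool)
    if block_len is not None:
        blocks = vin.reshape((-1, block_len) + vin.shape[1:])
        return sum(count_spells(b, n_days) for b in blocks)
    # full[t] is True when days t .. t + n_days - 1 are all ones.
    nt = max(vin.shape[0] - n_days + 1, 0)
    full = np.ones((nt,) + vin.shape[1:], dtype=bool)
    for k in range(n_days):
        full &= vin[k:k + nt]
    # Count each run once: a full window whose previous window is not full.
    first = full.copy()
    first[1:] &= ~full[:-1]
    return first.sum(axis=0)

# Hypothetical data: two Julys (62 steps) on a 1 x 2 grid.
vin = np.zeros((62, 1, 2), dtype=int)
vin[0:5, 0, 0] = 1     # 5-day spell in the first July
vin[30:33, 0, 0] = 1   # last day of July plus first two of the next July
vin[41:44, 0, 0] = 1   # 3-day spell in the second July
vin[5:7, 0, 1] = 1     # only a 2-day spell in the other pixel

print(count_spells(vin, 3))                # [[3 0]] (cross-year run included)
print(count_spells(vin, 3, block_len=31))  # [[2 0]]
```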
