I have a piece of code that is called many times while the application runs. It takes an array of numbers representing values (value_array). These should be summed per zone, where the zones are defined in zone_array. zone_ids is a list of all possible zones in zone_array.
It's basically something along the lines of: I have a population raster map and I want to know how many people live in each zone of the zone map.
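To make the goal concrete, here is a tiny sketch with made-up numbers. The 2x3 arrays are purely illustrative, not real data. Each cell of value_array is added to the total of the zone that the same cell has in zone_array, and NaN cells are ignored:

```python
import numpy as np

# Made-up 2x3 rasters, for illustration only.
zone_array = np.array([[0, 0, 1],
                       [2, 1, 1]])
value_array = np.array([[1.0, 2.0, 3.0],
                        [4.0, np.nan, 5.0]])
zone_ids = range(3)

# One masked sum per zone; np.nansum treats NaN as zero.
per_zone = np.array([np.nansum(value_array[zone_array == i]) for i in zone_ids])

# zone 0: 1 + 2, zone 1: 3 + 5 (NaN skipped), zone 2: 4
print(per_zone)
```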
The code:
values = np.zeros(len(zone_ids))
for i in zone_ids:
    values[i] = round(np.nansum(value_array[zone_array == i]), 2)
return values
The culprit seems to be the for loop, but I have not found a way to eliminate it and still get the same results.
I tried it with bincount, but I did not succeed. Using Numba's jit also has no effect.
I would like to stay away from Cython, as this code will be used in a QGIS plugin, which has no Cython support.
Test code:
import numpy as np


def fill_values(zone_array, value_array, zone_ids):
    # One NaN-aware sum per zone, rounded to two decimals
    values = np.zeros(len(zone_ids))
    for i in zone_ids:
        values[i] = round(np.nansum(value_array[zone_array == i]), 2)
    return values


def run():
    # 300 different zones
    zone_ids = range(300)
    # zone map with 300 zones
    zone_array = (np.random.rand(2000, 2000) * 300).astype(int)
    # value map from which we want the sum of values per zone (real map can have NaN values)
    value_array = (np.random.rand(2000, 2000) * 10.)
    # np.nan instead of np.NAN: the uppercase alias was removed in NumPy 2.0
    value_array[5, 5] = np.nan
    fill_values(zone_array, value_array, zone_ids)


if __name__ == '__main__':
    run()
Timing of the loop version above:
1.92 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With the bincount implementation suggested by Divakar:
203 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
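For reference, a bincount-based version might look roughly like the sketch below. This is my own reconstruction and not necessarily the exact code Divakar suggested. The name fill_values_bincount is hypothetical, and the sketch assumes the zone ids are the non-negative integers 0..len(zone_ids)-1, as in the test code. NaN values are replaced by zero before summing, which mirrors what np.nansum does.

```python
import numpy as np


def fill_values_bincount(zone_array, value_array, zone_ids):
    # Assumption: zone ids are the integers 0 .. len(zone_ids) - 1.
    # Replace NaN by 0 so those cells do not contribute to any sum.
    weights = np.where(np.isnan(value_array), 0.0, value_array).ravel()
    # bincount adds each weight to the bin given by its zone id.
    sums = np.bincount(zone_array.ravel(), weights=weights,
                       minlength=len(zone_ids))
    return np.round(sums, 2)


# Small made-up rasters, for illustration only.
zones = np.array([[0, 0, 1],
                  [2, 1, 1]])
vals = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 5.0]])
result = fill_values_bincount(zones, vals, range(3))
print(result)
```

Because bincount accumulates in a different order than np.nansum, the last floating-point digits of the unrounded sums can differ slightly from the loop version on large inputs.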
The expensive part is zone_array == i within the loop. All 2000x2000 = 4e6 values have to be checked for equality to i for each zone id i.
Because of zone_array == i, I focus on the loop. The best would be if I could somehow use zone_array == zone_ids and skip the loop.
Broadcasting is possible with zone_array[:, :, None] == zone_ids, but that still leaves indexing in the for loop and doesn't give much of an improvement in performance.
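To illustrate what the broadcast comparison does, here is a minimal sketch on small made-up arrays. It builds one boolean mask per zone in a single step and uses it for a loop-free sum. That last step is my own addition for illustration, not something proposed in the comments. The size estimate at the end assumes one byte per boolean, which is how NumPy stores bool arrays.

```python
import numpy as np

# Small made-up rasters, for illustration only.
zone_ids = np.arange(3)
zone_array = np.array([[0, 0, 1],
                       [2, 1, 1]])
value_array = np.array([[1.0, 2.0, 3.0],
                        [4.0, np.nan, 5.0]])

# Shape (rows, cols, n_zones): mask[r, c, k] is True where the cell is in zone k.
mask = zone_array[:, :, None] == zone_ids

# Zero out NaN, then sum the masked values over both raster axes.
clean = np.where(np.isnan(value_array), 0.0, value_array)
sums = (clean[:, :, None] * mask).sum(axis=(0, 1))
print(sums)

# For the full test case the mask alone would need this many bytes (about 1.2 GB).
full_size_bytes = 2000 * 2000 * 300
print(full_size_bytes)
```

The mask still performs the same 4e6 comparisons per zone, just all at once, so it trades the Python loop for a large temporary array rather than reducing the work.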