Here are some alternative clustering (binning) methods.
I assumed that the line `mini = min(float(s) for s in diff)` and the code that follows it were indented incorrectly in the question. If that indentation is what you actually intended, then it is itself your problem. If not, here are three implementations that might speed up the work:
Method #1:
Here is an implementation of the original code with redundancies removed and a generator expression for the inner loop:

```python
def method1():
    K = [[A[0, 0]]]
    for element in A.flatten():
        # Closest existing cluster: (distance to its first element, index)
        min_diff = min((abs(element - l[0]), i) for i, l in enumerate(K))
        if min_diff[0] < Theta:
            K[min_diff[1]].append(element)
        else:
            K.append([element])
```
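As a quick sanity check, here is the same logic run on a tiny hand-made array (hypothetical values, not the timing data) so you can see the clusters it forms:

```python
import numpy as np

Theta = 0.03
A = np.array([[0.10, 0.11], [0.50, 0.51]])

K = [[A[0, 0]]]
for element in A.flatten():
    min_diff = min((abs(element - l[0]), i) for i, l in enumerate(K))
    if min_diff[0] < Theta:
        K[min_diff[1]].append(element)
    else:
        K.append([element])

# Two clusters; note that A[0, 0] appears twice in the first one,
# because the seed element is visited again by the loop.
print([len(c) for c in K])  # [3, 2]
```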
Method #2:
This is a bit more complex: it keeps a sorted list of the clusters' first elements, and for each new element it first does an efficient binary search to find the two closest first elements, and then performs the distance/Theta comparison against only those two.
```python
import bisect

def method2():
    global A
    A = np.random.random(A.shape)
    data_iter = iter(A.flatten())
    K = [[next(data_iter)]]
    first_elements = [K[0][0]]   # sorted list of each cluster's first element
    for element in data_iter:
        x = bisect.bisect_left(first_elements, element)
        if x == 0:
            # below lowest value
            min_diff = abs(element - first_elements[0]), x
        elif x == len(first_elements):
            # above highest value
            min_diff = abs(element - first_elements[-1]), x - 1
        else:
            # between two first elements: only these two can be closest
            # (enumerate starts at x - 1 so the index refers back into K)
            min_diff = min((abs(element - l[0]), i)
                           for i, l in enumerate(K[x - 1:x + 1], x - 1))
        if min_diff[0] < Theta:
            K[min_diff[1]].append(element)
        else:
            first_elements.insert(x, element)
            K.insert(x, [element])
```
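The lookup at the heart of method2 can be illustrated in isolation (the values here are made up): `bisect.bisect_left` returns the insertion point into the sorted list, and only the neighbours on either side of that point can be the closest first element.

```python
import bisect

first_elements = [0.10, 0.40, 0.75]   # sorted first elements of three clusters
element = 0.50

x = bisect.bisect_left(first_elements, element)   # insertion point
candidates = first_elements[max(x - 1, 0):x + 1]  # at most two neighbours
nearest = min(candidates, key=lambda v: abs(element - v))
print(x, nearest)  # 2 0.4
```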
Method #3:
I was not quite sure why you use the first element found outside the Theta range to define a group's center, so I tried a small experiment. This sorts the data, then builds clusters around a center 90% of Theta above each cluster's minimum value. The results will therefore not match the other methods, but it is quite a bit faster.
```python
def method3():
    global A
    A = np.random.random(A.shape)
    K = []
    target = -1E39   # sentinel: forces the first element to start a cluster
    for element in np.sort(A.flatten()):
        if element - target > Theta:
            # start a new cluster centred 90% of Theta above its minimum
            target = element + Theta * 0.9
            K.append([])
        K[-1].append(element)
```
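On a small hand-made sample (again hypothetical values), the sort-and-sweep groups the data like this:

```python
import numpy as np

Theta = 0.03
data = np.array([0.11, 0.50, 0.10, 0.52, 0.51])  # deliberately unsorted

K = []
target = -1E39
for element in np.sort(data):
    if element - target > Theta:
        target = element + Theta * 0.9
        K.append([])
    K[-1].append(element)

print([len(c) for c in K])  # [2, 3]
```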
Timing Test Code:
```python
import numpy as np

A = np.random.random((100, 100))
# Threshold value
Theta = 0.03

from timeit import timeit

def runit(stmt):
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

runit('method0')
runit('method1')
runit('method2')
runit('method3')
```
Run Times
4 Methods:
0. Original
1. Original recast to more Pythonic
2. Use a search to minimize the number of comparisons
3. Sort the values, and then cluster
For the run times I had to decrease Theta; with the original Theta there are only a very small number of clusters in the given data. When the number of clusters is small, method #2 is not a lot faster, but as the number of clusters grows it becomes much faster than methods #0 and #1. Method #3 (the sort) is fastest by far.
method0: 5.641
method1: 4.430
method2: 0.836
method3: 0.057
(Seconds, small numbers are better)
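A rough comparison-count model (hypothetical cluster counts, nothing measured) illustrates why method #2 pulls further ahead as the number of clusters k grows: methods #0/#1 compare every element against every cluster, while method #2 does a binary search plus at most two comparisons per element.

```python
import math

n = 10_000                   # elements in a 100 x 100 array
for k in (10, 100, 1000):    # hypothetical cluster counts
    linear = n * k                        # methods #0/#1: n * k comparisons
    bisected = n * (math.log2(k) + 2)     # method #2: roughly n * (log2 k + 2)
    print(f"k={k}: linear={linear}, bisected={bisected:.0f}")
```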