Here are some alternative clustering (binning) methods.
I assumed that the line `mini = min(float(s) for s in diff)` and the code that follows it were indented incorrectly in the question. If that indentation is what you actually intended, then it is itself your problem. If not, here are three implementations that might speed up the work:
Method #1:
Here is an implementation of the original code with redundancies removed and a generator expression for the inner loop:

```python
def method1():
    K = [[A[0, 0]]]
    for element in A.flatten():
        # Closest existing cluster: (distance to its first element, index)
        min_diff = min((abs(element - l[0]), i) for i, l in enumerate(K))
        if min_diff[0] < Theta:
            K[min_diff[1]].append(element)
        else:
            K.append([element])
```
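As a quick sanity check, here is the same logic run on a tiny hand-made array (hypothetical values, not the timing data) so you can see the clusters it forms:

```python
import numpy as np

Theta = 0.03
A = np.array([[0.10, 0.11], [0.50, 0.51]])

K = [[A[0, 0]]]
for element in A.flatten():
    min_diff = min((abs(element - l[0]), i) for i, l in enumerate(K))
    if min_diff[0] < Theta:
        K[min_diff[1]].append(element)
    else:
        K.append([element])

# Two clusters; note that A[0, 0] appears twice in the first one,
# because the seed element is visited again by the loop.
print([len(c) for c in K])  # [3, 2]
```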
Method #2:
This is a bit more complex: it keeps a sorted list of the clusters' first elements, and for each new element it first does an efficient binary search to find the two closest first elements, and then performs the distance/Theta comparison against only those two.
```python
import bisect

def method2():
    global A
    A = np.random.random(A.shape)
    data_iter = iter(A.flatten())
    K = [[next(data_iter)]]
    first_elements = [K[0][0]]   # sorted list of each cluster's first element
    for element in data_iter:
        x = bisect.bisect_left(first_elements, element)
        if x == 0:
            # below lowest value
            min_diff = abs(element - first_elements[0]), x
        elif x == len(first_elements):
            # above highest value
            min_diff = abs(element - first_elements[-1]), x - 1
        else:
            # between two first elements: only these two can be closest
            # (enumerate starts at x - 1 so the index refers back into K)
            min_diff = min((abs(element - l[0]), i)
                           for i, l in enumerate(K[x - 1:x + 1], x - 1))
        if min_diff[0] < Theta:
            K[min_diff[1]].append(element)
        else:
            first_elements.insert(x, element)
            K.insert(x, [element])
```
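The lookup at the heart of method2 can be illustrated in isolation (the values here are made up): `bisect.bisect_left` returns the insertion point into the sorted list, and only the neighbours on either side of that point can be the closest first element.

```python
import bisect

first_elements = [0.10, 0.40, 0.75]   # sorted first elements of three clusters
element = 0.50

x = bisect.bisect_left(first_elements, element)   # insertion point
candidates = first_elements[max(x - 1, 0):x + 1]  # at most two neighbours
nearest = min(candidates, key=lambda v: abs(element - v))
print(x, nearest)  # 2 0.4
```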
Method #3:
I was not quite sure why you use the first element found outside the Theta range to define a group's center, so I tried a small experiment. This sorts the data, then builds clusters around a center 90% of Theta above each cluster's minimum value. The results will therefore not match the other methods, but it is quite a bit faster.
```python
def method3():
    global A
    A = np.random.random(A.shape)
    K = []
    target = -1E39   # sentinel: forces the first element to start a cluster
    for element in np.sort(A.flatten()):
        if element - target > Theta:
            # start a new cluster centred 90% of Theta above its minimum
            target = element + Theta * 0.9
            K.append([])
        K[-1].append(element)
```
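On a small hand-made sample (again hypothetical values), the sort-and-sweep groups the data like this:

```python
import numpy as np

Theta = 0.03
data = np.array([0.11, 0.50, 0.10, 0.52, 0.51])  # deliberately unsorted

K = []
target = -1E39
for element in np.sort(data):
    if element - target > Theta:
        target = element + Theta * 0.9
        K.append([])
    K[-1].append(element)

print([len(c) for c in K])  # [2, 3]
```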
Timing Test Code:
```python
import numpy as np

A = np.random.random((100, 100))
# Threshold value
Theta = 0.03

from timeit import timeit

def runit(stmt):
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

runit('method0')
runit('method1')
runit('method2')
runit('method3')
```
Run Times
4 Methods:
0. Original
1. Original recast to more Pythonic
2. Use a search to minimize the number of comparisons
3. Sort the values, and then cluster
For the run times I had to decrease Theta; with the original Theta there are only a very small number of clusters in the given data. When the number of clusters is small, method #2 is not a lot faster, but as the number of clusters grows it becomes much faster than methods #0 and #1. Method #3 (the sort) is fastest by far.
method0: 5.641
method1: 4.430
method2: 0.836
method3: 0.057
(Seconds, small numbers are better)
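A rough comparison-count model (hypothetical cluster counts, nothing measured) illustrates why method #2 pulls further ahead as the number of clusters k grows: methods #0/#1 compare every element against every cluster, while method #2 does a binary search plus at most two comparisons per element.

```python
import math

n = 10_000                   # elements in a 100 x 100 array
for k in (10, 100, 1000):    # hypothetical cluster counts
    linear = n * k                        # methods #0/#1: n * k comparisons
    bisected = n * (math.log2(k) + 2)     # method #2: roughly n * (log2 k + 2)
    print(f"k={k}: linear={linear}, bisected={bisected:.0f}")
```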