I have a function that implements the k-means algorithm, and I want to use it with DataFrames so that the index is preserved. For now I pass DataFrame.values, which works, but the output no longer carries the index labels.

import random

import numpy as np

def cluster_points(X, mu):
    clusters  = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
                    for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis = 0))
    return newmu

def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))


def find_centers(X, K):
    # Initialize to K random centers
    # random.sample needs a sequence, so turn the array into a list of rows
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return mu, clusters

For instance, with this minimal example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5)))
df.index.name = 'subscriber_id'
df.columns.name = 'ad_id'

I get:

find_centers(df.values, 2)
([array([ 3.8,  3. ,  3.6,  2. ,  3.6]),
  array([ 6.8,  3.6,  5.6,  6.8,  6.8])],
 {0: [array([2, 0, 5, 6, 4]),
   array([1, 1, 2, 3, 3]),
   array([6, 0, 4, 0, 3]),
   array([7, 9, 4, 1, 7]),
   array([3, 5, 3, 0, 1])],
  1: [array([6, 2, 5, 9, 6]),
   array([8, 9, 7, 2, 8]),
   array([7, 5, 3, 7, 8]),
   array([7, 1, 5, 7, 6]),
   array([6, 1, 8, 9, 6])]})

The output contains the row values but not their index labels.

1 Answer

If you want to get the array of values including the index, you can simply add the index to the columns with reset_index():

values_with_index = df.reset_index().values
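
As a rough illustration of what reset_index() does here, the sketch below uses a tiny made-up frame (the labels 10 and 11 and the column names a, b, c are assumptions for the example, not the data from the question). The former index ends up as the first column of the array, so it would take part in any distance computation unless it is sliced off first:

```python
import pandas as pd

# Illustrative data only; the index name mirrors the question
df = pd.DataFrame(
    [[2, 0, 5], [1, 1, 2]],
    index=pd.Index([10, 11], name="subscriber_id"),
    columns=["a", "b", "c"],
)

# reset_index() moves the index into an ordinary first column
values_with_index = df.reset_index().values

# Split the array back into labels and features before clustering
ids = values_with_index[:, 0]
features = values_with_index[:, 1:]
```

Passing values_with_index straight to find_centers would treat the labels as one more feature, which is usually not what you want.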

Update

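If what you want is to have the index in the output, but not use it during the actual clustering, you can do the following. First, pass the actual data frame object to find_centers (its initial centers then need to be drawn from the row values, for example with random.sample(list(X.values), K), because random.sample(X, K) does not sample the rows of a data frame):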

find_centers(df, 2)

Then change cluster_points as follows:

def cluster_points(X, mu):
    clusters  = {}
    for _, x in X.iterrows():
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        # You can replace this try/except block with
        # clusters.setdefault(bestmukey, []).append(x)
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

The centers in the output will still be arrays, but each cluster will contain Series objects, one per row. The name attribute of each Series is that row's index value in the data frame.
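
To make the last point concrete, here is a minimal sketch of reading the index labels back out of the clusters. It assumes a small made-up frame and fixed centers so the assignment is deterministic; cluster_labels is a hypothetical helper, not part of the original code, and cluster_points is a slightly condensed variant of the version above:

```python
import numpy as np
import pandas as pd


def cluster_points(X, mu):
    """Assign each row of the DataFrame X to its nearest center in mu."""
    clusters = {}
    for _, x in X.iterrows():
        # x is a Series; x.values drops the labels for the distance computation
        distances = [np.linalg.norm(x.values - center) for center in mu]
        best = int(np.argmin(distances))
        clusters.setdefault(best, []).append(x)
    return clusters


def cluster_labels(clusters):
    """Hypothetical helper: map each cluster key to the index labels of its rows."""
    return {key: [row.name for row in rows] for key, rows in clusters.items()}


# Illustrative data only; the labels 101..104 are made up
df = pd.DataFrame(
    [[0, 0], [1, 1], [9, 9], [10, 10]],
    index=pd.Index([101, 102, 103, 104], name="subscriber_id"),
    columns=["a", "b"],
)
# Fixed centers instead of random ones, so the result is reproducible
mu = [np.array([0.5, 0.5]), np.array([9.5, 9.5])]

clusters = cluster_points(df, mu)
labels = cluster_labels(clusters)
print(labels)
```

With the labels in hand, df.loc[labels[0]] would give back the rows of the first cluster as a data frame.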

2 Comments

The OP probably means after applying his function find_centers
@Marine1 You're probably right, I was confused by the "in order to take into account indexes" part, but that makes more sense... I've updated the answer.
