Getting the indexes of a Dataframe after a numpy array function

Question

I have a function that implements the k-means algorithm, and I want to use it with DataFrames so that the indexes are preserved. For the moment I pass DataFrame.values and it works, but the output does not include the indexes.

import random

import numpy as np

def cluster_points(X, mu):
    clusters  = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
                    for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis = 0))
    return newmu

def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))


def find_centers(X, K):
    # Initialize to K random centers. list(X) turns the rows into a
    # sequence, which random.sample requires on Python 3.
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return mu, clusters

For instance, with this minimal example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5)))
df.index.name = 'subscriber_id'
df.columns.name = 'ad_id'

I get:

find_centers(df.values, 2)
([array([ 3.8,  3. ,  3.6,  2. ,  3.6]),
  array([ 6.8,  3.6,  5.6,  6.8,  6.8])],
 {0: [array([2, 0, 5, 6, 4]),
   array([1, 1, 2, 3, 3]),
   array([6, 0, 4, 0, 3]),
   array([7, 9, 4, 1, 7]),
   array([3, 5, 3, 0, 1])],
  1: [array([6, 2, 5, 9, 6]),
   array([8, 9, 7, 2, 8]),
   array([7, 5, 3, 7, 8]),
   array([7, 1, 5, 7, 6]),
   array([6, 1, 8, 9, 6])]})

I have the values but don't have the indexes.

javidcf · Accepted Answer · 2017-06-29 17:07:21Z


If you want to get the array of values including the index, you can simply add the index to the columns with reset_index():

values_with_index = df.reset_index().values
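As a small illustration of what this produces (a sketch on a made-up three-row frame, not the data from the question): the index ends up as the first column of the array, so it would also enter any distance computation unless it is sliced off first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the index labels are chosen to be easy to spot.
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=[10, 20, 30])
df.index.name = 'subscriber_id'

values_with_index = df.reset_index().values

# Column 0 now holds the index labels; the remaining columns hold the data.
ids = values_with_index[:, 0]
features = values_with_index[:, 1:]
print(ids)
print(features)
```

Keeping ids and features separate lets the clustering run on features only, while ids can still be used to label the rows afterwards.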

Update

If what you want is to have the index on the output, but not use it during the actual clustering, you can do the following. First, pass the actual data frame object to find_centers:

find_centers(df, 2)

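Then change cluster_points as follows (find_centers also has to draw its initial centers from the rows, for example with random.sample(list(X.values), K), since random.sample does not accept a DataFrame on Python 3):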

def cluster_points(X, mu):
    clusters  = {}
    for _, x in X.iterrows():
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        # You can replace this try/except block with
        # clusters.setdefault(bestmukey, []).append(x)
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

The centers in the output will still be arrays, but the clusters will contain Series objects, one per row. The name attribute of each Series is the corresponding index value in the data frame.
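To make the last point concrete, here is a minimal sketch of reading the index values back from the clusters. The toy data, the fixed centers in mu, and the name index_by_cluster are assumptions made for the illustration; find_centers would pick the centers at random.

```python
import numpy as np
import pandas as pd

def cluster_points(X, mu):
    # Same logic as the DataFrame version of cluster_points: each x is a Series.
    clusters = {}
    for _, x in X.iterrows():
        bestmukey = min(((i, np.linalg.norm(x - m)) for i, m in enumerate(mu)),
                        key=lambda t: t[1])[0]
        clusters.setdefault(bestmukey, []).append(x)
    return clusters

# Hypothetical toy data with two well-separated groups.
df = pd.DataFrame([[0, 0], [1, 0], [9, 9], [10, 9]],
                  index=[101, 102, 103, 104])
df.index.name = 'subscriber_id'

# Fixed centers (an assumption for this sketch, so the result is deterministic).
mu = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]

clusters = cluster_points(df, mu)

# Each cluster member is a Series whose .name is its row label in df.
index_by_cluster = {k: [row.name for row in rows]
                    for k, rows in clusters.items()}
print(index_by_cluster)
```

With the labels collected this way, df.loc[index_by_cluster[0]] returns the original rows of the first cluster as a DataFrame.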

edited Jun 29, 2017 at 17:07

answered Jun 29, 2017 at 15:37


2 Comments

Revolucion for Monica Over a year ago

The OP probably means after applying his function find_centers

javidcf Over a year ago

@Marine1 You're probably right, I was confused by the "in order to take into account indexes" part, but that makes more sense... I've updated the answer.
