How to implement k-means algorithm on string data

Question

I am trying to implement the k-means algorithm on the data set below. It's straightforward to calculate the distance between two numeric attributes, but how do I calculate the distance between two strings? And how do I combine all the distances (i.e. the distance between string attributes and the distance between numeric attributes) into one value? Please advise. Thank you.

Has QUIT--Anony-Mousse · Accepted Answer · 2012-11-23 08:43:32Z

K-means is designed for Euclidean distance. You cannot just plug in arbitrary other distance functions. This may cause k-means to no longer converge.

The required property is that the mean must minimize the variances. If you cannot guarantee this property (and what is the mean of a string anyway?) then you lose guaranteed convergence.

Technically, k-means isn't even based on Euclidean distance, but it minimizes variances, which happen to be the same as squared Euclidean distances; and if you minimize these squares, you also minimize Euclidean distance. But what the algorithm really aims at minimizing is Var(Attribute 1, Cluster 1) + Var(Attribute 2, Cluster 1) + ... + Var(Attribute n, Cluster k).
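To make the variance argument concrete, here is a minimal sketch for a single numeric attribute within one cluster. The values are made up for illustration; the point is only that the arithmetic mean gives a smaller sum of squared deviations than any other candidate center tried here, which is the property k-means relies on.

```python
def sum_of_squares(values, center):
    """Sum of squared deviations of `values` from `center`."""
    return sum((v - center) ** 2 for v in values)


# Made-up values of one attribute for the points in one cluster.
values = [1.0, 3.0, 5.0, 11.0]
mean = sum(values) / len(values)

# Sum of squares when the cluster center is the mean.
sse_at_mean = sum_of_squares(values, mean)

# Sum of squares for a few other candidate centers (arbitrary choices).
sse_elsewhere = {c: sum_of_squares(values, c) for c in (3.0, 4.0, 6.0, 7.5)}

# The population variance is the sum of squares at the mean divided by
# the cluster size, so minimizing one minimizes the other.
variance = sse_at_mean / len(values)

print("mean:", mean, "sum of squares at mean:", sse_at_mean)
for center, sse in sse_elsewhere.items():
    print("center:", center, "sum of squares:", sse)
```

For strings there is no obvious analogue of `mean`, which is why this guarantee does not carry over.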

You might want to look into k-medoids, which uses a medoid (an actual data point) instead of the mean. This avoids the need to compute a mean and, as far as I know, can give convergence guarantees for arbitrary distances.
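As a rough illustration of the medoid idea, here is a simplified sketch that alternates between assigning items to the nearest medoid and re-picking each cluster's medoid. It is not the full PAM algorithm. The function names, the naive "first k items" initialization, the made-up string codes, and the position-wise mismatch count used as a stand-in distance are all assumptions for the example.

```python
def mismatch_count(a, b):
    """Stand-in distance: differing positions, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def k_medoids(items, k, dist, max_iter=100):
    """Simplified alternating k-medoids; works with any distance function."""
    medoids = list(items[:k])  # naive initialization (assumption)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its nearest medoid.
        clusters = [[] for _ in range(k)]
        for item in items:
            nearest = min(range(k), key=lambda i: dist(item, medoids[i]))
            clusters[nearest].append(item)

        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:
                new_medoids.append(old)  # keep the old medoid if a cluster is empty
                continue
            best = min(members, key=lambda c: sum(dist(c, m) for m in members))
            new_medoids.append(best)

        if new_medoids == medoids:
            break  # medoids stopped changing
        medoids = new_medoids
    return medoids, clusters


# Made-up string codes, purely for illustration.
codes = ["v42", "v45", "v47", "p11", "p12", "p14"]
medoids, clusters = k_medoids(codes, k=2, dist=mismatch_count)
print(medoids)
print(clusters)
```

Because the cluster representative is always one of the data points, no "mean of strings" is ever needed; only pairwise distances are used.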

Alternatively, you might want to look into truly distance-based algorithms, including the various density-based clustering algorithms, which are usually distance-based as well.

stackoverflowuser2010 · Accepted Answer · 2012-11-22 19:16:55Z

To calculate the distance between strings, you can use the Levenshtein distance (aka edit distance).
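For reference, a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence, keeping only two rows of the table. The function name `levenshtein` is just a label chosen for this example, not a library call.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("v42", "v45"))
```

Note that this treats the values purely as character sequences; whether that is meaningful depends on what the strings actually encode.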

To normalize the values between the string and numeric attributes, you can try to state the attributes as percentages: find the min and max value of each type of attribute, and then, for a given data instance, calculate its percentage within the respective range.
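One possible way to put this together is sketched below. Everything specific here is an assumption for illustration: the two-field record layout, the numeric range of 0 to 373819, the unweighted sum of the two parts, and the exact-match stand-in used for the string part (any string distance already scaled to [0, 1] could be passed instead).

```python
def min_max_scale(value, lo, hi):
    """Position of `value` within [lo, hi], as a fraction between 0 and 1."""
    if hi == lo:
        return 0.0  # degenerate range: every value maps to the same point
    return (value - lo) / (hi - lo)


def mixed_distance(rec_a, rec_b, numeric_range, string_dist):
    """Combine a scaled numeric difference with a string distance in [0, 1].

    Each record is assumed to be a (code, number) pair. The two parts are
    simply added, giving both attributes equal weight (an assumption).
    """
    code_a, num_a = rec_a
    code_b, num_b = rec_b
    lo, hi = numeric_range
    numeric_part = abs(min_max_scale(num_a, lo, hi) - min_max_scale(num_b, lo, hi))
    string_part = string_dist(code_a, code_b)
    return numeric_part + string_part


def exact_match(a, b):
    """Stand-in string distance: 0 if the codes are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


# Hypothetical records and an assumed range for the numeric attribute.
record_a = ("v42", 51097)
record_b = ("v45", 260)
distance = mixed_distance(record_a, record_b, (0, 373819), exact_match)
print(distance)
```

Scaling each part to the same range keeps one attribute from dominating the total just because its raw values are larger.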

7 Comments

Teja Over a year ago

Can you give me an example? For my data, if you take the first two records, v42 and v45 have a distance of 1, and p11 and p11 have a distance of zero. 51097 and 260 have a distance of |51097-260|, and the same goes for the other attributes. Now how can I combine all the distances to get a final distance?

Tom Morris Over a year ago

You aren't normalizing as they suggested. If the range of the 4th field is 0-373819, your difference is |(51097/373819)-(260/373819)|. I'd also question whether you're asking the right question. Are p* and v* really strings or coded representations of some value? The distance between p14 and p12 could be 1 (the edit distance), 2 (the difference of the numeric values) or some entirely different number which is the difference between the values those two IDs represent.

Nikos Tsirakis Over a year ago

Of course you must use the same distance function for all of your data. I believe that in this case you need the "Levenshtein distance", as @stackoverflowuser2010 said. There is also the Hamming distance, but it works only on strings of the same length.

Adam Over a year ago

This answer makes no sense. Please explain how on earth you would calculate the mean of strings when assigning the centroids at each iteration of the algorithm.

stackoverflowuser2010 Over a year ago

@codesparkle: Where in my answer is the word "mean" mentioned? I looked through your commenting history; you should stick to XML, .net, and other topics in your area of "expertise".
