2

I am trying to implement the K-means algorithm on the data set below. It's straightforward to calculate the distance between two numeric attributes, but how do I calculate the distance between two strings? And how do I sum up all the distances (i.e. the distance between string attributes and the distance between numeric attributes)? Please advise. Thank you.

2 Answers

6

K-means is designed for Euclidean distance. You cannot just plug in arbitrary other distance functions. This may cause k-means to no longer converge.

The required property is that the mean must minimize the variances. If you cannot guarantee this property (and what is the mean of a string anyway?) then you lose guaranteed convergence.

Technically, k-means isn't even based on Euclidean distance, but it minimizes variances, which happen to be the same as squared Euclidean distances; and if you minimize these squares, you also minimize Euclidean distance. But what the algorithm really aims at minimizing is Var(Attribute 1, Cluster 1) + Var(Attribute 2, Cluster 1) + ... + Var(Attribute n, Cluster k).
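As a small numeric check of the statement above, the following sketch compares the sum of squared Euclidean distances to the cluster mean with the sum of per-attribute variances (scaled by the cluster size). The function names and the toy cluster are made up for illustration; population variance is assumed.

```python
from statistics import fmean, pvariance

def within_cluster_ss(points):
    """Sum of squared Euclidean distances from each point to the cluster mean."""
    centroid = [fmean(column) for column in zip(*points)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroid))
        for point in points
    )

def summed_variances(points):
    """n * (Var(attribute 1) + ... + Var(attribute d)), population variance."""
    return len(points) * sum(pvariance(column) for column in zip(*points))

# Hypothetical toy cluster with two numeric attributes.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(within_cluster_ss(cluster), summed_variances(cluster))
```

Both quantities agree up to floating-point rounding, which is why minimizing the variances and minimizing squared Euclidean distances to the mean are the same objective.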

You might want to look into k-medoids, which uses a medoid (an actual data point) instead of the mean. This avoids the need to compute a mean and, as far as I know, still gives convergence guarantees for arbitrary distances.
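A minimal sketch of a medoid-based clustering loop (usually called k-medoids) is shown here, assuming a simple alternating scheme rather than the full PAM swap procedure. The function name `k_medoids`, the naive "first k items" initialization, and the toy data are illustrative assumptions. The key point is that only a distance function `dist` is needed, never a mean.

```python
def assign(items, medoids, dist):
    """Group every item with its nearest medoid."""
    clusters = [[] for _ in medoids]
    for item in items:
        nearest = min(range(len(medoids)), key=lambda i: dist(item, medoids[i]))
        clusters[nearest].append(item)
    return clusters

def k_medoids(items, k, dist, max_iter=100):
    medoids = list(items[:k])  # naive initialization (an assumption of this sketch)
    for _ in range(max_iter):
        clusters = assign(items, medoids, dist)
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:
                new_medoids.append(old)  # keep the old medoid for an empty cluster
                continue
            # The medoid is the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda c: sum(dist(c, m) for m in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(items, medoids, dist)

# Toy usage with absolute difference as the distance.
medoids, clusters = k_medoids([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))
print(medoids, clusters)
```

Because each medoid is always one of the data points, `dist` can be any function, including one that mixes string and numeric attributes.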

Alternatively, you might want to look into truly distance-based algorithms, including the various density-based clustering algorithms, which are usually distance-based as well.


0

To calculate the distance between strings, you can use the Levenshtein distance (aka edit distance).
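For reference, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence with two rows. The function name `levenshtein` is just an illustrative choice; unit cost for insertions, deletions, and substitutions is assumed.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("v42", "v45"))  # 1
print(levenshtein("p11", "p11"))  # 0
```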

To normalize the values across the string and numeric attributes, you can try to express the attributes as percentages: find the min and max value of each type of attribute, and then, for a given data instance, calculate its percentage within the respective range.
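One way this could look in code is sketched below. The helper names, the choice to simply add the per-attribute distances, and the placeholder string distance (0 for equal, 1 for different) are all assumptions of this sketch; any string distance scaled to the range 0 to 1, such as the edit distance divided by the longer string's length, could be passed in instead. The sample values and the 0 to 373819 range are the ones quoted in the comments.

```python
def scaled(value, lo, hi):
    """Position of value within [lo, hi] as a fraction between 0 and 1."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def mixed_distance(a, b, numeric_ranges, string_dist):
    """Sum of per-attribute distances between two records.

    numeric_ranges maps the index of each numeric attribute to its (min, max);
    every other attribute is treated as a string and compared with string_dist.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            lo, hi = numeric_ranges[i]
            total += abs(scaled(x, lo, hi) - scaled(y, lo, hi))
        else:
            total += string_dist(x, y)
    return total

# Placeholder string distance: 0 if equal, 1 otherwise (an assumption).
mismatch = lambda s, t: 0.0 if s == t else 1.0

record_a = ("v42", "p11", 51097)
record_b = ("v45", "p11", 260)
d = mixed_distance(record_a, record_b, {2: (0, 373819)}, mismatch)
print(d)
```

Scaling every attribute to a comparable range keeps a single large numeric field from dominating the total.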

7 Comments

Can you give me an example? For my data, if you take the first two records, v42 and v45 have a distance of 1, and p11 and p11 have a distance of zero. 51097 and 260 have a distance of |51097-260|, and the same goes for the other attributes. Now how can I combine all the distances to get a final distance?
You aren't normalizing as they suggested. If the range of the 4th field is 0-373819, your difference is |(51097/373819)-(260/373819)|. I'd also question whether you're asking the right question. Are p* and v* really strings or coded representations of some value? The distance between p14 and p12 could be 1 (the edit distance), 2 (the difference of the numeric values) or some entirely different number which is the difference between the values those two IDs represent.
Of course you must use the same distance function for all your data. I believe in this case you need the "Levenshtein distance", as @stackoverflowuser2010 said. There is also the Hamming distance, but it only works with strings of the same length.
This answer makes no sense. Please explain how on earth you would calculate the mean of strings when assigning the centroids at each iteration of the algorithm.
@codesparkle: Where in my answer is the word "mean" mentioned? I looked through your commenting history; you should stick to XML, .net, and other topics in your area of "expertise".