I am trying to implement the k-means algorithm on the data set below. It's straightforward to calculate the distance between any two numeric attributes, but how do I calculate the distance between two strings? Also, how do I combine all the distances (i.e. the distance between string attributes and the distance between numeric attributes)? Please advise. Thank you.
2 Answers
K-means is designed for Euclidean distance. You cannot just plug in arbitrary other distance functions. This may cause k-means to no longer converge.
The required property is that the mean must minimize the variances. If you cannot guarantee this property (and what is the mean of a string anyway?) then you lose guaranteed convergence.
Technically, k-means isn't even based on Euclidean distance, but it minimizes variances, which happen to be the same as squared Euclidean distances; and if you minimize these squares, you also minimize Euclidean distance. But what the algorithm really aims at minimizing is Var(Attribute 1, Cluster 1) + Var(Attribute 2, Cluster 1) + ... + Var(Attribute n, Cluster k).
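To make that objective concrete, here is a minimal sketch of the quantity being minimized, written as the sum of squared deviations from each cluster's per-attribute mean (that is, the per-attribute variances weighted by cluster size). The function name `within_cluster_ss` and the tuple-based data layout are assumptions made for illustration only.

```python
def within_cluster_ss(points, labels, k):
    """Sum, over all clusters and attributes, of squared deviations
    from the cluster mean. Points are equal-length numeric tuples."""
    total = 0.0
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        # Per-attribute mean of this cluster.
        means = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Squared deviation of every member from the mean, per attribute.
        total += sum((p[j] - means[j]) ** 2 for p in members for j in range(dim))
    return total


points = [(0, 0), (2, 0), (10, 10), (10, 12)]
labels = [0, 0, 1, 1]
print(within_cluster_ss(points, labels, 2))
```

Note that the mean appears explicitly in the computation, which is why an attribute without a meaningful mean (such as a string) does not fit this objective.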
You might want to look into k-medoids (PAM), which uses a medoid (an actual data point from the cluster) instead of the mean. This avoids the need to compute a mean and, as far as I know, gives convergence guarantees for arbitrary distances.
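The medoid idea can be sketched roughly as follows. This is a simple alternating assign/update loop, not the full PAM swap procedure, and the function name `k_medoids`, the naive "first k items" initialization, and the `max_iter` parameter are all assumptions for illustration. The `dist` argument can be any distance function, including one defined on strings or mixed records.

```python
def k_medoids(items, k, dist, max_iter=100):
    """Cluster items around k medoids using an arbitrary distance function.
    Returns (medoid indices, cluster label per item)."""

    def assign(medoids):
        # Each item goes to the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist(x, items[medoids[c]]))
                for x in items]

    medoids = list(range(k))  # naive initialization: first k items (assumption)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(members,
                       key=lambda i: sum(dist(items[i], items[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # no medoid changed, so the assignment is stable
        medoids = new_medoids
    return medoids, assign(medoids)


values = [1, 2, 3, 10, 11, 12]
medoids, labels = k_medoids(values, 2, lambda a, b: abs(a - b))
print([values[m] for m in medoids], labels)
```

Because every medoid is an existing data point, only pairwise distances are needed; no mean is ever computed.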
Alternatively, you might want to look into truly distance-based algorithms, including the various density-based clustering algorithms, which are usually distance-based as well.
Comments
To calculate the distance between strings, you can use the Levenshtein distance (also known as edit distance).
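For reference, a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence, with unit cost for insertions, deletions, and substitutions. The function name `levenshtein` is just an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of `b`.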
To normalize the values between the string and numeric attributes, you can try to state the attributes as percentages: find the min and max value of each type of attribute, and then, for a given data instance, calculate its percentage within the respective range.
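One possible reading of this suggestion, as a sketch: rescale each numeric attribute to the range 0 to 1 with min-max scaling, divide the string distance by the largest string distance observed, and then add the two parts. The names `min_max_scale` and `combined_distance`, the `str_dist` and `max_str_dist` parameters, and the choice of a plain unweighted sum are all assumptions, not something the answer prescribes.

```python
def min_max_scale(values):
    """Rescale numeric values to [0, 1] based on their min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - lo) / (hi - lo) for v in values]


def combined_distance(num_a, num_b, str_a, str_b, str_dist, max_str_dist):
    """Sum of a numeric part and a string part, each scaled to [0, 1].
    num_a and num_b are assumed to be min-max scaled already;
    str_dist is any string distance (for example an edit distance)."""
    num_part = abs(num_a - num_b)
    str_part = str_dist(str_a, str_b) / max_str_dist if max_str_dist else 0.0
    return num_part + str_part


ages = min_max_scale([10, 20, 30])


def length_gap(s, t):
    # Stand-in string distance for the demo only.
    return abs(len(s) - len(t))


print(combined_distance(ages[0], ages[1], "ab", "abcd", length_gap, 4))
```

With both parts on the same 0 to 1 scale, neither attribute type dominates the total simply because of its units.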