17

I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards it. I have implemented everything, but I would like to add a feature that gives me the "number of buckets", i.e. the k most representative structures in the population at each generation.

I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.

thanks!

0

4 Answers

15

One problem you would face if using scipy.cluster.vq.kmeans is that that function uses Euclidean distance to measure closeness. To shoehorn your problem into one solvable by k-means clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.

That seems... difficult. Perhaps you are looking for Levenshtein distance instead?

Note there are variants of the k-means algorithm that can work with non-Euclidean distance metrics (such as Levenshtein distance). K-medoids (aka PAM), for instance, can be applied to data with an arbitrary distance metric.

For example, using Pycluster's implementation of k-medoids, and nltk's implementation of Levenshtein distance,

import nltk.metrics.distance as distance
import Pycluster as PC

words = ['apple', 'Doppler', 'applaud', 'append', 'barker', 
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

# Pairwise Levenshtein distances in the lower-triangular order that
# Pycluster's kmedoids expects.
dist = [distance.edit_distance(words[i], words[j])
        for i in range(1, len(words))
        for j in range(0, i)]

# labels[i] is the index of the medoid word assigned to words[i].
labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
    cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
    print(grp)

yields a result like

['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']


10

K-means only works with Euclidean distance. Edit distances such as Levenshtein may obey the triangle inequality, but are not Euclidean. For the sorts of metrics you're interested in, you're better off using a different sort of algorithm, such as hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
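
For instance, a minimal sketch of hierarchical clustering over a precomputed distance matrix with scipy, using nltk's edit_distance as a stand-in for whatever structural distance you prefer (the dot-bracket strings below are made up):

from nltk.metrics.distance import edit_distance
from scipy.cluster.hierarchy import linkage, fcluster

structures = ["(((...)))", "((.....))", "(((....)))", "..((..))..", ".((....))"]

# Condensed pairwise distance vector, in the same order scipy's pdist uses.
n = len(structures)
dists = [edit_distance(structures[i], structures[j])
         for i in range(n) for j in range(i + 1, n)]

tree = linkage(dists, method="average")             # average-linkage dendrogram
labels = fcluster(tree, t=3, criterion="maxclust")  # cut into at most 3 clusters
for s, label in zip(structures, labels):
    print(label, s)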

Alternately, just convert your list of RNA into a weighted graph, with Levenshtein weights at the edges, and then decompose it into a minimum spanning tree. The most connected nodes of that tree will be, in a sense, the "most representative".
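
A rough sketch of that spanning-tree idea, assuming networkx is available (again with placeholder structures and edit distance standing in for your structural distance); node degree in the tree serves as a crude "representativeness" score:

import networkx as nx
from nltk.metrics.distance import edit_distance

structures = ["(((...)))", "((.....))", "(((....)))", "..((..))..", ".((....))"]

# Complete graph with Levenshtein weights on the edges.
G = nx.Graph()
for i in range(len(structures)):
    for j in range(i + 1, len(structures)):
        G.add_edge(structures[i], structures[j],
                   weight=edit_distance(structures[i], structures[j]))

mst = nx.minimum_spanning_tree(G)
# The most connected nodes of the tree are, loosely, the most representative.
for node, degree in sorted(mst.degree, key=lambda x: -x[1]):
    print(degree, node)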

2 Comments

Thanks, fixed! Embarrassingly, the author of the blog is a friend of mine :-)
2

K-means doesn't really care about the type of the data involved. All you need for K-means is some way to measure a "distance" from one item to another. It'll do its thing based on the distances, regardless of how they happen to be computed from the underlying data.

That said, I haven't used scipy.cluster.vq, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.
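
For what it's worth, scipy.cluster.vq only accepts numeric arrays, so the structures would need some numeric encoding before its (Euclidean) k-means can run on them. A toy sketch, assuming a naive per-position encoding that is purely illustrative:

import numpy as np
from scipy.cluster.vq import kmeans2

# Equal-length toy structures; '(' -> 1, '.' -> 0, ')' -> -1 (an assumption,
# not a recommendation -- Euclidean distance on this encoding is crude).
structures = ["(((....)))", "((((..))))", "((......))", ".((....))."]
enc = {"(": 1.0, ".": 0.0, ")": -1.0}
data = np.array([[enc[c] for c in s] for s in structures])

centroids, labels = kmeans2(data, 2, minit="points")
print(labels)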

7 Comments

This answer doesn't make any sense. What is the "distance" between two strings of RNA such that it A) obeys the triangle inequality and B) is euclidean? There are many clustering algorithms, and it seems beyond me how k-means in particular would be useful in this circumstance.
The distance I am using is the structural distance. For example, structures (1) "(((....)))" and (2) "((((..))))" have a distance of 1, since the only difference is an insertion.
Jerry, can you please explain how this can possibly work? As @sclv mentioned in his answer, K-means only works with Euclidean distance. It seems impossible to apply it to strings since at each step, you need to shift the centroids to an absolute position representing the mean of the closest data points... For arbitrary distance metrics, it seems that K-medoids would work instead since it uses data points as centroids instead.
@codesparkle: I didn't try to tell him that his measure of distance would work--I simply pointed out (at least part of) the requirement for K-means to work. There definitely are measures of differences between strings that fit those requirements (e.g., cosine similarity). Oh, and sclv's claim isn't entirely correct either: cosine similarity isn't strictly a Euclidean measure, but does just fine. Multiple definitions of centroid are accepted wrt k-means (e.g., 3), with differing requirements on the measure used.
This answer is just wrong. K-means also needs to compute means, and that requires floats, and requires squared Euclidean or Bregman divergences as "distance".
0

What you need for k-means is a 'distance' measure: numbers representing a vector, so that the algorithm can compute distances between vectors and cluster them around centroids. Following are some examples:

  • Let's say you've got strings that represent dates like 2019-06-27 15:52:41.623Z. In that case, pick a reference date, say the start of the UTC epoch. With that starting date and time as the reference, you can calculate the 'distance' to each date string.

  • Suppose instead you have code strings, if(a == b) vs. if(a == c); then you might want a different 'distance', such as the number of characters that differ between the strings.

  • Or if you have HTML DOM structure, <html></html> vs <html><head></head></html>, you might not want to count characters but rather how many tags differ, and use that as your 'distance'.

  • Or for a known enum in a database, you could map each key to a number, with your own idea of the 'distance' between the enums. For example, defining 'male', 'female', 'neutral' as the vectors [0], [1], [2] would imply that neutral is closer to female than to male, so you might instead want [0], [2], [1] or [-1], [1], [0].

  • For the RNA/DNA structure asked about in the question, the 'distance' could be how many base pairs differ between the strands.

I hope you get the idea. Consider what the content of your strings is and think of the best way to define the 'distance' between them. A simple character-diff distance could work as a generic measure between strings (a minimal sketch follows below), but the better your distance fits your data, the better the algorithm will work.
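
For instance, a minimal sketch of that character-diff idea for equal-length dot-bracket strings (a Hamming-style count of mismatching positions; structures of different lengths would need an edit distance instead):

def structure_distance(a, b):
    """Count positions at which two equal-length dot-bracket strings differ."""
    if len(a) != len(b):
        raise ValueError("use an edit distance for structures of different length")
    return sum(x != y for x, y in zip(a, b))

print(structure_distance("(((...)))", "((.....))"))  # -> 2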

