17

I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards it. I have implemented everything, but I would like to add a feature that gives me the "number of buckets", i.e. the k most representative structures in the population at each generation.

I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.

thanks!

0

4 Answers

15

One problem you would face if using scipy.cluster.vq.kmeans is that that function uses Euclidean distance to measure closeness. To shoehorn your problem into one solvable by k-means clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.

That seems... difficult. Perhaps you are looking for Levenshtein distance instead?

Note there are variants of the k-means algorithm that can work with non-Euclidean distance metrics (such as Levenshtein distance). K-medoids (aka PAM), for instance, can be applied to data with an arbitrary distance metric.

For example, using Pycluster's implementation of k-medoids, and nltk's implementation of Levenshtein distance,

import nltk.metrics.distance as distance
import Pycluster as PC

words = ['apple', 'Doppler', 'applaud', 'append', 'barker', 
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

# Pairwise Levenshtein distances in the lower-triangular order that
# Pycluster's kmedoids expects.
dist = [distance.edit_distance(words[i], words[j])
        for i in range(1, len(words))
        for j in range(0, i)]

# labels[i] is the index of the medoid word assigned to words[i].
labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
    cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
    print(grp)

yields a result like

['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']


10

K-means only works with Euclidean distance. Edit distances such as Levenshtein may obey the triangle inequality, but are not Euclidean. For the sorts of metrics you're interested in, you're better off using a different sort of algorithm, such as hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
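
For instance, a minimal sketch of hierarchical clustering over a precomputed distance matrix with scipy, using nltk's edit_distance as a stand-in for whatever structural distance you prefer (the dot-bracket strings below are made up):

from nltk.metrics.distance import edit_distance
from scipy.cluster.hierarchy import linkage, fcluster

structures = ["(((...)))", "((.....))", "(((....)))", "..((..))..", ".((....))"]

# Condensed pairwise distance vector, in the same order scipy's pdist uses.
n = len(structures)
dists = [edit_distance(structures[i], structures[j])
         for i in range(n) for j in range(i + 1, n)]

tree = linkage(dists, method="average")             # average-linkage dendrogram
labels = fcluster(tree, t=3, criterion="maxclust")  # cut into at most 3 clusters
for s, label in zip(structures, labels):
    print(label, s)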

Alternately, just convert your list of RNA into a weighted graph, with Levenshtein weights at the edges, and then decompose it into a minimum spanning tree. The most connected nodes of that tree will be, in a sense, the "most representative".
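
A rough sketch of that spanning-tree idea, assuming networkx is available (again with placeholder structures and edit distance standing in for your structural distance); node degree in the tree serves as a crude "representativeness" score:

import networkx as nx
from nltk.metrics.distance import edit_distance

structures = ["(((...)))", "((.....))", "(((....)))", "..((..))..", ".((....))"]

# Complete graph with Levenshtein weights on the edges.
G = nx.Graph()
for i in range(len(structures)):
    for j in range(i + 1, len(structures)):
        G.add_edge(structures[i], structures[j],
                   weight=edit_distance(structures[i], structures[j]))

mst = nx.minimum_spanning_tree(G)
# The most connected nodes of the tree are, loosely, the most representative.
for node, degree in sorted(mst.degree, key=lambda x: -x[1]):
    print(degree, node)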

2 Comments

Thanks, fixed! Embarrassingly, the author of the blog is a friend of mine :-)
2

K-means doesn't really care about the type of the data involved. All you need for K-means is some way to measure a "distance" from one item to another. It'll do its thing based on the distances, regardless of how they happen to be computed from the underlying data.

That said, I haven't used scipy.cluster.vq, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.
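
For what it's worth, scipy.cluster.vq only accepts numeric arrays, so the structures would need some numeric encoding before its (Euclidean) k-means can run on them. A toy sketch, assuming a naive per-position encoding that is purely illustrative:

import numpy as np
from scipy.cluster.vq import kmeans2

# Equal-length toy structures; '(' -> 1, '.' -> 0, ')' -> -1 (an assumption,
# not a recommendation -- Euclidean distance on this encoding is crude).
structures = ["(((....)))", "((((..))))", "((......))", ".((....))."]
enc = {"(": 1.0, ".": 0.0, ")": -1.0}
data = np.array([[enc[c] for c in s] for s in structures])

centroids, labels = kmeans2(data, 2, minit="points")
print(labels)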

7 Comments

This answer doesn't make any sense. What is the "distance" between two strings of RNA such that it A) obeys the triangle inequality and B) is euclidean? There are many clustering algorithms, and it seems beyond me how k-means in particular would be useful in this circumstance.
The distance I am using is the structural distance. For example, structures (1) "(((....)))" and (2) "((((..))))" have a distance of 1, since the only difference is an insertion.
Jerry, can you please explain how this can possibly work? As @sclv mentioned in his answer, K-means only works with Euclidean distance. It seems impossible to apply it to strings since at each step, you need to shift the centroids to an absolute position representing the mean of the closest data points... For arbitrary distance metrics, it seems that K-medoids would work instead since it uses data points as centroids instead.
@codesparkle: I didn't try to tell him that his measure of distance would work--I simply pointed out (at least part of) the requirement for K-means to work. There definitely are measures of differences between strings that fit those requirements (e.g., cosine similarity). Oh, and sclv's claim isn't entirely correct either: cosine similarity isn't strictly a Euclidean measure, but does just fine. Multiple definitions of centroid are accepted wrt k-means (e.g., 3), with differing requirements on the measure used.
This answer is just wrong. K-means also needs to compute means, and that requires floats, and requires squared Euclidean or Bregman divergences as "distance".
0

What you need for k-means is a 'distance' measure: numbers representing a vector, so that the algorithm can compute distances between vectors and cluster them around centroids. Following are some examples:

  • Let's say you've got strings that represent dates like 2019-06-27 15:52:41.623Z. In that case, pick a reference date, say the start of the UTC epoch. With that starting date and time as the reference, you can calculate the 'distance' to each date string.

  • Suppose instead you have code strings, if(a == b) vs. if(a == c); then you might want a different 'distance', such as the number of characters that differ between the strings.

  • Or if you have HTML DOM structure, <html></html> vs <html><head></head></html>, you might not want to count characters but rather how many tags differ, and use that as your 'distance'.

  • Or for a known enum in a database, you could map each key to a number, with your own idea of the 'distance' between the enums. For example, defining 'male', 'female', 'neutral' as the vectors [0], [1], [2] would imply that neutral is closer to female than to male, so you might instead want [0], [2], [1] or [-1], [1], [0].

  • For the RNA/DNA structure asked about in the question, the 'distance' could be how many base pairs differ between the strands.

I hope you get the idea. Consider what the content of your strings is and think of the best way to define the 'distance' between them. A simple character-diff distance could work as a generic measure between strings (a minimal sketch follows below), but the better your distance fits your data, the better the algorithm will work.
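
For instance, a minimal sketch of that character-diff idea for equal-length dot-bracket strings (a Hamming-style count of mismatching positions; structures of different lengths would need an edit distance instead):

def structure_distance(a, b):
    """Count positions at which two equal-length dot-bracket strings differ."""
    if len(a) != len(b):
        raise ValueError("use an edit distance for structures of different length")
    return sum(x != y for x, y in zip(a, b))

print(structure_distance("(((...)))", "((.....))"))  # -> 2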

