Clustering string data with ELKI

Question

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I

(a) load string data in ELKI from a file (only "Labels")?

(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)

Some code snippets or example input files would be helpful.

SO is dead · Accepted Answer · 2017-07-21 08:08:53Z

1

It's actually pretty straightforward:

A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type (probably you can just use String. If you want to be a bit more general TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.

B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.

For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.

A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.

edited Jul 21, 2017 at 8:08

answered Mar 18, 2013 at 18:40

SO is dead

8,7352 gold badges29 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Stahli Over a year ago

Thanks for your quick and helpful advise - it works now. The only thing I changed was extending AbstractParser and Parser instead of AbstractStreamingParser (like i.e. BitVectorLabelParser) since I got [1] when I produced Strings with a AbstractStreamingParser subclass. I will research on indexing algorithms in case I need it. [1] NoSupportedDataTypeException: No data type found satisfying: String Available types: DBID...

SO is dead Over a year ago

As for [1], you probably did not emit a Event.META_CHANGED event. But other than that, would you consider contributing your code to ELKI, so that others can use it, too? I'm sure you are not the only one interested in Levenshtein distance and Strings.

Collectives™ on Stack Overflow

Clustering string data with ELKI

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related