2

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I

(a) load string data in ELKI from a file (only "Labels")?

(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)

Some code snippets or example input files would be helpful.

1 Answer 1

1

It's actually pretty straightforward:

A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type (probably you can just use String. If you want to be a bit more general TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.

B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.

For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.

A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for your quick and helpful advise - it works now. The only thing I changed was extending AbstractParser and Parser instead of AbstractStreamingParser (like i.e. BitVectorLabelParser) since I got [1] when I produced Strings with a AbstractStreamingParser subclass. I will research on indexing algorithms in case I need it. [1] NoSupportedDataTypeException: No data type found satisfying: String Available types: DBID...
As for [1], you probably did not emit a Event.META_CHANGED event. But other than that, would you consider contributing your code to ELKI, so that others can use it, too? I'm sure you are not the only one interested in Levenshtein distance and Strings.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.