
I'm not sure whether this belongs on Meta or on Stack Overflow, but I have a very large list of strings and would like to measure the similarity between them, so that I can extract the most similar groups and rewrite each group as a regex to save space.

Right now I am going through the list and screening it by hand, which is slow.

Is there a function in Python that takes a list and groups the strings by similarity? I have scikit-learn installed, but I do not want to write my own program if one already exists.

Would there be something in NLTK for this?

For example, given a scrambled list, I would like to get back something like this (or an otherwise organized dataset):

Cat
hat
bat
rat

snail
mail
fail
pail

rhino
dino

Milhouse

I would then write a regex for each group:

patterns = ['^(c|h|b|r)at$', '^(sn|m|f|p)ail$', '^(rh|d)ino$', 'Milhouse']
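Once the groups are known, patterns of this shape could in principle be generated rather than written by hand. The following is only a minimal sketch, assuming the words in a group differ only in their leading characters and share a common suffix; `suffix_pattern` is a hypothetical helper name, not an existing library function.

```python
import os
import re

def suffix_pattern(group):
    """Build an anchored regex for words that share a common suffix.

    Hypothetical helper: assumes the words differ only at the start.
    """
    if len(group) == 1:
        return "^" + re.escape(group[0]) + "$"
    # os.path.commonprefix compares strings character by character,
    # so applying it to reversed words yields the reversed common suffix.
    reversed_suffix = os.path.commonprefix([w[::-1] for w in group])
    suffix = reversed_suffix[::-1]
    heads = [w[: len(w) - len(suffix)] for w in group]
    alternatives = "|".join(re.escape(h) for h in heads)
    return "^(" + alternatives + ")" + re.escape(suffix) + "$"

print(suffix_pattern(["cat", "hat", "bat", "rat"]))       # ^(c|h|b|r)at$
print(suffix_pattern(["snail", "mail", "fail", "pail"]))  # ^(sn|m|f|p)ail$
print(suffix_pattern(["rhino", "dino"]))                  # ^(rh|d)ino$
```

Groups whose members differ in the middle or at the end would need a different strategy, so this covers only the rhyming-word case shown above.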

1 Answer

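I don't know whether NLTK has this, but it sounds like what Burkhard-Keller trees (BK-trees) are for: they index items by a distance metric such as edit distance, so that all items within a given distance of a query can be found without comparing against every entry. I don't think they're in the standard library, but there's at least one Python implementation of them available.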
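To make the idea concrete, here is a minimal BK-tree sketch, assuming Levenshtein (edit) distance as the metric. The names `levenshtein` and `BKTree` are illustrative and do not come from any particular library or from the implementation mentioned above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

class BKTree:
    """Each node is (word, children), with children keyed by distance."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(levenshtein)
for w in ["cat", "hat", "bat", "rat", "snail", "mail", "fail", "pail",
          "rhino", "dino", "Milhouse"]:
    tree.add(w)

print(tree.query("cat", 1))  # [(0, 'cat'), (1, 'bat'), (1, 'hat'), (1, 'rat')]
```

A query returns the neighbours of one word; to get groups, you could query each word in turn and collect the words that come back together.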

If you want to stick to the standard library, you could try difflib.get_close_matches(), but it might be slower.
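As a rough sketch of the standard-library route, the loop below greedily groups words with `difflib.get_close_matches()`. The function name `group_by_similarity` and the `cutoff=0.6` value are assumptions chosen for illustration. Each seed word is compared against every remaining word, so this may be slow on very large lists.

```python
from difflib import get_close_matches

def group_by_similarity(words, cutoff=0.6):
    """Greedy grouping: each seed word pulls in its close matches."""
    remaining = list(dict.fromkeys(words))  # drop exact duplicates, keep order
    groups = []
    while remaining:
        seed = remaining.pop(0)
        # get_close_matches requires n > 0, so skip the call when nothing is left.
        matches = (get_close_matches(seed, remaining,
                                     n=len(remaining), cutoff=cutoff)
                   if remaining else [])
        groups.append([seed] + matches)
        matched = set(matches)
        remaining = [w for w in remaining if w not in matched]
    return groups

words = ["Cat", "hat", "bat", "rat", "snail", "mail", "fail", "pail",
         "rhino", "dino", "Milhouse"]
for group in group_by_similarity(words):
    print(sorted(group))
```

The result depends on the order of the input and on the cutoff, since a word joins the first seed it is close enough to. Lowering the cutoff merges more words into each group; raising it splits them.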


2 Comments

Yes, I think that is what I want. I just need to get back groups of words so that I can write the patterns for them by hand, which saves me the trouble of going through large sets manually.
I found an implementation of the BK-tree but do not understand the results: code.activestate.com/recipes/572156-bk-tree
