I'm not sure whether this belongs on Meta or on Stack Overflow, but I have a very large list of strings and would like to measure the similarity between them, so that I can extract the most similar groups and rewrite each group as a regex to save space.
Right now I am going through the list and screening it by hand, which is slow.
Is there a function in Python that takes a list and groups the strings by similarity? I have scikit-learn installed, but I would rather not write my own program if one already exists.
Is there something in NLTK for this?
For example, given a scrambled list, I would like to get it back grouped like this (or as some other organized dataset):
cat
hat
bat
rat
snail
mail
fail
pail
rhino
dino
Milhouse
from which I would then write the regexes by hand:
patterns = ['^(c|h|b|r)at$', '^(sn|m|f|p)ail$', '^(rh|d)ino$', 'Milhouse']
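
To make the last step concrete, here is a rough sketch of how I imagine a group being turned into a pattern once the grouping is known. It assumes the members of a group differ only in their leading characters and share a common ending, which is true for my example but may not hold for the real data. The name group_to_regex is just something I made up for illustration; the grouping itself is the part I am still missing.

```python
import re
from os.path import commonprefix


def group_to_regex(words):
    """Collapse words that share a common suffix into one anchored pattern.

    Hypothetical helper: assumes the words differ only at the start.
    """
    if len(words) == 1:
        return '^' + re.escape(words[0]) + '$'
    # commonprefix works character by character, so applying it to the
    # reversed words and reversing the result gives the common suffix.
    suffix = commonprefix([w[::-1] for w in words])[::-1]
    # Whatever precedes the shared suffix becomes one alternative each.
    heads = [re.escape(w[:len(w) - len(suffix)]) for w in words]
    return '^(' + '|'.join(heads) + ')' + re.escape(suffix) + '$'


groups = [
    ['cat', 'hat', 'bat', 'rat'],
    ['snail', 'mail', 'fail', 'pail'],
    ['rhino', 'dino'],
    ['Milhouse'],
]
patterns = [group_to_regex(g) for g in groups]
```

With the groups written out by hand like this, the result matches the patterns above, except that the single-word group also gets the ^ and $ anchors.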