Python finding regex patterns for a large list of strings

Question

Not sure if this belongs on Meta or on Stack Overflow, but I have a very large list of strings and would like to measure the similarity between them, so that I can extract the most similar groups and rewrite each group as a regex to save space.

Right now I am slowly screening the list by hand.

Is there a function in Python that takes a list and groups the strings by similarity? I have scikit-learn, but I do not want to write my own program if one already exists.

Would there be something in NLTK for this?

For example, given a scrambled list, I would like to get back something like this, or an organized dataset:

cat
hat
bat
rat

snail
mail
fail
pail

rhino
dino

Milhouse

for which I would then write the regexes:

patterns = ['^(c|h|b|r)at$', '^(sn|m|f|p)ail$', '^(rh|d)ino$', 'Milhouse']
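Once the groups exist, the last step (turning a group into a pattern) can be automated when the words share an ending, as they do in the example. The following is only a minimal sketch under that assumption; `suffix_pattern` is a hypothetical helper name, and it anchors every pattern, including the single-word one.

```python
import os
import re

def suffix_pattern(words):
    """Build an anchored regex for a group of words sharing a common suffix."""
    if len(words) == 1:
        return "^" + re.escape(words[0]) + "$"
    # os.path.commonprefix compares character by character, so reversing
    # each word turns "longest common prefix" into "longest common suffix".
    suffix = os.path.commonprefix([w[::-1] for w in words])[::-1]
    # Whatever precedes the shared suffix becomes one alternative.
    heads = [re.escape(w[:len(w) - len(suffix)]) for w in words]
    return "^(" + "|".join(heads) + ")" + re.escape(suffix) + "$"

groups = [
    ["cat", "hat", "bat", "rat"],
    ["snail", "mail", "fail", "pail"],
    ["rhino", "dino"],
    ["Milhouse"],
]
patterns = [suffix_pattern(g) for g in groups]
print(patterns)
```

Groups that share a prefix or a middle section instead would need a different construction; this only covers the shared-suffix case shown above.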

dstromberg · Accepted Answer · 2014-01-25 03:16:04Z


I don't know if NLTK has this or not, but this sounds like what Burkhard-Keller Trees are for. I don't think they're in the standard library, but there's at least one Python implementation of them available.
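To make the idea concrete, here is a minimal BK-tree sketch. It assumes Levenshtein edit distance as the metric, and the `BKTree` class and its method names are made up for illustration; they are not taken from any particular published implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree; a node is (word, {distance: child})."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]  # descend along the edge labelled d

    def query(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                found.append((d, candidate))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(found)


words = ["cat", "hat", "bat", "rat", "snail", "mail", "fail", "pail",
         "rhino", "dino", "Milhouse"]
tree = BKTree()
for w in words:
    tree.add(w)

near_cat = tree.query("cat", 1)
near_mail = tree.query("mail", 1)
print(near_cat)
print(near_mail)
```

Each query returns the stored words within the given edit distance, paired with that distance. Querying with every word in turn and collecting the neighbours is one way to get candidate groups to write patterns for.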

If you want to stick to the standard library, you could try difflib.get_close_matches(), but it might be slower.
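A rough sketch of how grouping with `difflib.get_close_matches()` could look. The greedy strategy and the `group_similar` name are my own assumptions, not something difflib provides, and the `cutoff` of 0.6 is simply the function's default and may need tuning for real data.

```python
import difflib

def group_similar(words, cutoff=0.6):
    """Greedily group words: take a seed, pull in its close matches, repeat."""
    remaining = list(words)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        # n must be positive even when nothing is left to compare against.
        matches = difflib.get_close_matches(
            seed, remaining, n=max(1, len(remaining)), cutoff=cutoff)
        groups.append([seed] + matches)
        # Words that joined a group are not considered again.
        remaining = [w for w in remaining if w not in matches]
    return groups

words = ["cat", "snail", "rhino", "hat", "mail", "dino", "bat", "fail",
         "Milhouse", "rat", "pail"]
groups = group_similar(words)
for group in groups:
    print(group)
```

Every seed is compared against all remaining words, so this is roughly quadratic in the length of the list, which is where the "might be slower" comes from. The result also depends on the input order, since each word is only ever matched against the current seed.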


2 Comments

user3084006 Over a year ago

Yes, I think that is what I want. I just need it to return groups of words, so that I can write the patterns manually and save myself the trouble of going over large sets by hand.

user3084006 Over a year ago

I found an implementation of the BK tree, but I do not understand the results: code.activestate.com/recipes/572156-bk-tree
