I'm trying to locate all index positions of a string in a list of words and I want the values returned as a list. I would like to find the string if it is on its own, or if it is preceded or followed by punctuation, but not if it is a substring of a larger word.

The following code captures only "cow" and misses both "test;cow" and "cow.":

myList = ['test;cow', 'one', 'two', 'three', 'cow.', 'cow', 'acow']
myString = 'cow'
indices = [i for i, x in enumerate(myList) if x == myString]
print indices
>> [5]

I have tried changing the code to use a regular expression:

import re
myList = ['test;cow', 'one', 'two', 'three', 'cow.', 'cow', 'acow']
myString = 'cow'
indices = [i for i, x in enumerate(myList) if x == re.match('\W*myString\W*', myList)]
print indices

But this gives an error: expected string or buffer

If anyone knows what I'm doing wrong I'd be very happy to hear. I have a feeling it's something to do with the fact I'm trying to use a regular expression in there when it's expecting a string. Is there a solution?

The output I'm looking for should read:

>> [0, 4, 5]

Thanks

2 Answers

You don't need to compare x with the result of the match; the match result itself can serve as the condition. Also, the match should be run against x rather than against the whole list.

Also, you need to use re.search instead of re.match, since the regex pattern '\W*myString\W*' will not match the first element. That's because re.match only matches at the start of the string, and test; is not matched by \W*. Actually, you only need to test the characters immediately preceding and following the string, not the complete element.

So you can use word boundaries around the string instead:

pattern = r'\b' + re.escape(myString) + r'\b'
indices = [i for i, x in enumerate(myList) if re.search(pattern, x)]
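
For completeness, here is a minimal self-contained sketch of the word-boundary approach run against the list from the question. The helper name find_word_indices is only illustrative and does not come from the original code, and print() is written with parentheses so the snippet runs on both Python 2 and Python 3.

```python
import re

def find_word_indices(words, target):
    """Return the indices of items that contain target as a whole word."""
    # \b matches at a transition between a word character and a non-word
    # character (or the edge of the string), so 'acow' is rejected while
    # 'test;cow' and 'cow.' are accepted.
    pattern = re.compile(r'\b' + re.escape(target) + r'\b')
    return [i for i, x in enumerate(words) if pattern.search(x)]

myList = ['test;cow', 'one', 'two', 'three', 'cow.', 'cow', 'acow']
print(find_word_indices(myList, 'cow'))  # [0, 4, 5]
```

Two caveats: \b treats the underscore as a word character, so an item such as 'cow_' would not match, and the approach assumes the search string itself starts and ends with a word character.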

There are a few problems with your code. First, you need to match the expression against the list element (x), not against the whole list (myList). Second, to insert a variable into the expression, you have to use + (string concatenation). And finally, use raw string literals (r'\W') so that backslashes in the expression are interpreted properly:

import re
myList = ['test;cow', 'one', 'two', 'three', 'cow.', 'cow', 'acow']
myString = 'cow'
indices = [i for i, x in enumerate(myList) if re.match(r'\W*' + myString + r'\W*', x)]
print indices

If there is a chance that myString contains special regex characters (like a backslash or a dot), you'll also need to apply re.escape to it:

regex = r'\W*' + re.escape(myString) + r'\W*'
indices = [i for i, x in enumerate(myList) if re.match(regex, x)]
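
To illustrate why escaping matters, here is a small sketch that uses a made-up list and a made-up search string containing a dot (neither is from the question). Without re.escape the dot matches any character, which produces a false positive.

```python
import re

items = ['a.c', 'abc', 'a.c!']  # made-up data for illustration
target = 'a.c'                  # made-up search string containing a dot

# Unescaped: '.' matches any character, so 'abc' is matched as well.
unescaped = [i for i, x in enumerate(items)
             if re.match(r'\W*' + target + r'\W*', x)]

# Escaped: '.' is treated literally, so only the items with a real dot match.
escaped = [i for i, x in enumerate(items)
           if re.match(r'\W*' + re.escape(target) + r'\W*', x)]

print(unescaped)  # [0, 1, 2]
print(escaped)    # [0, 2]
```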

As pointed out in the comments, the following might be a better option:

regex = r'\b' + re.escape(myString) + r'\b'
indices = [i for i, x in enumerate(myList) if re.search(regex, x)]
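
The difference between the two patterns can be seen by running both on the list from the question. In this sketch one extra item, 'cows', has been added for illustration; it is not part of the original data.

```python
import re

# The list from the question plus one extra made-up item, 'cows'.
words = ['test;cow', 'one', 'two', 'three', 'cow.', 'cow', 'acow', 'cows']
target = re.escape('cow')

# re.match is anchored at the start only: it misses 'test;cow' and,
# because nothing anchors the end, it also accepts 'cows'.
with_match = [i for i, x in enumerate(words)
              if re.match(r'\W*' + target + r'\W*', x)]

# re.search with \b looks anywhere in the item for the whole word.
with_search = [i for i, x in enumerate(words)
               if re.search(r'\b' + target + r'\b', x)]

print(with_match)   # [4, 5, 7]
print(with_search)  # [0, 4, 5]
```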

5 Comments

Maybe add re.escape too?
This doesn't match the first element, which the OP wants to match.
Another issue is that the regex doesn't actually provide the output the OP expects (it doesn't match test;cow, for example). I think re.search(r'\b' + myString + r'\b', x) might work.
Thanks for this. I ran into trouble with the r'\b*' which was returning the error "nothing to repeat", as noted in the comment above.
@Adam: yeah, my bad, should be \b not \b*.
