
If I have this string:

s = "this, that; talk, love, hate; good, bad, all good."

And I want to extract the items separated by , ; or .

So the result I want is:

["this", "that", "talk", "love", "hate", "good", "bad", "all good"]

If I use this Python regular expression:

re.findall(r"([a-z]+[,;.])+", s)

I get the result:

['this,', 'that;', 'talk,', 'love,', 'hate;', 'good,', 'bad,', 'good.']

which is close to what I want, except for the last item.

Strangely, if I include a space in the first square bracket, as in:

re.findall(r"([a-z ]+[,;.])+", s)

then I only get this result:

[' all good.']

But findall() is supposed to find all results, no? Can someone explain this strange behavior?

  • re.split() may be better for your use case here. Commented Jul 3, 2013 at 1:45
  • Thanks for all your answers, I can now solve the problem. Originally I was confused about findall(): I thought it returned the different instances of (xyz)+, but it actually looks for the pattern afresh from the end of the previous match. I guess there is no way to make a regular expression return all the instances matched by a "+"? Commented Jul 3, 2013 at 3:16

4 Answers


Your goal is to split a string into tokens by a separator, so a better way to do this than with re.findall() is with re.split(). In this case, you can use

>>> re.split(r"[,;.]\s", s)
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good.']

Unfortunately, this method either leaves the period at the end of the last item, if you use [,;.]\s as the regular expression, or adds an empty string at the end of the result list, if you instead use [,;.]\s? as the regular expression. We can deal with the latter, however, by removing the last string:

>>> re.split(r"[,;.]\s?", s)[:-1]
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']
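
Slicing off the last element assumes the string always ends with a separator. As a rough sketch of an alternative (the variable name items and the \s* variant of the pattern are my own choices, not part of the answer above), you can filter out empty pieces instead, which behaves the same whether or not there is a trailing period:

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# Split on one separator plus any whitespace after it,
# then drop the empty string left by the trailing period.
items = [piece for piece in re.split(r"[,;.]\s*", s) if piece]
print(items)
```

The trade-off is that genuinely empty fields (for example "a,,b") are also dropped, which may or may not be what you want.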


You can use lookahead:

>>> list(re.findall(r"([a-z][a-z ]+(?=[,;.]))+", s))
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

But re.split() recommended by @murgatroid99 is better.

1 Comment

In the example output, he wanted "all good" as the last string, not "good", and you dropped the word "all" entirely

You can use:

re.findall(r'[\w\s]+', s)
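
Note that with the question's string this pattern keeps the space that follows each separator, so every item after the first comes back with a leading space. A minimal sketch of one way to tidy that up, assuming stripping surrounding whitespace is acceptable (the names raw and items are just for illustration):

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# Each match is a run of word characters and whitespace,
# so the space after a separator ends up at the front of the next item.
raw = re.findall(r"[\w\s]+", s)

# Strip the surrounding whitespace from every item.
items = [piece.strip() for piece in raw]
print(raw)
print(items)
```

Also keep in mind that \w matches digits and underscores as well as letters, which is broader than the [a-z] used in the question.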


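The + just before the closing quote is outside the parentheses, so it repeats the whole group, and findall() returns only the last repetition of that group. Put the + inside the group and keep the separators outside it, thus: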

re.findall(r"\s*([a-z ]+)[ ,;.]+", s)
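
To see why the original pattern returns a single item, here is a small sketch using re.search() on the string from the question (the variable names m, whole, last and items are only for illustration). It shows that the second pattern from the question matches the entire string in one go, and that the group holds just its final repetition:

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# With the + outside the group, a single match spans the whole string...
m = re.search(r"([a-z ]+[,;.])+", s)
whole = m.group(0)
# ...and the group remembers only its final repetition,
# which is the one value findall() reports for that match.
last = m.group(1)

# With the + inside the group and the separators outside,
# every item becomes a separate match.
items = re.findall(r"\s*([a-z ]+)[ ,;.]+", s)
print(whole == s, repr(last))
print(items)
```

When a pattern contains exactly one group, findall() returns that group's text for each match rather than the full match, so one match covering everything yields a one-element list.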

5 Comments

It will match the whole parenthesized expression one or more times.
I don't quite understand the result he gets. Shouldn't that match the whole string?
It returns ['this,', ' that;', ' talk,', ' love,', ' hate;', ' good,', ' bad,', ' all good.']. It simply doesn't do the job.
Sorry, I thought that was what he wanted. Edited to fix.
This one does not work correctly with strings that begin with a space.
