regular expression findall() in Python

Question

If I have this string:

s = "this, that; talk, love, hate; good, bad, all good."

And I want to extract the items separated by , ; or .

So the result I want is:

["this", "that", "talk", "love", "hate", "good", "bad", "all good"]

If I use this Python regular expression:

re.findall(r"([a-z]+[,;.])+", s)

I get the result:

['this,', 'that;', 'talk,', 'love,', 'hate;', 'good,', 'bad,', 'good.']

which is close to what I want, except for the last item.

Strangely, if I include a space in the first square bracket, as in:

re.findall(r"([a-z ]+[,;.])+", s)

then I only get this result:

[' all good.']

But findall() is supposed to find all results, no? Can someone explain this strange behavior?

Thanks for all your answers; I can now solve the problem. Originally I was confused about findall(): I thought it returned the different instances of (xyz)+, but it actually tries to find the pattern "afresh" from the last position. I guess there is no way to make an re return all the instances matched by a "+"?
– Yan King Yin, commented Jul 3, 2013 at 3:16
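
The behavior described in this comment can be reproduced with a short sketch. It assumes the same string s as in the question; the variable names m and pieces are only for illustration. When a capturing group is repeated, the standard re module keeps only the last repetition, and findall() returns the group rather than the full match.

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# With the space inside the character class, the repeated group can chain
# across the entire string, so there is only ONE match.
m = re.search(r"([a-z ]+[,;.])+", s)
print(m.group(0) == s)  # True: the full match covers the whole string
print(m.group(1))       # ' all good.' (only the last repetition is kept)

# Without the outer group and "+", each item is a separate match,
# so findall() returns one entry per item (separators still attached).
pieces = re.findall(r"[a-z ]+[,;.]", s)
print(pieces)
```

So findall() does find all matches; there is just a single match here, and its one capturing group holds the final repetition.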

murgatroid99 · Accepted Answer · 2013-07-03 02:10:20Z

3

Your goal is to split a string into tokens on a separator, so re.split() is a better fit than re.findall(). In this case, you can use

>>> re.split(r"[,;.]\s", s)
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good.']

Unfortunately, this method either leaves the period at the end of the last item, if you use [,;.]\s as the regular expression, or adds an empty string at the end of the result list, if you instead use [,;.]\s? as the regular expression. We can deal with the latter, however, by removing the last string:

>>> re.split(r"[,;.]\s?", s)[:-1]
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']
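
A possible variation, as a sketch only: slicing off the last element assumes the string always ends with a separator. Filtering out empty strings instead avoids dropping a real item when it does not. The helper name split_items is hypothetical and not part of the answer above.

```python
import re

def split_items(text):
    """Split on , ; or . (plus any following whitespace) and drop empty pieces."""
    return [piece for piece in re.split(r"[,;.]\s*", text) if piece]

s = "this, that; talk, love, hate; good, bad, all good."
print(split_items(s))
# ['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

# No trailing separator: the [:-1] approach would lose 'that' here.
print(split_items("this, that"))
# ['this', 'that']
```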

edited Jul 3, 2013 at 2:10

answered Jul 3, 2013 at 1:50

Elazar · Accepted Answer · 2013-07-03 02:05:09Z

1

You can use lookahead:

>>> list(re.findall(r"([a-z][a-z ]+(?=[,;.]))+", s))
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

But re.split() recommended by @murgatroid99 is better.
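
For what it is worth, the lookahead idea can probably be simplified. The sketch below is an assumption about what is needed rather than a definitive pattern: it drops the outer group and "+" (they do not change the result for this string) and uses * instead of + so that single-letter items also match.

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# An item starts with a letter, continues with letters or spaces, and must
# be followed by a separator. The lookahead checks the separator without
# consuming it, so it never ends up in the result.
items = re.findall(r"[a-z][a-z ]*(?=[,;.])", s)
print(items)
# ['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

# Single-letter items are found too.
print(re.findall(r"[a-z][a-z ]*(?=[,;.])", "a, b."))
# ['a', 'b']
```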

edited Jul 3, 2013 at 2:05

answered Jul 3, 2013 at 1:51

1 Comment

murgatroid99 Over a year ago

In the example output, he wanted "all good" as the last string, not "good", and you dropped the word "all" entirely

jkloo · Accepted Answer · 2014-02-04 14:22:10Z

1

You can use:

re.findall(r'[\w\s]+', s)
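
Note that, run as written, this pattern keeps the space that follows each separator at the start of the next item. A minimal sketch of one way to tidy that up, assuming a strip() afterwards is acceptable (the names raw and items are illustrative):

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# Each match is a run of word characters and whitespace, so every item
# after the first keeps its leading space.
raw = re.findall(r"[\w\s]+", s)
print(raw)
# ['this', ' that', ' talk', ' love', ' hate', ' good', ' bad', ' all good']

# Strip the surrounding whitespace to get the list asked for in the question.
items = [piece.strip() for piece in raw]
print(items)
# ['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']
```

Keep in mind that \w also matches digits and underscores, so this is looser than [a-z].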

answered Feb 4, 2014 at 14:22

AMADANON Inc. · Accepted Answer · 2013-07-03 20:25:08Z

0

The + (just before the closing quote) is outside the parentheses, so it repeats the whole group. Put the repetition inside the group instead, thus:

re.findall(r"\s*([a-z ]+)[ ,;.]+", s)
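
A quick check of this pattern against the string from the question, as a sketch. Because the pattern contains exactly one capturing group, findall() returns only what that group captured, so the separators are left out. One limitation to be aware of: an item with no separator after it is not returned.

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."
pattern = r"\s*([a-z ]+)[ ,;.]+"

# findall() returns the contents of the single capturing group per match.
items = re.findall(pattern, s)
print(items)
# ['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

# The final item needs a trailing separator, otherwise it is skipped.
print(re.findall(pattern, "this, that"))
# ['this']
```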

edited Jul 3, 2013 at 20:25

answered Jul 3, 2013 at 1:42

5 Comments

Elazar Over a year ago

it will match the whole bracketed expression any number>0 of times

Karoly Horvath Over a year ago

I don't quite understand the result he gets.. shouldn't that match the whole string?

Elazar Over a year ago

['this,', ' that;', ' talk,', ' love,', ' hate;', ' good,', ' bad,', ' all good.']. it simply doesn't do the job.

AMADANON Inc. Over a year ago

Sorry, thought that that was what he wanted. Edited to fix.

Elazar Over a year ago

This one does not work correctly with strings that begin with a space.
