Python findall does not return expected values

Question

I have some strings that contain information between double quotes, like:

cc "1/11/2A" "1/20+21/1 1" "XX" 0

I am using re.findall('\"*\"', line) to match the parts between quotes, but it doesn't work for some reason. I tried many other things, but all I get is an empty list. What am I doing wrong?

Martijn Pieters · Accepted Answer · 2013-02-08 11:57:01Z

4

You are matching 0 or more quotes followed by a quote. Use a negative character class instead:

re.findall(r'"[^"]*"', line)

You may want to put a capturing group around the negative character class:

re.findall(r'"([^"]*)"', line)

and now .findall() returns everything within quotes, not including the quotes themselves:

>>> import re
>>> re.findall(r'"([^"]*)"', 'cc "1/11/2A" "1/20+21/1 1" "XX" 0')
['1/11/2A', '1/20+21/1 1', 'XX']

The [^...] negative character class notation means: match any character that is not included in the set of characters named here. [^"] thus matches any character that is not a quote, neatly limiting the matched characters to everything that is within quotes.
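To see what the character class matches on its own, here is a minimal sketch using the sample line from the question (the variable names are illustrative only). It first pulls out every run of non-quote characters, then shows how placing the class between two literal quotes narrows the result to the quoted parts.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# [^"]+ alone: every non-empty run of characters that are not a quote,
# whether or not the run sits between quotes.
runs = re.findall(r'[^"]+', line)
print(runs)
# ['cc ', '1/11/2A', ' ', '1/20+21/1 1', ' ', 'XX', ' 0']

# Placing the class between two literal quotes keeps only the quoted parts.
quoted = re.findall(r'"([^"]*)"', line)
print(quoted)
# ['1/11/2A', '1/20+21/1 1', 'XX']
```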

edited Feb 8, 2013 at 11:57

answered Feb 8, 2013 at 11:47

Martijn Pieters


5 Comments

abarnert Over a year ago

You could just use .*? instead of [^"]* here; it might be easier to understand.

Martijn Pieters Over a year ago

I'd rather be explicit with 'anything that is not a quote'; I'm not sure greedy vs. non-greedy is any easier to grasp.

abarnert Over a year ago

I'm not sure either. In this case, I think it's closer to what the OP might have been trying to accomplish, but… really, that's a stab in the dark.

Janne Karila Over a year ago

@abarnert The problem with .*? is that it will match a quote if it must. With this regex that won't happen, but in general it is better to be explicit.

abarnert Over a year ago

@JanneKarila: That's true. Since the OP's intent is ambiguous in English (which is fine, because it doesn't matter), it's hard to say which regex is a better translation for the intent… but you're probably right that the ^" case will come up more often than the .*? case when it matters.
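A small sketch of the case Janne Karila describes, where .*? matches a quote because it must. The input string and the trailing " =" in both patterns are made up for illustration; they are not from the question.

```python
import re

# Hypothetical input: two quoted items, only the second is followed by " =".
text = '"a" "b" = 1'

# Non-greedy dot: starts at the first quote and keeps expanding, across
# quote characters, until the rest of the pattern can match.
lazy = re.findall(r'"(.*?)" =', text)
print(lazy)    # ['a" "b']

# Negated character class: can never cross a quote, so only the item
# that is directly followed by " =" is returned.
strict = re.findall(r'"([^"]*)" =', text)
print(strict)  # ['b']
```

With the pattern from the question nothing follows the closing quote, so both forms give the same result there.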

Lev Levitsky · Answer · 2013-02-08 11:47:24Z

2

It should be r'"[^"]*"'. Your pattern matches one or more " characters in a row.

In [4]: re.findall(r'"[^"]*"', line)
Out[4]: ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

answered Feb 8, 2013 at 11:47

Lev Levitsky


1 Comment

Lev Levitsky Over a year ago

Martijn is online. I might as well go do something other than answering on SO :)

abarnert · Answer · 2013-02-11 18:05:48Z

2

It looks like you were expecting * to match "anything", the way it does in filename wildcards.

But that's not what it means in regex. It modifies the preceding expression, to match zero or more copies of that expression.
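As a rough illustration of that rule, this is what the pattern from the question does on the sample line as quoted there (the second input string is invented to show the repetition):

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# '"*"' reads as: zero or more quote characters, then one quote character.
# The sample line never has two quotes in a row, so each match is one quote.
matches = re.findall('"*"', line)
print(matches)
# ['"', '"', '"', '"', '"', '"']

# With consecutive quotes (made-up input), the star absorbs the extra ones.
runs = re.findall('"*"', 'a""b"""c')
print(runs)
# ['""', '"""']
```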

To get filename-style wildcard, you want to use .*.

However, that won't actually work, because . matches anything—including ". So, it will grab everything up to the very last " character, leaving only that for the rest of the expression, meaning findall will find one big string instead of three small ones.
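A quick sketch of that greedy behaviour on the sample line from the question:

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Greedy .* runs to the end of the line, then backs up only as far as
# the last quote, so everything between the first and last quote is
# returned as a single match.
greedy = re.findall(r'".*"', line)
print(greedy)
# ['"1/11/2A" "1/20+21/1 1" "XX"']
```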

You can fix that by making the repetition non-greedy, with .*?. This will match everything up to the first ".

So:

>>> re.findall('\".*?\"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

I think Martijn Pieters's answer is probably conceptually clearer; I've only offered this because I think this may be the way you were trying to attack the problem, and I wanted to show how you could have gotten there.

As a side note, regex code is much easier to read if you use raw strings, so you can get rid of the excess backslash escapes. In this case, the backslashes are already unnecessary—you don't need to escape double-quotes in either a single-quoted string or a regex. But instead of trying to remember what does and doesn't need to be escaped by the Python parser so it can get to the regex parser, it's easier to just use raw strings. So:

>>> re.findall(r'".*?"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
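To make the escaping point concrete, here is a short sketch comparing the literals themselves (the variable names are illustrative, and the \d and \n comparisons are extra examples not taken from the question):

```python
# In a single-quoted literal, \" is just a quote character, so the
# backslashes in the original pattern change nothing.
escaped = '\".*?\"'
plain = '".*?"'
same_pattern = escaped == plain
print(same_pattern)   # True

# Raw strings matter once the regex itself needs a backslash:
# r'\d+' is the same string as '\\d+'.
same_digits = r'\d+' == '\\d+'
print(same_digits)    # True

# Without the r prefix, Python interprets some escapes before the regex
# engine ever sees them: '\n' is one newline character, r'\n' is two characters.
lengths = (len('\n'), len(r'\n'))
print(lengths)        # (1, 2)
```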

edited Feb 11, 2013 at 18:05

answered Feb 8, 2013 at 11:56

abarnert


2 Comments

Martijn Pieters Over a year ago

The backslashes for the " double quote characters are not needed in the non-raw-string-literal case either.

Martijn Pieters Over a year ago

Completely unrelated, have you found the Python Chat room here on SO yet?
