I have some strings that contain information between double quotes, like:

cc "1/11/2A" "1/20+21/1 1" "XX" 0

I am using re.findall('\"*\"', line) to match the parts between quotes, but it doesn't work for some reason. I tried many other things, but all I get is an empty list. What am I doing wrong?

3 Answers

4

You are matching 0 or more quotes followed by a quote. Use a negative character class instead:

re.findall(r'"[^"]*"', line)

You may want to put a capturing group around the negative character class:

re.findall(r'"([^"]*)"', line)

and now .findall() returns everything within quotes, not including the quotes themselves:

>>> import re
>>> re.findall(r'"([^"]*)"', 'cc "1/11/2A" "1/20+21/1 1" "XX" 0')
['1/11/2A', '1/20+21/1 1', 'XX']

The [^...] negative character class notation means: match any character that is not included in the set of characters named here. [^"] thus matches any character that is not a quote, neatly limiting the matched characters to everything that is within quotes.
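As a minimal sketch of how this pattern might be wrapped for reuse (the names QUOTED and quoted_fields are hypothetical, chosen only for illustration), note that an empty pair of quotes yields an empty string and a line without quotes yields an empty list:

```python
import re

# Hypothetical names for illustration; the pattern is the one from this answer.
QUOTED = re.compile(r'"([^"]*)"')

def quoted_fields(line):
    """Return the text found between each pair of double quotes."""
    return QUOTED.findall(line)

fields = quoted_fields('cc "1/11/2A" "1/20+21/1 1" "XX" 0')
empty_pair = quoted_fields('cc "" 0')        # [^"]* may match zero characters
no_quotes = quoted_fields('no quotes here')  # nothing to match

print(fields)
print(empty_pair)
print(no_quotes)
```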


5 Comments

You could just use .*? instead of [^"]* here; it might be easier to understand.
I'd rather be explicit with 'anything that is not a quote'; not sure if greedy vs. non-greedy is any easier to grasp.
I'm not sure either. In this case, I think it's closer to what the OP might have been trying to accomplish, but… really, that's a stab in the dark.
@abarnert The problem with .*? is that it will match a quote if it must. With this regex that won't happen, but in general it is better to be explicit.
@JanneKarila: That's true. Since the OP's intent is ambiguous in English (which is fine, because it doesn't matter), it's hard to say which regex is a better translation for the intent… but you're probably right that the ^" case will come up more often than the .*? case when it matters.
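
To illustrate the point raised in these comments, here is a small sketch in which something extra is required after the closing quote. The trailing ` 0` in the pattern is an assumption added purely for the demonstration; it is not part of the original question. The lazy version stretches across quotes to satisfy the rest of the pattern, while the negated class cannot:

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Lazy dot: starts at the first quote and expands until '" 0' can match,
# swallowing the intermediate quotes along the way.
lazy = re.findall(r'"(.*?)" 0', line)

# Negated class: can never cross a quote, so only the last field qualifies.
explicit = re.findall(r'"([^"]*)" 0', line)

print(lazy)
print(explicit)
```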
2

It should be r'"[^"]*"'. Your pattern matches one or more " characters in a row.

In [4]: re.findall(r'"[^"]*"', line)
Out[4]: ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

1 Comment

Martijn is online. I might as well go do something other than answer on SO :)
2

It looks like you were expecting * to match "anything", the way it does in filename wildcards.

But that's not what it means in regex. It modifies the preceding expression, to match zero or more copies of that expression.
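A quick sketch of what that means for the pattern in the question, using the sample line from the question: `"*"` reads as "zero or more quotes, then a quote", so it only ever matches runs of quote characters.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Each quote in the sample stands alone, so each one is matched separately.
runs = re.findall(r'"*"', line)
print(runs)

# Two adjacent quotes are consumed together as a single run.
adjacent = re.findall(r'"*"', 'a""b')
print(adjacent)
```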

To get filename-style wildcard, you want to use .*.

However, that won't actually work, because . matches any character (except a newline, by default), including ". So, it will grab everything up to the very last " character, leaving only that for the rest of the expression, meaning findall will find one big string instead of three small ones.
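Here is a short sketch of that greedy behaviour on the sample line from the question:

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Greedy .* runs to the end of the line, then backs up to the last quote,
# so the whole quoted region comes back as a single match.
greedy = re.findall(r'".*"', line)
print(greedy)
```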

You can fix that by making the repetition non-greedy, with .*?. This will match everything up to the first ".

So:

>>> re.findall('\".*?\"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

I think Martijn Pieters's answer is probably conceptually clearer; I've only offered this because I think this may be the way you were trying to attack the problem, and I wanted to show how you could have gotten there.

As a side note, regex code is much easier to read if you use raw strings, so you can get rid of the excess backslash escapes. In this case, the backslashes are already unnecessary: you don't need to escape double quotes in either a single-quoted string or a regex. But instead of trying to remember what does and doesn't need to be escaped by the Python parser so it can get to the regex parser, it's easier to just use raw strings. So:

>>> re.findall(r'".*?"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

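
A small sketch of where raw strings start to matter. The `\b` word-boundary pattern is an example chosen for illustration and is not something the question uses: in a normal string literal, `\b` is a backspace character, so the regex engine never sees a word boundary.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# In a normal literal, \" is just a quote, so these two literals are equal.
same_pattern = ('\".*?\"' == '".*?"')

# '\b' is a backspace in a normal literal, so this looks for backspace characters.
plain = re.findall('\bXX\b', line)

# In a raw literal the backslash reaches the regex engine as a word boundary.
raw = re.findall(r'\bXX\b', line)

print(same_pattern, plain, raw)
```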

2 Comments

The backslashes for the " double quote characters are not needed in the non-raw-string-literal case either.
Completely unrelated, have you found the Python Chat room here on SO yet?
