32

I'm looking for Python code using a regex that can do something like this:

Input: Regex should return "String 1" or "String 2" or "String3"

Output: String 1,String 2,String3

I tried r'"*"'

3
  • There could be quotes inside quotes, what would you do with that? Commented Mar 1, 2012 at 16:15
  • No, there won't be any quotes. Just simple strings with a-z, 0-9, whitespace, and underscores: mostly alphanumeric, without any single or double quotes inside them. Commented Mar 1, 2012 at 16:17
  • Does this answer your question? Extract string from between quotations Commented Dec 10, 2021 at 2:19

6 Answers

73

Here's all you need to do:

import re

def doit(text):
    # Capture the text between each pair of double quotes (non-greedy).
    matches = re.findall(r'"(.+?)"', text)
    # matches is now ['String 1', 'String 2', 'String3']
    return ",".join(matches)

doit('Regex should return "String 1" or "String 2" or "String3" ')

result:

'String 1,String 2,String3'

As pointed out by Li-aung Yip:

To elaborate, .+? is the "non-greedy" version of .+. It makes the regular expression match the smallest number of characters it can instead of the most characters it can. The greedy version, .+, will give String 1" or "String 2" or "String3; the non-greedy version .+? gives String 1, String 2, String3.

In addition, if you want to accept empty strings, change .+? to .*?. Star * means zero or more while plus + means at least one.
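
A minimal sketch comparing the greedy, non-greedy, and zero-or-more forms described above. The sample text is the question's input with an empty quoted string appended for illustration, and the variable names are only illustrative:

```python
import re

text = 'Regex should return "String 1" or "String 2" or "String3" and ""'

# Greedy: runs from the first quote to the last quote in the text.
greedy = re.findall(r'"(.+)"', text)

# Non-greedy, one or more characters between the quotes.
lazy_plus = re.findall(r'"(.+?)"', text)

# Non-greedy, zero or more characters: the trailing "" yields an empty string.
lazy_star = re.findall(r'"(.*?)"', text)

print(greedy)
print(lazy_plus)
print(lazy_star)
```

Here the greedy pattern returns a single long match, while only the .*? form picks up the empty quoted string at the end.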


3 Comments

To elaborate, .+? is the "non-greedy" version of .+. It makes the regular expression match the smallest number of characters it can instead of the most characters it can. The greedy version, .+, will give String 1" or "String 2" or "String3; the non-greedy version .+? gives String 1, String 2, String3.
Is it necessary to escape the double quotes, since you are using a raw string anyway? Wouldn't r'"(.+?)"' suffice? That seems to work on my system.
@Sam It's not necessary. I went ahead and removed them :)
7

The highly up-voted answer doesn't account for the possibility that the double-quoted string might contain one or more double-quote characters escaped with a backslash. To handle this situation, the regex needs to accumulate between the opening and closing double-quotes zero or more matches where each match is either an escaped character sequence (a backslash followed by any character) or any character that is not a double-quote. We further assume that the quoted string exists wholly on a single line and so we do not allow newline characters within our string.

r'"(?:\\.|[^"\n])*"'

  1. " matches a double-quote.
  2. (?: - start of a non-capture group.
  3. \\. - matches a backslash followed by any non-newline character representing an escaped character sequence.
  4. | - "or".
  5. [^"\n] - matches any character other than a double-quote or newline.
  6. ) - end of non-capture group.
  7. * - matches 0 or more occurrences of the previous group.
  8. " - matches a double-quote.

See Regex Demo

import re

def doit(text):
    print('input:', text)
    for i, match in enumerate(re.findall(r'"(?:\\.|[^"\n])*"', text), start=1):
        print(f'match {i}: {match}')
    print()

doit(r'Regex should return "String 1" or "String 2" or "String3" and "\"double quoted string\"" ')
doit(r'"abcdef\"ghij"')
doit(r'"abcdef\\"ghij"')
doit(r'"abcdef\\\"ghij"')

Prints:

input: Regex should return "String 1" or "String 2" or "String3" and "\"double quoted string\""
match 1: "String 1"
match 2: "String 2"
match 3: "String3"
match 4: "\"double quoted string\""

input: "abcdef\"ghij"
match 1: "abcdef\"ghij"

input: "abcdef\\"ghij"
match 1: "abcdef\\"

input: "abcdef\\\"ghij"
match 1: "abcdef\\\"ghij"
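
The matches above still include the surrounding quotes and the escaping backslashes. If only the inner content is wanted, one possible sketch is to add a capture group and then drop the backslash from each escape pair. This assumes a backslash simply makes the next character literal (so \n would become n, not a newline); the names pattern, inner, and unescaped are illustrative:

```python
import re

# Same regex as above, with a capture group around the content between the quotes.
pattern = re.compile(r'"((?:\\.|[^"\n])*)"')

text = r'say "String 1" and "\"quoted\"" and "back\\slash"'

# findall returns only the captured group, i.e. the content without the outer quotes.
inner = pattern.findall(text)

# Assumption: a backslash just escapes the following character, so drop it.
unescaped = [re.sub(r'\\(.)', r'\1', s) for s in inner]

for raw, clean in zip(inner, unescaped):
    print(raw, '->', clean)
```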

3 Comments

This approach seems to be unable to support a quoted string that ends with a backslash. No amount of backslashing-the-backslashes helped. My experiments here.
@FMc I have updated the regex.
@Gregory I have updated the regex.
4

To fetch double-quoted strings from a multiline string:

import re

s = """ 
"my name is daniel"  "mobile 8531111453733"[[[[[[--"i like pandas"
"location chennai"! -asfas"aadhaar du2mmy8969769##69869" 
@4343453 "pincode 642002""@mango,@apple,@berry" 
"""
print(re.findall(r'"(.*?)"', s))
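
One caveat worth sketching: by default . does not match a newline, so the pattern above only finds quoted strings that open and close on the same line. If a quoted string may span lines, the re.DOTALL flag is one option. The sample string here is made up for illustration:

```python
import re

# The second quoted string contains a line break.
s = '"one"\n"two\nlines"'

# Default: '.' stops at the newline, so the second string is not matched.
single_line = re.findall(r'"(.*?)"', s)

# With re.DOTALL, '.' also matches newlines.
across_lines = re.findall(r'"(.*?)"', s, flags=re.DOTALL)

print(single_line)
print(across_lines)
```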

1 Comment

This is the exact same solution as the accepted answer
1

From https://stackoverflow.com/a/69891301/1531728

My solution is:

import re
my_strings = [
    'SetVariables "a" "b" "c" ',
    'd2efw   f "first" +&%#$%"second",vwrfhir, d2e   u"third" dwedew',
    '"uno"?>P>MNUIHUH~!@#$%^&*()_+=0trewq"due"        "tre"fef    fre f',
    '       "uno""dos"      "tres"',
    '"unu""doua""trei"',
    '      "um"                    "dois"           "tres"                  ',
]
for current_test_string in my_strings:
    # Collect every double-quoted substring of the current test string.
    my_substrings = re.findall(r'\"(.+?)\"', current_test_string)
    print(" my_substrings are:", my_substrings, "=")

Alternate regular expressions to use are:

  • re.findall('"(.+?)"', current_test_string) [Avinash2021] [user17405772021]
  • re.findall('"(.*?)"', current_test_string) [Shelvington2020]
  • re.findall(r'"(.*?)"', current_test_string) [Lundberg2012] [Avinash2021]
  • re.findall(r'"(.+?)"', current_test_string) [Lundberg2012] [Avinash2021]
  • re.findall(r'"["]', current_test_string) [Muthupandi2019]
  • re.findall(r'"([^"]*)"', current_test_string) [Pieters2014]
  • re.findall(r'"(?:(?:(?!(?<!\\)").)*)"', current_test_string) # Causes double quotes to remain in the strings, but can be removed via other means. [Booboo2020]
  • re.findall(r'"(.*?)(?<!\\)"', current_test_string) [Hassan2014]
  • re.findall('"[^"]*"', current_test_string) # Causes double quotes to remain in the strings, but can be removed via other means. [Martelli2013]
  • re.findall('"([^"]*)"', current_test_string) [jspcal2014]
  • re.findall("'(.*?)'", current_test_string) [akhilmd2016]

The current_test_string.split("\"") approach also works, provided the quotation marks in the string are balanced. It uses the double quotation mark as a delimiter to tokenize the string, so the result contains both the substrings inside the quotation marks and the text outside them. With balanced quotes, the quoted substrings are the pieces at odd indices.
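
A small sketch of the split approach, assuming the quotes are balanced and never escaped. The names pieces and quoted are illustrative:

```python
text = 'SetVariables "a" "b" "c" '

# Splitting on the quote character alternates between text outside
# and text inside the quotation marks.
pieces = text.split('"')

# With balanced quotes, the quoted substrings sit at the odd indices.
quoted = pieces[1::2]

print(pieces)
print(quoted)
```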

References:


1

For me the only regex that ever worked right for all the cases of quoted strings with possibly escaped quotes inside of them was:

regex=r"""(['"])(?:\\\\|\\\1|[^\1])*?\1"""

This will not fail even if the quoted string ends with an escaped backslash.
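
A hedged usage sketch for the regex above; the sample text is made up. Because the pattern has a capture group for the opening quote, re.findall would return only that quote character, so re.finditer with group(0) is used to get the whole match. Note also that in Python's re module a numeric escape inside a character class is treated as a character rather than a backreference, so [^\1] matches nearly any character and the lazy *? together with the trailing \1 is what stops at the closing quote:

```python
import re

regex = r"""(['"])(?:\\\\|\\\1|[^\1])*?\1"""

# Sample text: an escaped double quote, an escaped single quote,
# and a string that ends with an escaped backslash.
text = r'''a "x\"y" b 'it\'s' c "tail\\" d'''

# group(0) is the full quoted string, including its surrounding quotes.
matches = [m.group(0) for m in re.finditer(regex, text)]

for m in matches:
    print(m)
```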

1 Comment

This answer is under-appreciated -- especially the issue noted at the end. As best I can tell, it is easier to understand and works better than the alternatives I've seen so far.
-2
import re
r=r"'(\\'|[^'])*(?<!\\)'|\"(\\\"|[^\"])*(?<!\\)\""

texts=[r'"aerrrt"',
r'"a\"e'+"'"+'rrt"',
r'"a""""arrtt"""""',
r'"aerrrt',
r'"a\"errt'+"'",
r"'aerrrt'",
r"'a\'e"+'"'+"rrt'",
r"'a''''arrtt'''''",
r"'aerrrt",
r"'a\'errt"+'"',
      "''",'""',""]

for text in texts:
     print (text,"-->",re.fullmatch(r,text))

results:

"aerrrt" --> <_sre.SRE_Match object; span=(0, 8), match='"aerrrt"'>
"a\"e'rrt" --> <_sre.SRE_Match object; span=(0, 10), match='"a\\"e\'rrt"'>
"a""""arrtt""""" --> None
"aerrrt --> None
"a\"errt' --> None
'aerrrt' --> <_sre.SRE_Match object; span=(0, 8), match="'aerrrt'">
'a\'e"rrt' --> <_sre.SRE_Match object; span=(0, 10), match='\'a\\\'e"rrt\''>
'a''''arrtt''''' --> None
'aerrrt --> None
'a\'errt" --> None
'' --> <_sre.SRE_Match object; span=(0, 2), match="''">
"" --> <_sre.SRE_Match object; span=(0, 2), match='""'>
 --> None

1 Comment

Can you explain in some detail how your answer works? I can't see how it relates to the question asked.
