
I am using the regex below to split a text at sentence-ending punctuation, but it doesn't work with quotes.

text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[...!!??]/)

=> ["\"Hello my name is Kevin.", "\" How are you?"]

My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.

=> ["\"Hello my name is Kevin.\"", "How are you?"]
  • Is the input not valid English text, like your example, or did you just pick a bad example? Commented Feb 17, 2014 at 13:39
  • @mbratch thanks for the link. However, text.split(/([...!!??])\s+/) yields => ["\"Hello my name is Kevin.\" How are you?"] which is not what I am looking for. Commented Feb 17, 2014 at 13:48
  • Your question is not clear. The result is exactly what you stated that you want; the string is split at a punctuation. What is the rule that makes you want otherwise? Commented Feb 17, 2014 at 13:50
  • Sorry, the "simpler form" I showed wasn't quite right. The longer form in the link works, and is similar to what Casimir gave for an answer. Commented Feb 17, 2014 at 13:50
  • @mbratch thanks, I'll check that out as well. Appreciate it! Commented Feb 17, 2014 at 13:55

2 Answers

text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[...!!??]/)

The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it can deal with escaped quotes (* see the end for a more efficient pattern).

pattern details:

"             # literal: a double quote
(?>           # open an atomic group: all that can be between quotes
    [^"\\]+   # all that is not a quote or a backslash
  |           # OR
    \\{2}     # 2 backslashes (the idea is to skip even numbers of backslashes)
  |           # OR
    \\.       # an escaped character (in particular a double quote)
)*            # repeat zero or more times the atomic group
"             # literal double quote
|             # OR
\S.*?[...!!??]
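The breakdown above can be sketched as a runnable pattern using Ruby's extended mode (the `x` flag), which ignores whitespace in the pattern and allows inline comments. The constant name `QUOTED_OR_SENTENCE` is made up for illustration, and the punctuation class is written as `[.!?]`, which matches the same ASCII characters as the class in the post.

```ruby
# Extended-mode version of the pattern; QUOTED_OR_SENTENCE is a hypothetical name.
QUOTED_OR_SENTENCE = /
  "                 # opening double quote
  (?>               # atomic group: what may appear between the quotes
      [^"\\]+       #   a run of characters that are neither quote nor backslash
    | \\{2}         #   an escaped backslash
    | \\.           #   any other escaped character, such as an escaped quote
  )*                # repeat the atomic group zero or more times
  "                 # closing double quote
  |                 # otherwise
  \S.*?[.!?]        # a sentence: non-space start, lazily up to the first . ! or ?
/x

text = "\"Hello my name is Kevin.\" How are you?"
p text.scan(QUOTED_OR_SENTENCE)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
```

Because the quoted alternative comes first, a quoted part is consumed as one match before the sentence alternative gets a chance to stop at the period inside it.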

To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:

text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[...!!??]/)

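where \1 is a backreference to the first capturing group (the quote that was found) and (?!\1) means "not followed by the found quote". Note that because this pattern contains a capturing group, scan returns the captured groups instead of the whole matches; to get the whole matches, pass a block to scan and read $& inside it.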
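When a pattern contains a capturing group, `String#scan` returns one array of captures per match rather than the matched text. A minimal sketch of one way to collect the whole matches with the backreference pattern follows; the helper `whole_matches` and the constant `QUOTE_AWARE` are hypothetical names, and the punctuation class is written as `[.!?]`.

```ruby
# Backreference pattern handling both quote styles; QUOTE_AWARE is a hypothetical name.
QUOTE_AWARE = /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.!?]/

# Hypothetical helper: collect the full text of every match,
# ignoring what the capturing group holds.
def whole_matches(text, pattern)
  matches = []
  text.scan(pattern) { matches << Regexp.last_match(0) }
  matches
end

text = %q{'Hi there.' "Bye now!" Done.}

p text.scan(QUOTE_AWARE)
# => [["'"], ["\""], [nil]]

p whole_matches(text, QUOTE_AWARE)
# => ["'Hi there.'", "\"Bye now!\"", "Done."]
```

The first call shows only the captured quote characters (or `nil` when the sentence alternative matched); the block form recovers the full strings.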

(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
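As a quick sanity check, the two forms can be compared side by side. This is only a sketch: the constant names `ATOMIC` and `POSSESSIVE` are invented here, the punctuation class is written as `[.!?]`, and the code checks only that both patterns agree on two inputs, not which one is faster.

```ruby
# Atomic-group form from the top of the answer.
ATOMIC     = /"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.!?]/
# Possessive-quantifier form: plain characters first, then
# (escaped character + plain characters) repeated.
POSSESSIVE = /"[^"\\]*+(?:\\.[^"\\]*)*+"|\S.*?[.!?]/

plain   = "\"Hello my name is Kevin.\" How are you?"
escaped = '"She said \"hi.\" to me." Next one!'

[plain, escaped].each do |text|
  p text.scan(ATOMIC) == text.scan(POSSESSIVE)
end
# prints true twice
```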


2 Comments

can you explain this regex please. It will be really great for noobs like us
Thanks! If I also wanted to include single quotes, is there a more succinct way to write it than just having another or statement?

Add an optional quote (["']?) to the pattern:

text.scan(/\S.*?[...!!??]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
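The optional-quote pattern still stops at the first ending punctuation, so it may behave differently from a quote-aware pattern when a quoted part contains more than one sentence. The sketch below illustrates this with a made-up input; `OPTIONAL_QUOTE` and `QUOTED_FIRST` are hypothetical names, and the punctuation class is written as `[.!?]`.

```ruby
# Pattern from this answer: a sentence plus an optional trailing quote.
OPTIONAL_QUOTE = /\S.*?[.!?]["']?/
# Quote-aware pattern from the other answer, for comparison.
QUOTED_FIRST   = /"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.!?]/

tricky = "\"Wait. Stop!\" he said."

p tricky.scan(OPTIONAL_QUOTE)
# => ["\"Wait.", "Stop!\"", "he said."]

p tricky.scan(QUOTED_FIRST)
# => ["\"Wait. Stop!\"", "he said."]
```

Which result is preferable depends on whether a quotation should be kept as one unit or split into its own sentences.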
