Ruby regex to split text

Question

I am using the regex below to split text at certain ending punctuation marks; however, it doesn't work with quotes.

text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[.．.！!?？]/)

=> ["\"Hello my name is Kevin.", "\" How are you?"]

My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.

=> ["\"Hello my name is Kevin.\"", "How are you?"]

Is the input not valid English text, as in your example, or are you picking a bad example?
– fotanus, Commented Feb 17, 2014 at 13:39
@mbratch thanks for the link. However, text.split(/([.．.！!?？])\s+/) yields => ["\"Hello my name is Kevin.\" How are you?"], which is not what I am looking for.
– diasks2, Commented Feb 17, 2014 at 13:48
Your question is not clear. The result is exactly what you stated that you want; the string is split at a punctuation mark. What is the rule that makes you want otherwise?
– sawa, Commented Feb 17, 2014 at 13:50
Sorry, the "simpler form" I showed wasn't quite right. The longer form in the link works, and is similar to what Casimir gave for an answer.
– lurker, Commented Feb 17, 2014 at 13:50
@mbratch thanks, I'll check that out as well. Appreciate it!
– diasks2, Commented Feb 17, 2014 at 13:55

Casimir et Hippolyte · Accepted Answer · 2015-11-02 21:33:19Z

2

text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.．.！!?？]/)

The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it can deal with escaped quotes (see the note marked (*) at the end for a more efficient pattern).

pattern details:

"             # literal: a double quote
(?>           # open an atomic group: all that can be between quotes
    [^"\\]+   # all that is not a quote or a backslash
  |           # OR
    \\{2}     # 2 backslashes (the idea is to skip even numbers of backslashes)
  |           # OR
    \\.       # an escaped character (in particular a double quote)
)*            # repeat zero or more times the atomic group
"             # literal double quote
|             # OR
\S.*?[.．.！!?？]
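The breakdown above can also be kept inside the code itself. The following is a minimal sketch using Ruby's extended (/x) mode, which ignores whitespace in the pattern and allows # comments; the method name commented_pattern is only an illustrative assumption, and the pattern is otherwise the one from this answer.

```ruby
# Hypothetical helper name; the pattern is the same as above, laid out with /x.
def commented_pattern
  /
    "               # opening double quote
    (?>             # atomic group: what may sit between the quotes
        [^"\\]+     #   a run of characters that are neither quote nor backslash
      | \\{2}       #   two backslashes
      | \\.         #   a backslash plus the character it escapes
    )*              # repeated zero or more times
    "               # closing double quote
    |               # OR
    \S.*?[.．.！!?？] # a non-space character, then lazily up to ending punctuation
  /x
end

text = "\"Hello my name is Kevin.\" How are you?"
p text.scan(commented_pattern)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
```

The quoted alternative comes first, so a quotation is taken as one piece before the generic sentence alternative gets a chance to stop at the period inside it.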

To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:

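text.to_enum(:scan, /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.．.！!?？]/).map { $& }

where \1 is a backreference to the first capturing group (the quote that was found) and (?!\1) means "not followed by that same quote". Because this pattern contains a capturing group, a plain text.scan(...) would return the captured quotes instead of the whole matches, hence the to_enum(:scan, ...).map { $& } form.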
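To see the backreference at work on mixed quotes, here is a small sketch. The method names quote_aware_pattern and split_sentences and the sample string are made up for illustration; the pattern is the one from this answer. It also shows what scan returns when a pattern has a capturing group, and one way (a block reading Regexp.last_match) to collect whole matches instead.

```ruby
# Hypothetical helper names; the regex is the backreference variant from the answer.
def quote_aware_pattern
  /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.．.！!?？]/
end

# Collect the whole match for each hit rather than the captured quote.
def split_sentences(text)
  pieces = []
  text.scan(quote_aware_pattern) { pieces << Regexp.last_match(0) }
  pieces
end

sample = %q('Hello there.' "Say 'hi' now." Done!)

p sample.scan(quote_aware_pattern)
# => [["'"], ["\""], [nil]]   (only the captured group of each match)

p split_sentences(sample)
# => ["'Hello there.'", "\"Say 'hi' now.\"", "Done!"]
```

Inside the double-quoted part, the single quotes around hi are accepted by (?!\1)["'] because they differ from the opening quote, so the quotation is not cut short.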

(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
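As a quick sanity check, the sketch below runs both forms of the quoted subpattern on a sample that contains escaped quotes. The method names and the sample string are assumptions for illustration only; the snippet checks that the two forms agree on this input and does not measure speed.

```ruby
# Hypothetical helper names wrapping the two forms given in the answer.
def atomic_pattern
  /"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.．.！!?？]/
end

def unrolled_pattern
  # *+ is a possessive quantifier: it never gives back what it matched.
  /"[^"\\]*+(?:\\.[^"\\]*)*+"|\S.*?[.．.！!?？]/
end

# %q keeps the backslashes, so the string really contains \" sequences.
sample = %q{"She said \"stop.\" twice." Then left.}

p sample.scan(atomic_pattern)
p sample.scan(unrolled_pattern)
# both print ["\"She said \\\"stop.\\\" twice.\"", "Then left."]
```

Neither form contains a capturing group, so scan returns the whole matches directly.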

edited Nov 2, 2015 at 21:33

answered Feb 17, 2014 at 13:40


2 Comments

aelor Over a year ago

Can you explain this regex, please? It would be really helpful for noobs like us.

diasks2 Over a year ago

Thanks! If I also wanted to include single quotes, is there a more succinct way to write it than just having another "or" alternative?

falsetru · Accepted Answer · 2014-02-17 14:21:36Z

1

Add an optional trailing quote (["']?) to the pattern:

text.scan(/\S.*?[.．.！!?？]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
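As a rough comparison (a sketch, not part of either answer), the snippet below runs this pattern and the quoted-first pattern from the other answer on a made-up sample whose quotation contains two sentences. The method names and the sample string are assumptions for illustration.

```ruby
# Hypothetical helper names; each wraps a pattern quoted in this thread.
def trailing_quote_pattern
  /\S.*?[.．.！!?？]["']?/
end

def quoted_first_pattern
  /"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.．.！!?？]/
end

sample = %q{"Stop. Go!" he said.}

p sample.scan(trailing_quote_pattern)
# => ["\"Stop.", "Go!\"", "he said."]

p sample.scan(quoted_first_pattern)
# => ["\"Stop. Go!\"", "he said."]
```

The optional-quote pattern is shorter and handles the example in the question, but it still splits at punctuation inside a quotation, whereas the quoted-first pattern keeps the quotation whole. Which behaviour is preferable depends on the text being split.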

answered Feb 17, 2014 at 14:21
