I need to split a paragraph into sentences, and that's where I got a bit confused by the regex.

I have already looked at the question this one is marked as a duplicate of, but my issue is different.

Here is an example of the string I need to split:

hello! how are you? how is life
live life, live free. "isnt it?"

Here is the code I tried:

$sentence_array = preg_split('/([.!?\r\n|\r|\n])+(?![^"]*")/', $paragraph, -1);

What I need is:

array (  
  [0] => "hello"  
  [1] => "how are you"  
  [2] => "how is life"  
  [3] => "live life, live free"  
  [4] => ""isnt it?""  
)

What I get is:

array(
  [0] => "hello! how are you? how is life live life, live free. "isnt it?""
)

When the string does not contain any quotes, the split works as required.

Any help is appreciated. Thank you.

  • Possible duplicate of Explode a paragraph into sentences in PHP Commented Sep 28, 2018 at 8:10
  • You might try something like '~"[^"]*"(*SKIP)(*F)|\s*[.!?\r\n]\s*~', see demo. Commented Sep 28, 2018 at 8:11
  • @H2ONOCK I had seen that one, but my issue here is specific and different. I have the split working fine without quotation marks. Commented Sep 28, 2018 at 8:14

2 Answers


Your regular expression has several problems, the main one being that it confuses group constructs with character classes. Inside a character class, a pipe | is a literal | character; it has no special meaning there.
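The difference between a class and a group can be checked directly. The following is only a small sketch; the variable names and the 'one|two' test string are made up for illustration and do not come from the question.

```php
<?php
// Inside [...] the pipe is an ordinary character, so the class matches a literal "|".
$classMatch = preg_match('/^[a|b]$/', '|');

// Inside (...) the pipe separates alternatives, so "|" itself does not match.
$groupMatch = preg_match('/^(a|b)$/', '|');

// The character class from the question therefore also splits on a literal pipe.
$parts = preg_split('/[.!?\r\n|\r|\n]+/', 'one|two');

var_dump($classMatch, $groupMatch);
print_r($parts);
```

Here the class matches the pipe and the group does not, and 'one|two' is split into two pieces even though it contains no punctuation or line break.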

What you need is this:

("[^"]*")|[!?.]+\s*|\R+

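This first tries to match a string enclosed in double quotation marks and captures it, quotes included, so that PREG_SPLIT_DELIM_CAPTURE returns it as an element of its own. Failing that, it matches a run of punctuation marks from the set [!?.] plus any trailing whitespace and splits there. Failing that, it matches a run of line breaks of any kind (\R).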

PHP:

var_dump(preg_split('~("[^"]*")|[!?.]+\s*|\R+~', <<<STR
hello! how are you? how is life
live life, live free. "isnt it?"
STR
, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));

Output:

array(5) {
  [0]=>
  string(5) "hello"
  [1]=>
  string(11) "how are you"
  [2]=>
  string(11) "how is life"
  [3]=>
  string(20) "live life, live free"
  [4]=>
  string(10) ""isnt it?""
}

5 Comments

That's a great solution... Thank you :)
I have a small doubt here. Can I use it for all UTF-8 strings? When I used it on Hindi text, it produces a � character when it finds अ and splits right there. Any idea?
Try enabling the u flag, i.e. ~("[^"]*")|[!?.]+\s*|\R+~u
You just saved a day for me... you are a genius! :)
I have a small question here. How do we ignore parentheses ( ) as well? I tried adding |(\([^"]*\)), but it didn't help.
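On the UTF-8 question in the comments above, a likely explanation is that without the u flag PCRE reads the subject byte by byte, and in that mode \R also matches the single byte 0x85 (NEL). The UTF-8 encoding of अ (U+0905) is E0 A4 85, so the split can land inside the character and leave an invalid byte sequence behind. The sketch below compares both modes; the Hindi sample string and the helper name isValidUtf8 are assumptions made for illustration.

```php
<?php
$pattern = '~("[^"]*")|[!?.]+\s*|\R+~';
$flags   = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
$text    = "अब चलो. ठीक है"; // hypothetical sample text

// Hypothetical helper: an empty pattern with the u flag only matches valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Byte mode: the pattern is applied to raw bytes.
$byteMode = preg_split($pattern, $text, -1, $flags);

// UTF-8 mode: the u flag makes PCRE work on whole code points.
$utf8Mode = preg_split($pattern . 'u', $text, -1, $flags);

var_dump(array_map('isValidUtf8', $byteMode));
var_dump(array_map('isValidUtf8', $utf8Mode));
print_r($utf8Mode);
```

With the u flag every returned piece should be valid UTF-8; any piece reported as invalid in byte mode is where a multibyte character was cut.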

I view your problem of splitting on certain punctuation as already solved, except that it fails when double quotes are involved. We can phrase a solution as saying that we should split when seeing such punctuation, or when seeing this punctuation followed by a double quote.

The split should happen when the previous character matches one of your markers and what follows is not a double quote, or the previous two characters should be a marker and a double quote. This implies splitting on the following pattern, which uses lookarounds:

(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)

Code sample:

$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$sentence_array = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]\")(?=.)/', $input, -1);
print_r($sentence_array);

Array ( [0] => hello! [1] => how "are" "you?" [2] => how is life
    [3] => live life, live free. [4] => "isnt it?" )

3 Comments

OP wants to match sentence-ending punctuation outside of double quotes. Lookarounds won't help; your solution will fail in many cases.
I have a small issue here: the \r\n, \r and \n are still in the strings. Everything else is great. Thank you, Tim.
I don't have a fix for that. I can only suggest removing them afterwards. The lookaround trick I used does not consume anything, this is why it leaves the double quotes untouched. But this also means that newlines/carriage returns would also not be removed.
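The "remove them afterwards" suggestion could look roughly like the sketch below. It is one possible cleanup step, not part of the original answer: it trims each piece and drops any piece that ends up empty (which can happen when \r\n is split between the two characters). The shortened input string and the variable names are assumptions for illustration.

```php
<?php
$input = "hello! how is life\nlive life, live free. \"isnt it?\"";

// Zero-width split from the answer: nothing is consumed, so whitespace and
// line breaks stay attached to the pieces.
$pieces = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)/', $input);

// Trim spaces and line breaks from each piece, then discard empty pieces
// and reindex the array.
$sentences = array_values(array_filter(array_map('trim', $pieces), 'strlen'));

print_r($sentences);
```

The punctuation itself stays in the sentences, as in the answer's output; only the surrounding whitespace and line breaks are removed.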
