
This question is a continuation of my previous question:

RegEx to exclude academic title

I want to split a paragraph string into an array of sentences using a regular expression that splits on the dot character (.). My next problem is with numbers that contain dots.

Here is an example:

In this year 2013. Hello Mr. Andre, your money is Rp 40.000.

The correct output should be:

Array ( [0] => In this year 2013 [1] => Hello Mr. Andre, your money is Rp 40.000 )

The title problem (Mr.) was already solved in my previous question. I've tried adding a regex for numbers, but it still doesn't work.

My code, which does not work:

$titles_number=array('(^[0-9]*)','(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles_number).')\./',$text);
print_r($sentences);

Can I do this in one go (one regex that solves both problems)? Tell me if it can't be done. Thanks in advance.

  • Have you tried using the building blocks (?<!\d) and (?!\d), i.e. a negative lookbehind and a negative lookahead for digits? Commented May 2, 2013 at 1:50
  • Although I don't have an answer for you, the site www.regexpal.com is a great way to test regular expressions. It's JavaScript-based, so it updates in real time. I use it a lot. Commented May 2, 2013 at 1:55
  • Thanks for the comment, still trying. regex101.com is worth a try too :D Commented May 2, 2013 at 2:02

2 Answers

1

This will be easier to accomplish with preg_match_all():

preg_match_all(
    '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
    $subject, $result, PREG_PATTERN_ORDER);
print_r($result[0]);

explanation:

  • [^\s.] matches the next non-whitespace character (i.e., skip over any whitespace between sentences)
  • [^.]* gobbles up any non-dot characters
  • \. matches a dot IF...
  • (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
  • (?=\d) ...or part of a number
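
As a rough check, the pattern can be run against the sample paragraph from the question, and against a variant with the space after "2013." missing. The helper name split_sentences is made up for this sketch, and the trailing dot is stripped only to match the output format the question asks for.

```php
<?php
// Hypothetical helper (not part of the original answer): wraps the pattern above.
function split_sentences(string $text): array
{
    $pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';
    preg_match_all($pattern, $text, $result, PREG_PATTERN_ORDER);

    // Every match still ends with the sentence-final dot; drop that last
    // character to get the format shown in the question.
    return array_map(function ($sentence) {
        return substr($sentence, 0, -1);
    }, $result[0]);
}

$text = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';
$typo = 'In this year 2013.Hello Mr. Andre, your money is Rp 40.000.';

print_r(split_sentences($text));
print_r(split_sentences($typo));
```

Both calls should give the same two sentences, because the pattern never relies on whitespace after the dot.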

notes:

  1. (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).

  2. \.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.

  3. If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.
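
To illustrate note 2, here is a small comparison of the two digit checks. The sample string is invented for this sketch: it has a missing space before a sentence that starts with a digit, which is the case where the looser \.(?=\d) test is fooled.

```php
<?php
// Illustrative comparison only; $sample is a made-up input.
$honorifics = '(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)';

// Loose: a dot followed by a digit counts as part of a number.
$loose = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=\d))[^.]*)*\./';

// Strict: the dot must have a digit on both sides.
$strict = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=(?<=\d\.)\d))[^.]*)*\./';

$sample = 'Done.5 items cost Rp 40.000.';

preg_match_all($loose, $sample, $looseResult);
preg_match_all($strict, $sample, $strictResult);

print_r($looseResult[0]);  // the whole string as a single sentence
print_r($strictResult[0]); // "Done." and "5 items cost Rp 40.000."
```

The strict form still keeps "40.000" together, since that dot sits between two digits.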


1 Comment

Wow, this is a complete answer, and it gets rid of the typo problem too. Thanks, master :D
0

You can avoid the number problem (and probably others) if you notice that each dot at the end of a sentence is followed by a space/tab/newline or by the end of the string:

$titles=array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles).')\.(?=\s|$)/',$text);
print_r($sentences);
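
A minimal runnable sketch of this approach on the sample paragraph from the question. The PREG_SPLIT_NO_EMPTY flag and the trim() call are additions for this sketch, not part of the snippet above: without them, preg_split() returns an empty string after the final dot and leaves a leading space on the second sentence.

```php
<?php
$text = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

$titles  = array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$pattern = '/(' . implode('', $titles) . ')\.(?=\s|$)/';

// Split on sentence-final dots, drop empty pieces, then trim whitespace.
$sentences = array_map('trim', preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY));
print_r($sentences);

// The approach depends on whitespace after the dot, so a missing space
// keeps two sentences glued together.
$typo  = 'In this year 2013.Hello Mr. Andre, your money is Rp 40.000.';
$glued = array_map('trim', preg_split($pattern, $typo, -1, PREG_SPLIT_NO_EMPTY));
print_r($glued);
```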

1 Comment

Wow, nice idea. But this does not work when there is a typo like this: "In this year 2013.Hello Mr. Andre, your money is Rp 40.000." Overall, thank you for the answer :D
