RegEx to exclude number using PHP

Question

This question is a continuation of my previous question:

RegEx to exclude academic title

I want to split a paragraph string into an array of sentences, using a regular expression that splits on the dot character (.). The next problem is with numbers.

Here is an example:

In this year 2013. Hello Mr. Andre, your money is Rp 40.000.

The correct output would be:

Array
(
    [0] => In this year 2013
    [1] => Hello Mr. Andre, your money is Rp 40.000
)

The title problem (Mr.) was already solved in my previous question. I've tried adding a regex for numbers, but it still doesn't work.

My non-working code:

$titles_number=array('(^[0-9]*)','(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles_number).')\./',$text);
print_r($sentences);

Can I do this in one go (one regex that handles both problems)? Tell me if it can't be done. Thanks in advance.

Have you tried the building blocks (?<!\d) and (?!\d), a negative lookbehind and a negative lookahead for digits?
– Patashu, Commented May 2, 2013 at 1:50
Although I don't have an answer for you, the site www.regexpal.com is a great way to test regular expressions. It's JavaScript-based, so it updates in real time. I use it a lot.
– blainarmstrong, Commented May 2, 2013 at 1:55
Thanks for the comment, still trying. regex101.com is worth a try too :D
– andrefadila, Commented May 2, 2013 at 2:02

Alan Moore · Accepted Answer · 2013-05-02 04:48:38Z

1

This will be easier to accomplish with preg_match_all():

preg_match_all(
    '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
    $subject, $result, PREG_PATTERN_ORDER);
print_r($result[0]);

explanation:

[^\s.] matches the next non-whitespace character (i.e., skip over any whitespace between sentences)
[^.]* gobbles up any non-dot characters
\. matches a dot IF...
(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
(?=\d) ...or part of a number
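As a quick check of the explanation above, here is a minimal sketch that runs the same pattern against the sentence from the question and against a variant with the space after the first dot missing. The variable names and the dot-stripping step are assumptions added for illustration; only the pattern comes from the answer.

```php
<?php
// Pattern copied from the answer above; the inputs are illustrative.
$pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

$clean = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';
$typo  = 'In this year 2013.Hello Mr. Andre, your money is Rp 40.000.';

preg_match_all($pattern, $clean, $cleanResult);
preg_match_all($pattern, $typo, $typoResult);

// Every match keeps its closing dot, so drop the last character
// to get the output format asked for in the question.
$sentences = array_map(
    function ($s) { return substr($s, 0, -1); },
    $cleanResult[0]
);

print_r($sentences);
// [0] => In this year 2013
// [1] => Hello Mr. Andre, your money is Rp 40.000
```

Both inputs produce the same two matches, because the pattern skips whitespace between sentences instead of requiring it.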

notes:

(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).
\.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.
If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.
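To illustrate the second note, the sketch below compares the plain (?=\d) check with the stricter (?=(?<=\d\.)\d) variant. The sample text and the variable names are made up for the comparison; the input has a missing space before a sentence that starts with a digit, which is the case where the two checks differ.

```php
<?php
// Shared honorific lookbehind, as in the pattern above.
$honorifics = '(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)';

// Loose: a dot followed by a digit counts as part of a number.
$loose  = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=\d))[^.]*)*\./';
// Strict: the dot must have a digit on both sides.
$strict = '/[^\s.][^.]*(?:\.(?:' . $honorifics . '|(?=(?<=\d\.)\d))[^.]*)*\./';

// Hypothetical input: missing space, next sentence starts with a digit.
$text = 'See you at noon.3 people came.';

preg_match_all($loose, $text, $looseResult);
preg_match_all($strict, $text, $strictResult);

print_r($looseResult[0]);  // one match: the dot before "3" is taken as a decimal point
print_r($strictResult[0]); // two matches: "See you at noon." and "3 people came."

// A real number such as 40.000 is still kept together by the strict form.
preg_match_all($strict, 'Your money is Rp 40.000.', $moneyResult);
```

Whether the stricter check is worth it depends on how often sentences in the input start with a digit right after a dot.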


1 Comment

andrefadila Over a year ago

Wow, this is a complete answer, and it handles the typo problem too. Thanks, master.. :D

Casimir et Hippolyte · Accepted Answer · 2013-05-02 04:19:26Z

0

You can avoid the number problem (and probably others) if you notice that each dot at the end of a sentence is followed by a space/tab/newline or by the end of the string:

$titles=array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles).')\.(?=\s|$)/',$text);
print_r($sentences);
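Here is a small sketch of how this split behaves on the sentence from the question. The helper name split_sentences, the PREG_SPLIT_NO_EMPTY flag, and the trim step are additions for illustration, not part of the snippet above: without them the result contains a trailing empty string and a leading space on the second sentence.

```php
<?php
$titles  = array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$pattern = '/(' . implode('', $titles) . ')\.(?=\s|$)/';

// Hypothetical helper: split, drop empty pieces, trim whitespace.
function split_sentences($pattern, $text)
{
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $parts);
}

$clean = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';
$typo  = 'In this year 2013.Hello Mr. Andre, your money is Rp 40.000.';

$cleanParts = split_sentences($pattern, $clean);
$typoParts  = split_sentences($pattern, $typo);

print_r($cleanParts);
// [0] => In this year 2013
// [1] => Hello Mr. Andre, your money is Rp 40.000

print_r($typoParts);
// A single element: the dot in "2013.Hello" is not followed by
// whitespace, so no split happens there.
```

The approach relies on whitespace after every sentence-ending dot, so it only fits input where that spacing is reliable.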


1 Comment

andrefadila Over a year ago

Wow, nice idea. But this doesn't work when there's a typo like this: "In this year 2013.Hello Mr. Andre, your money is Rp 40.000." Overall, thank you for the answer :D
