19

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like

"\. A-Z"

However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter begins a month name, like January | February | March.

I tried implementing the first half, but even this did not work. My code was:

"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
1
  • I suggest replacing the space with \s+ (or \s if it really needs to be exactly one space). And matching an uppercase letter would be [A-Z] (you forgot the brackets). Commented Oct 2, 2012 at 11:13

5 Answers

24

First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).

Second, to match an uppercase letter you have to use [A-Z]; a bare A-Z only matches the literal characters A, - and Z (and remember there may be other uppercase letters than A-Z ...).

Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The thing is that if the dot is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of the alternatives will match, since Abs and S are mutually exclusive.
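
The effect described above can be checked directly. The following is a small sketch, not the asker's exact code: the alternation is rewritten with balanced parentheses and [A-Z], and the sample text is made up for illustration.

```python
import re

# The asker's idea with valid syntax: two alternatives, each guarded
# by only ONE of the negative lookbehinds (an illustrative rewrite).
either = re.compile(r"(?<!Abs)\. [A-Z]|(?<!S)\. [A-Z]")

# Both lookbehinds chained in front of a single pattern.
both = re.compile(r"(?<!Abs)(?<!S)\. [A-Z]")

# Made-up sample text with one "Abs." and one "S." that should be skipped.
text = "See Abs. Two and S. Three. Next sentence."

either_hits = [m.start() for m in either.finditer(text)]
both_hits = [m.start() for m in both.finditer(text)]

print("alternation:", either_hits)  # also reports the dots after Abs and S
print("chained:    ", both_hits)    # reports only the dot after "Three"
```

With the alternation, whichever lookbehind would reject a dot is bypassed by the other alternative, so nothing is filtered out.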

The pattern for the first part of your question could be

(?<!Abs)(?<!S)(\. [A-Z])

or

(?<!Abs)(?<!S)(\.\s+[A-Z])

(with my suggestion)

That is because you have to avoid |: without it, the expression says "not preceded by Abs and not preceded by S". If both are true, the pattern matcher will continue to scan the string and find your match.

To exclude the month names I came up with this regular expression:

(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same arguments hold for the negative look ahead patterns.
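
To see which dots this pattern accepts, here is a minimal sketch using re.finditer. The pattern is copied from above; the sample text is an assumption made up for the demonstration.

```python
import re

# Pattern copied from the answer; group 1 captures the dot and whitespace.
pattern = re.compile(r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]")

# Made-up text: a month name after a dot, "Abs.", "S." and a double space.
text = ("It rained. March was dry. See Abs. Two. "
        "Johann S. Bach left.  Then it ended.")

# Each match spans the dot, the whitespace and the following capital letter.
boundaries = [m.group(0) for m in pattern.finditer(text)]
print(boundaries)
```

Note that the pattern consumes the capital letter, so for re.split a lookahead such as (?=[A-Z]) is the safer choice. Also, (?!March) rejects any word that merely starts with "March"; appending \b to the month names would be one way to tighten that.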

2 Comments

I'm new to multiple look-behinds. It looks like (?<!Abs)(?<!S) does the same as (?<!Abs|S). Is there any advantage to either (beyond personal preference on brevity/readability)?
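@jhiro009 Yes. When you combine them with the OR (pipe) operator inside a single lookbehind, the engine (Python's re module here) requires every alternative to have the same fixed width, so Abs (three characters) and S (one character) cannot share one lookbehind. You'd have to use the chained form in this situation.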
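
A quick way to check the point made in this comment, assuming the standard-library re module (other engines may behave differently):

```python
import re

# Alternatives of different lengths inside ONE lookbehind are rejected
# by the standard-library re module when the pattern is compiled.
try:
    re.compile(r"(?<!Abs|S)\.")
    combined_ok = True
except re.error as exc:
    combined_ok = False
    print("re rejected the pattern:", exc)

# One lookbehind per prefix compiles and filters both prefixes.
chained = re.compile(r"(?<!Abs)(?<!S)\.")
dots = chained.findall("Abs. S. Ok.")
print(dots)  # only the dot after "Ok" survives
```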
10

I'm adding a short answer to the question in the title, since this is at the top of Google's search results:

The way to use multiple negative lookbehinds of different lengths is to chain them together like this:

"(?<!1)(?<!12)(?<!123)example"

This would match example, 2example and 3example, but not 1example, 12example or 123example.
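
The claim can be verified with a short sketch; the pattern and the six sample strings are taken from this answer.

```python
import re

pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]

# search() looks for "example" anywhere in the string, subject to
# all three lookbehinds holding at the position where it starts.
results = {s: bool(pattern.search(s)) for s in samples}

for s, found in results.items():
    print(f"{s!r}: {'match' if found else 'no match'}")
```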

1

Use the NLTK Punkt tokenizer. It's probably more robust than using a regex.

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

1

Use nltk or similar tools as suggested by @root.

To answer your regex question:

import re
import sys

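# Split at ". " unless preceded by Abs or S, or followed by a month name;
# the lookahead (?=[A-Z]) keeps the capital letter in the next sentence.
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))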

Input

First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth

Output

['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
 'S. Sixth', 'ABs', 'Eighth']

-2

You can use a character class [] when every excluded prefix is a single character.

'(?<![123])example'

This would not match the example in 1example, 2example or 3example.
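
One reading of this answer is that a character class inside the lookbehind can cover several single-character prefixes at once. A minimal sketch of that idea, assuming the literal example belongs after the lookbehind rather than inside it, and that the class lists the digits without commas:

```python
import re

# A single lookbehind with a character class: one character wide,
# so it only works when every excluded prefix is a single character.
pattern = re.compile(r"(?<![123])example")

samples = ["example", "4example", "1example", "2example", "3example"]
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

For prefixes of different lengths, such as Abs and S, this shortcut does not apply and separate lookbehinds are still needed.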
