Here are my test strings:

  • Word word word; 123-125
  • Word word (1000-1000)
  • Word word word (1000-1000); 99-999
  • Word word word word

What regular expression should I use to extract only those numbers (format: \d+-\d+) that are not within brackets (here, 123-125 and 99-999)?

I've tried this:

(\d+-\d+)(?!\))

But it's matching:

  • Word word word; 123-125 → 123-125
  • Word word (1000-1000) → 1000-100
  • Word word word (1000-1000); 99-999 → 1000-100 and 99-999
  • Word word word word → (no match)

Note that only the last digit before the closing bracket is left out.

I was trying to drop any match that is followed by a bracket, but it's only dropping one digit rather than the whole match! What am I missing here?

Any help will be greatly appreciated.
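
The behavior described above can be reproduced with a short sketch. It assumes Python's standard re module, and the variable names are only illustrative. The look-ahead is evaluated after \d+-\d+ has already matched, so when it fails the engine gives back one digit and tries again, which then succeeds because the next character is a digit rather than a bracket.

```python
import re

test_str = (
    "Word word word; 123-125\n"
    "Word word (1000-1000)\n"
    "Word word word (1000-1000); 99-999\n"
    "Word word word word"
)

# The attempted pattern: (?!\)) is only checked once \d+-\d+ has matched.
attempt = re.compile(r'(\d+-\d+)(?!\))')

# When ')' follows, the greedy second \d+ backtracks by one digit; the
# look-ahead then sees a digit instead of ')' and the shorter match succeeds.
attempt_matches = attempt.findall(test_str)
print(attempt_matches)  # ['123-125', '1000-100', '1000-100', '99-999']
```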


2 Answers

You can use a negative look-ahead to get only those values you need like this:

(?![^()]*\))(\d+-\d+)

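The (?![^()]*\)) look-ahead fails at any position from which a closing round bracket can be reached without crossing another ( or ), i.e. at any position inside a bracketed group. Because it is placed before (\d+-\d+), it is tested at the start of the number, so backtracking cannot shorten the match to get around it.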

Sample code:

import re
# Python 3: the ur'' prefix no longer exists, a plain raw string is enough
p = re.compile(r'(?![^()]*\))(\d+-\d+)')
test_str = "Word word word; 123-125\nWord word (1000-1000)\nWord word word (1000-1000); 99-999\nWord word word word"
print(p.findall(test_str))

Output of the sample program:

['123-125', '99-999']
4 Comments

Thank you so much for your complete answer! Much appreciated. Could you explain what is the function of the following part, please? [^()]*
[^()]* is a negated character class that means match any number of characters other than literal ( and ).
Thanks for clarifying that, @stribizhev. What if I wanted to exclude \d+-\d+ that come after a specific string, like ABC 1234-1234, or ABC: 1234-1234? I tried this: (?<!(ABC |ABC: ))(?![^()]*\))(\d+-\d+) but that didn't work at all. The error I got was: look-behind requires fixed-width pattern.
It is not possible to use variable-width look-behind with the standard re library. However, in this case, it is possible to use two fixed-width look-behinds: (?![^()]*\))(?<!ABC )(?<!ABC: )(\d{4}-\d+). Mind that here we need to fix the width of the first number ({4} for the 1234-1234 sample), or we'll still get a match that simply starts at a later digit (e.g. 234-1234).
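
As a sketch of the look-behind workaround from the comments, the version below swaps the fixed {4} width for an extra (?<!\d) look-behind, so the match cannot start in the middle of a number of any length. That swap is an assumption of this example, not part of the original comment, and the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: two prefixed ranges, one bracketed range, one plain range.
sample = "ABC 1234-1234; ABC: 5678-5678; (1000-1000); 99-999"

# Each look-behind has a fixed width, which is what the re module requires.
# (?<!\d) prevents a match from starting after the first digit of a number.
filtered = re.compile(r'(?![^()]*\))(?<!ABC )(?<!ABC: )(?<!\d)(\d+-\d+)')

filtered_matches = filtered.findall(sample)
print(filtered_matches)  # ['99-999']
```

Without (?<!\d), the first range would still produce 234-1234, because the position after its first digit is not preceded by "ABC ".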
Another approach is to describe everything you don't want:

[^(\d]*(?:\([^)]*\)[^(\d]*)*

Then you can rely on a statement that is always true: a number is always preceded by zero or more characters that are neither digits nor an opening bracket, interleaved with zero or more bracketed groups.

You only need to capture the digits in a group:

import re
p = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')

The advantage of this approach is that you don't need to test a look-ahead at each position in the string, so it is a fast pattern. The drawback is that it consumes a little more memory, because each overall match is a longer string.
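
A minimal usage sketch for this pattern, assuming the same test strings as in the question (the variable names are illustrative). Since the pattern has a single capture group, findall returns only the captured numbers, while group(0) of each match holds everything that was consumed on the way, which is the extra memory mentioned above.

```python
import re

test_str = (
    "Word word word; 123-125\n"
    "Word word (1000-1000)\n"
    "Word word word (1000-1000); 99-999\n"
    "Word word word word"
)

# Consume everything unwanted (non-digits and bracketed groups), then capture.
consume = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')

# With one capture group, findall returns the group contents only.
numbers = consume.findall(test_str)
print(numbers)  # ['123-125', '99-999']

# The overall match is longer than the captured number it ends with.
for m in consume.finditer(test_str):
    print(len(m.group(0)), len(m.group(1)))
```

This sketch covers only the sample input. A bracketed group that contains a lone digit before the range, such as (5 12-34), would need a stricter pattern.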
