How to extract numbers from string using regular expression in Python except when within brackets?

Question

Here are my test strings:

Word word word; 123-125
Word word (1000-1000)
Word word word (1000-1000); 99-999
Word word word word

What regular expression should I use to extract only those numbers (format: \d+-\d+) that are not within brackets (that is, 123-125 and 99-999 above)?

I've tried this:

(\d+-\d+)(?!\))

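But it's matching the following (the matched text is shown after each arrow):

Word word word; 123-125  ->  123-125
Word word (1000-1000)  ->  1000-100
Word word word (1000-1000); 99-999  ->  1000-100 and 99-999
Word word word word  ->  no match

Note that only the last digit before the closing bracket is dropped.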

I was trying to drop any match that is followed by a bracket, but it's only dropping one digit rather than the whole match! What am I missing here?

Any help will be greatly appreciated.

Wiktor Stribiżew · Accepted Answer · 2015-04-28 10:41:58Z

6

You can use a negative look-ahead to get only those values you need like this:

(?![^()]*\))(\d+-\d+)

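The (?![^()]*\)) look-ahead checks that, from the current position, no closing round bracket can be reached without first passing another bracket. In other words, it rejects any position that sits inside a (...) pair, assuming the brackets are balanced and not nested.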


Sample code:

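import re

# Compile the pattern as a raw string so the backslashes reach the regex engine unchanged
p = re.compile(r'(?![^()]*\))(\d+-\d+)')
test_str = "Word word word; 123-125\nWord word (1000-1000)\nWord word word (1000-1000); 99-999\nWord word word word"
# findall returns the contents of the single capture group for every match
print(p.findall(test_str))

Output of the sample program:

['123-125', '99-999']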
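For comparison, here is a minimal sketch that runs the pattern from the question next to the one above on the same test strings. It suggests why the original attempt only loses one digit: its look-ahead is tested after \d+-\d+ has already matched, so the engine can backtrack and give up the last digit until the next character is no longer a closing bracket. The variable names are illustrative only.

```python
import re

test_str = (
    "Word word word; 123-125\n"
    "Word word (1000-1000)\n"
    "Word word word (1000-1000); 99-999\n"
    "Word word word word"
)

# Pattern from the question: the look-ahead runs only after \d+-\d+ has matched,
# so backtracking can shorten the second \d+ by one digit to satisfy it.
attempt = re.compile(r'(\d+-\d+)(?!\))')

# Pattern from this answer: the look-ahead runs before any digit is consumed,
# so every starting position inside the brackets is rejected.
fixed = re.compile(r'(?![^()]*\))(\d+-\d+)')

attempt_matches = attempt.findall(test_str)
fixed_matches = fixed.findall(test_str)

print(attempt_matches)  # ['123-125', '1000-100', '1000-100', '99-999']
print(fixed_matches)    # ['123-125', '99-999']
```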


4 Comments

André Over a year ago

Thank you so much for your complete answer! Much appreciated. Could you explain what is the function of the following part, please? [^()]*

Wiktor Stribiżew Over a year ago

[^()]* is a negated character class that means match any number of characters other than literal ( and ).

André Over a year ago

Thanks for clarifying that, @stribizhev. What if I wanted to exclude \d+-\d+ that come after a specific string, like ABC 1234-1234, or ABC: 1234-1234? I tried this: (?<!(ABC |ABC: ))(?![^()]*\))(\d+-\d+) but that didn't work at all. The error I got was: look-behind requires fixed-width pattern.

Wiktor Stribiżew Over a year ago

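The standard re library does not support variable-width look-behind. In this case, however, you can use two fixed-width look-behinds instead: (?![^()]*\))(?<!ABC )(?<!ABC: )(\d{2}-\d+). Mind that here we need to limit the first run of digits to a fixed width ({2} assumes exactly two digits before the hyphen), or the engine will simply start the match one digit later and we'll still get a match.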
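The two look-behind idea can be sketched as follows. This is not part of the original comment: the sample strings are hypothetical, and the extra (?<!\d) look-behind is an assumption added here so that the first number can stay variable-width. It stops a match from starting in the middle of a number, which is how the {2} version still picks up partial matches when the first number has more than two digits.

```python
import re

# Hypothetical sample input: the two "ABC"-prefixed ranges should be skipped.
sample = (
    "ABC 1234-1234\n"
    "ABC: 55-66\n"
    "Word word word; 123-125\n"
    "Word word (1000-1000); 99-999"
)

# Pattern from the comment: two fixed-width look-behinds, first number fixed at 2 digits.
two_digit = re.compile(r'(?![^()]*\))(?<!ABC )(?<!ABC: )(\d{2}-\d+)')

# Assumed variation: (?<!\d) forbids starting a match right after another digit,
# so \d+ can be used without the engine sliding into the middle of a number.
boundary = re.compile(r'(?![^()]*\))(?<!\d)(?<!ABC )(?<!ABC: )(\d+-\d+)')

two_digit_matches = two_digit.findall(sample)
boundary_matches = boundary.findall(sample)

print(two_digit_matches)  # ['34-1234', '23-125', '99-999']
print(boundary_matches)   # ['123-125', '99-999']
```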

Casimir et Hippolyte · Accepted Answer · 2015-04-28 10:57:04Z

2

Another way is to describe everything you don't want:

[^(\d]*(?:\([^)]*\)[^(\d]*)*

Then you can rely on an assertion that is always true: a digit is always preceded by zero or more characters that are not digits, possibly interleaved with text between brackets.

You only need to capture the digits in a group:

import re

p = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')

The advantage of this approach is that no lookahead has to be tested at each position in the string, so the pattern is fast. The drawback is that it uses a little more memory, because each overall match is a longer string: it includes the skipped text as well as the captured digits.
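A minimal sketch of this pattern on the test strings from the question, showing both the captured group and the longer overall match that the memory remark refers to. The variable names and the last input string are illustrative assumptions, not part of the original answer.

```python
import re

test_str = (
    "Word word word; 123-125\n"
    "Word word (1000-1000)\n"
    "Word word word (1000-1000); 99-999\n"
    "Word word word word"
)

# Skip non-digit text and whole bracketed groups, then capture the number range.
skip_then_capture = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')

# findall returns only the capture group of each match
captured = skip_then_capture.findall(test_str)

# group(0) is the whole match, including everything that was skipped over
whole_matches = [m.group(0) for m in skip_then_capture.finditer(test_str)]

print(captured)          # ['123-125', '99-999']
print(whole_matches[0])  # Word word word; 123-125

# Hypothetical input with a lone digit before the bracket (see the note below)
stray = skip_then_capture.findall("7 (1000-1000)")
print(stray)             # ['1000-1000']
```

One thing to watch with this sketch: the skipping only works while each match attempt starts outside the brackets. In the last input, the lone 7 makes the earlier attempts fail, the scan then restarts inside the brackets, and the bracketed range is returned.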
