Python Not Extracting Expected Pattern

Question

I'm new to RegEx and I am trying to perform a simple match to extract a list of items using re.findall. However, I am not getting the expected result. Can you please help explain why I am also getting the first piece of this string based on the below regex pattern and what I need to modify to get the desired output?

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_\w+_\w+_bar_\d+', string))

Current Output:

['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']

Desired Output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

Wiktor Stribiżew · Accepted Answer · 2017-07-16 21:26:22Z

3

The \w pattern matches letters, digits and _ symbol. Depending on the Python version and options used, the letters and digits it can match may be from the whole Unicode range or just ASCII.

So, the best way to fix the issue is by replacing \w with [^\W_]:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^\W_]+_[^\W_]+_bar_[0-9]+', string))
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']


Details:

_ - an underscore
[^\W_]+ - 1 or more chars that are either digits or letters ([^ starts a negated character class, \W matches any non-word char, and _ is added to the class so that it matches any word char other than _)
_[^\W_]+ - same as above
_bar_ - a literal substring _bar_
[0-9]+ - 1 or more ASCII digits.
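
As a quick check of the breakdown above, here is a minimal sketch that tests [^\W_] against a few single characters and then uses it to split one of the question's lines into its fields. The names alnum, checks and tokens are only illustrative and do not come from the original answer.

```python
import re

# [^\W_] means "any word character except the underscore".
alnum = re.compile(r'[^\W_]')

# Test a few single characters: letters and digits pass, '_' and
# non-word characters do not.
checks = {ch: bool(alnum.fullmatch(ch)) for ch in ['a', 'Z', '7', '_', '-', ' ']}
print(checks)

# Because '_' is excluded, the class splits the sample line into its fields.
sample = 'aaaa_1y345_xyz_orange_bar_1'
tokens = re.findall(r'[^\W_]+', sample)
print(tokens)  # ['aaaa', '1y345', 'xyz', 'orange', 'bar', '1']
```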


edited Jul 16, 2017 at 21:26

answered Jul 16, 2017 at 20:15

Wiktor Stribiżew


1 Comment

MBasith Over a year ago

Thanks for the detailed explanation.

Piyush · Accepted Answer · 2017-07-16 20:13:55Z

2

_[a-z]+_\w+_bar_\d+ should work.

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-z]+_\w+_bar_\d+', string))

Output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
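
One caveat, sketched with a made-up input line that is not in the question: this pattern only excludes the second field because that field happens to contain digits. If the second field were all lowercase letters, [a-z]+ would accept it and \w+ would then run across the next underscore. The variable names below are illustrative.

```python
import re

# Hypothetical line (not from the question): the second field is all
# lowercase letters, so [a-z]+ can match it.
line = 'aaaa_abc_xyz_orange_bar_1'

# \w+ still matches underscores, so the match starts one field too early.
loose = re.findall(r'_[a-z]+_\w+_bar_\d+', line)

# Excluding the underscore from both fields avoids that.
strict = re.findall(r'_[^\W_]+_[^\W_]+_bar_\d+', line)

print(loose)   # ['_abc_xyz_orange_bar_1']
print(strict)  # ['_xyz_orange_bar_1']
```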

answered Jul 16, 2017 at 20:13

Piyush


JohanL · Accepted Answer · 2017-07-16 20:15:21Z

2

Your problem is that \w also matches the _ (underscore) character, so the first \w+ can run across an underscore and swallow 1y345_xyz in one go. Greedy matching can sometimes be fixed by adding a ? (question mark) after the + (plus) sign, but that does not help here: the match still starts at the first underscore, and a lazy \w+? simply expands until the rest of the pattern fits (it can likely be done with some lookahead). Instead, you can choose another pattern that explicitly forbids matching the _ character:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string))

This will match what you hope for. The [^ ... ] construct means not, so [^_\W] matches any character that is neither an underscore nor a non-word character, i.e. a letter or a digit.
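
To see what adding ? after + does on this input, here is a small sketch that runs the question's pattern in greedy and lazy form next to the underscore-free pattern. The variable names are illustrative; the sample text is the one from the question.

```python
import re

text = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

# Original pattern, with greedy quantifiers.
greedy = re.findall(r'_\w+_\w+_bar_\d+', text)

# Same pattern with lazy quantifiers. Each match still begins at the first
# underscore of the line, and the second \w+? grows across '_' until
# '_bar_' can follow, so the result does not change.
lazy = re.findall(r'_\w+?_\w+?_bar_\d+', text)

# Excluding the underscore from the class is what changes the result.
fixed = re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', text)

print(greedy == lazy)  # True
print(fixed)  # ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
```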

answered Jul 16, 2017 at 20:15

JohanL


taras · Accepted Answer · 2017-07-16 20:17:35Z

2

The problem with your code is that the \w pattern also matches the underscore. For ASCII text it is equivalent to the following set of characters: [a-zA-Z0-9_]

I guess you need to match the same set but without an underscore:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_\d+', string))

The output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

edited Jul 16, 2017 at 20:17

answered Jul 16, 2017 at 20:13

taras


Soviut · Accepted Answer · 2017-07-16 20:31:17Z

2

Your \w usage is too permissive. It will find not only letters, but numbers and underscores as well. From the docs:

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

Instead, use actual character groupings to match.

_[a-z]+_[a-z]+_bar_[0-9]+

If you actually need the complete matching of \w without the underscore, you can change the character groupings to:

 [a-zA-Z0-9]
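
The docs excerpt above describes the older LOCALE/UNICODE flags. As a hedged illustration for Python 3, where str patterns are Unicode-aware by default and re.ASCII restricts \w to [a-zA-Z0-9_], the following sketch compares the two modes. The sample word and variable names are made up for the example.

```python
import re

# 'naïve_café', written with escapes so the source stays pure ASCII.
word = 'na\u00efve_caf\u00e9'

# Default for str patterns in Python 3: \w covers Unicode letters and
# digits as well as '_'.
unicode_tokens = re.findall(r'\w+', word)

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_tokens = re.findall(r'\w+', word, flags=re.ASCII)

# [^\W_] follows the same flag, minus the underscore.
ascii_alnum = re.findall(r'[^\W_]+', word, flags=re.ASCII)

print(unicode_tokens)  # ['naïve_café']
print(ascii_tokens)    # ['na', 've_caf']
print(ascii_alnum)     # ['na', 've', 'caf']
```

If only ASCII letters and digits should ever match, spelling out [a-zA-Z0-9] as suggested above gives the same result without relying on a flag.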

edited Jul 16, 2017 at 20:31

answered Jul 16, 2017 at 20:07

Soviut


3 Comments

cs95 Over a year ago

string? Since when?

taras Over a year ago

string is a name of standard library module but it can be safely used as the variable name.

Soviut Over a year ago

Regardless, better to name things after what they are, not their type, to avoid any possible collisions.
