2

I'm new to RegEx and I am trying to perform a simple match to extract a list of items using re.findall. However, I am not getting the expected result. Can you please help explain why I am also getting the first piece of this string based on the below regex pattern and what I need to modify to get the desired output?

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_\w+_\w+_bar_\d+', string))

Current Output:

['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']

Desired Output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

5 Answers

3

The \w pattern matches letters, digits and _ symbol. Depending on the Python version and options used, the letters and digits it can match may be from the whole Unicode range or just ASCII.
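To illustrate the Unicode versus ASCII point, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \w to [a-zA-Z0-9_]. The sample text is made up for illustration and does not come from the question.

```python
import re

sample = 'caf\u00e9_1'  # hypothetical input: "café_1", containing a non-ASCII letter

# Python 3 default for str patterns: \w covers Unicode letters, digits and _
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter splits the match
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_words)  # one match covering the whole sample
print(ascii_words)    # two matches: 'caf' and '_1'
```

In both modes the underscore is still part of \w, which is what causes the problem in the question.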

So, the best way to fix the issue is by replacing \w with [^\W_]:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^\W_]+_[^\W_]+_bar_[0-9]+', string))
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

Details:

  • _ - an underscore
  • [^\W_]+ - 1 or more chars that are either digits or letters ([^ starts a negated character class, \W matches any non-word char, and adding _ excludes the underscore as well, so the class matches any word char other than _)
  • _[^\W_]+ - same as above
  • _bar_ - a literal substring _bar_
  • [0-9]+ - 1 or more ASCII digits.
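The breakdown above can also be written directly into the pattern. This is only a sketch of the same regex using re.VERBOSE, which ignores whitespace in the pattern and allows inline comments; the names text and segment_pattern are chosen here for illustration.

```python
import re

text = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

# Same pattern as in the answer, split into commented parts
segment_pattern = re.compile(r'''
    _          # an underscore
    [^\W_]+    # one or more letters or digits (word chars except the underscore)
    _[^\W_]+   # a second underscore-prefixed segment of the same kind
    _bar_      # the literal substring
    [0-9]+     # one or more ASCII digits
''', re.VERBOSE)

matches = segment_pattern.findall(text)
print(matches)
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
```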

1 Comment

Thanks for the detailed explanation.
2

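_[a-z]+_\w+_bar_\d+ should work for this input, because [a-z]+ cannot match the digit that starts the first segment after aaaa or bbbb.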

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-z]+_\w+_bar_\d+', string))

Output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
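One caveat worth checking: the \w+ in this pattern can still run across underscores. The sketch below uses a made-up line (not from the question) whose first segment happens to be all lowercase letters, and compares this pattern with one that excludes the underscore.

```python
import re

# Hypothetical input: the segment right after "aaaa" contains only letters
tricky = 'aaaa_abc_xyz_orange_bar_1'

# [a-z]+ now matches "abc", and \w+ spans "xyz_orange" including its underscore
loose = re.findall(r'_[a-z]+_\w+_bar_\d+', tricky)

# Excluding the underscore from both segments keeps the match to two segments
strict = re.findall(r'_[^\W_]+_[^\W_]+_bar_\d+', tricky)

print(loose)   # ['_abc_xyz_orange_bar_1']
print(strict)  # ['_xyz_orange_bar_1']
```

So the shorter pattern is fine when the unwanted segment always contains a digit, as in the sample data, but it is less strict in general.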

2

Your problem is that \w also matches the underscore, and the regex engine returns the leftmost possible match, so each match starts at the first underscore in the line. Greedy matching can sometimes be fixed by adding a ? (question mark) after the + (plus) sign, but that does not help here, because a lazy quantifier still expands until the rest of the pattern matches (it can likely be done with some lookahead, but not in any simple way). Instead, you can choose another pattern that explicitly forbids matching the _ (underscore) character:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string))

This will match what you hope for. The [^ ... ] construct negates the class, so [^_\W] means not an underscore and not a non-word character, i.e. a letter or a digit.
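As a quick check of the remark about putting ? after +, this sketch compares the greedy pattern from the question with a lazy variant on the same input. The variable names are chosen here for illustration.

```python
import re

text = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

# Greedy quantifiers, as in the question
greedy = re.findall(r'_\w+_\w+_bar_\d+', text)

# Lazy quantifiers: each match still begins at the first underscore of the line,
# and \w+? simply expands across underscores until _bar_ can follow
lazy = re.findall(r'_\w+?_\w+?_bar_\d+', text)

print(greedy == lazy)  # True
print(lazy[0])         # _1y345_xyz_orange_bar_1
```

Both variants return the same overlong matches, which is why excluding the underscore from the character class is the simpler fix.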

2

The problem with your code is that the \w pattern also matches the underscore: for ASCII input it is equivalent to the following set of characters: [a-zA-Z0-9_]

I guess you need to match the same set but without an underscore:

import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

print(re.findall(r'_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_\d+', string))

The output:

['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

2

Your \w usage is too permissive. It will match not only letters, but digits and underscores as well. From the Python 2 docs:

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

Instead, use explicit character classes to match:

_[a-z]+_[a-z]+_bar_[0-9]+

If you actually need everything \w matches except the underscore, you can change the character classes to:

 [a-zA-Z0-9]
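To show how the two character classes differ, here is a small sketch using a made-up line that contains uppercase letters and digits in the wanted segments. The input and variable names are assumptions for illustration only.

```python
import re

# Hypothetical input with uppercase letters and a digit in the segments
mixed = 'cccc_9x_Abc_Orange7_bar_2'

# Lowercase-only classes reject "Abc" and "Orange7", so nothing matches
lowercase_only = re.findall(r'_[a-z]+_[a-z]+_bar_[0-9]+', mixed)

# ASCII letters and digits without the underscore accept both segments
alphanumeric = re.findall(r'_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_[0-9]+', mixed)

print(lowercase_only)  # []
print(alphanumeric)    # ['_Abc_Orange7_bar_2']
```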

3 Comments

string? Since when?
string is the name of a standard library module, but it can safely be used as a variable name.
Regardless, better to name things after what they are, not their type, to avoid any possible collisions.