Filter Pandas dataframe by column name on regex patterns using str.contains

Question

I want to find columns in a dataframe whose names match a string pattern. There are two parts: first I find the columns that contain "WORDABC", and then I want the one whose number is exactly "1" (i.e. "WORDABC1"). To do this I have been using the Pandas .str.contains function.

My problem is when there are two numbers, such as "11" or "13".

import pandas as pd

df = pd.DataFrame({'WORDABC1': {0: 1, 1: 2, 2: 3},
 'WORDABC11': {0: 4, 1: 5, 2: 6},
 'WORDABC8N123': {0: 7, 1: 8, 2: 9},
 'WORDABC81N123': {0: 10, 1: 11, 2: 12},
 'WORDABC9N123': {0: 13, 1: 14, 2: 15},
 'WORDABC99N123': {0: 16, 1: 17, 2: 18}})

Trying to search for the column that contains "WORDABC1" gives two results, "WORDABC1" and "WORDABC11":

df[df.columns[df.columns.str.contains(pat = 'WORDABC1')]]

   WORDABC1  WORDABC11
0         1          4
1         2          5
2         3          6

Adding a word boundary (\\b) to the pattern returns only "WORDABC1":

df[df.columns[df.columns.str.contains(pat = 'WORDABC1\\b')]]

   WORDABC1
0         1
1         2
2         3

For the above example, it works for me. However, my problem happens if there are more characters after the matched pattern.

df[df.columns[df.columns.str.contains(pat = 'WORDABC9')]]
   WORDABC9N123  WORDABC99N123
0            13             16
1            14             17
2            15             18

df[df.columns[df.columns.str.contains(pat = 'WORDABC9\\b')]]
Empty DataFrame
Columns: []
Index: [0, 1, 2]

I only want the "WORDABC9N123" column, and I cannot just remove the other column. I have considered just using df[df.columns[df.columns.str.contains(pat = 'WORDABC9')][0]] to get the series I want, but that creates another issue.

I have also been using things such as (df.columns.str.contains(pat = 'WORDABC1\\b')).sum() to create truth statements, so the above df[0] method doesn't help me get through the issue.

Is there a better method to use instead of str.contains? Or is my regex just incorrect? Thank you!

It's unclear what the rules are. Is it just that there can be no additional numbers at the end of the pattern?
– Henry Ecker ♦, Commented Aug 20, 2021 at 21:23

Andrej Kesely · Accepted Answer · 2021-08-20 21:31:12Z

6

Try .filter with regex= parameter:

print(df.filter(regex=r"WORDABC9(?=[^\d]|$)"))

Prints:

   WORDABC9N123
0            13
1            14
2            15
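As a hedged sketch of the same idea, the lookahead can also be written as a negative lookahead, (?!\d), which reads as "not followed by a digit" and also covers the end of the name. The helper name columns_with_prefix below is hypothetical (it is not part of pandas), and it assumes the prefix should be matched literally, so it is passed through re.escape first.

```python
import re

import pandas as pd

# Same column names as in the question.
df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC8N123": [7, 8, 9],
    "WORDABC81N123": [10, 11, 12],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})


def columns_with_prefix(frame, prefix):
    """Hypothetical helper: keep columns containing `prefix` not followed by a digit."""
    # re.escape treats the prefix as literal text; (?!\d) rejects a trailing digit.
    pattern = re.escape(prefix) + r"(?!\d)"
    return frame.filter(regex=pattern)


print(list(columns_with_prefix(df, "WORDABC9").columns))  # ['WORDABC9N123']
print(list(columns_with_prefix(df, "WORDABC1").columns))  # ['WORDABC1']
```

Because the lookahead consumes no characters, the same pattern works whether the prefix sits at the end of the column name or is followed by letters.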


1 Comment

hoobs52 Over a year ago

Thank you! This worked for me. So for this expression, after finding WORDABC9, we do a positive lookahead where we look for one character that is not a digit, or the end of the string. That makes a lot of sense. I will continue to practice more regex!

Bill the Lizard · Answer · 2021-08-20 21:32:50Z

1

pat = 'WORDABC1\\b' works when matching 'WORDABC1' because \\b matches word boundaries, and the end of a string is a word boundary.

If you want to match 'WORDABC9N123' but not 'WORDABC99N123', the similar pattern 'WORDABC9\\b' will not work because there is no word boundary in either case.

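I think you want to match WORDABC9 followed by a non-digit, in which case you can try pat = 'WORDABC9(?:\\b|\\D)'. The alternation has to go in a group rather than in square brackets, because inside a character class \\b means a backspace character and | is a literal character. That will match either WORDABC9 or WORDABC9N..., but not WORDABC99N123.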
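The word-boundary behaviour described in this answer can be checked directly. This is a minimal sketch that assumes the column names from the question; it uses a negative lookahead, (?!\d), as one alternative way of saying "not followed by a digit", and it also shows the .sum() count mentioned in the question.

```python
import pandas as pd

# Column names taken from the question.
cols = pd.Index(["WORDABC1", "WORDABC11", "WORDABC8N123",
                 "WORDABC81N123", "WORDABC9N123", "WORDABC99N123"])

# \b needs a transition between a word and a non-word character;
# "9" followed by "N" or by another "9" has no such transition.
boundary_hits = cols[cols.str.contains(r"WORDABC9\b")]

# (?!\d) only requires that the next character is not a digit,
# which is also true at the end of the string.
lookahead_mask = cols.str.contains(r"WORDABC9(?!\d)")
lookahead_hits = cols[lookahead_mask]

print(list(boundary_hits))        # []
print(list(lookahead_hits))       # ['WORDABC9N123']
print(int(lookahead_mask.sum()))  # 1
```

The boolean mask can be reused for both selecting columns and counting matches, so the truth-statement checks keep working with the stricter pattern.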
