3

I want to find columns in a dataframe whose names match a string pattern. There are two parts: first, find the columns that contain "WORDABC"; then, among those, find the one numbered "1" (i.e. "WORDABC1"). To do this I have been using the pandas .str.contains method.

My problem is when there are two numbers, such as "11" or "13".

df = pd.DataFrame({'WORDABC1': {0: 1, 1: 2, 2: 3},
 'WORDABC11': {0: 4, 1: 5, 2: 6},
 'WORDABC8N123': {0: 7, 1: 8, 2: 9},
 'WORDABC81N123': {0: 10, 1: 11, 2: 12},
 'WORDABC9N123': {0: 13, 1: 14, 2: 15},
 'WORDABC99N123': {0: 16, 1: 17, 2: 18}})

Trying to search for the column that contains "WORDABC1" gives two results, "WORDABC1" and "WORDABC11":

df[df.columns[df.columns.str.contains(pat = 'WORDABC1')]]

   WORDABC1  WORDABC11
0         1          4
1         2          5
2         3          6

Adding a word boundary (\\b) to the pattern narrows this down to the single column:

df[df.columns[df.columns.str.contains(pat = 'WORDABC1\\b')]]

   WORDABC1
0         1
1         2
2         3

For the above example, that works for me. However, the problem appears when more characters follow the matched pattern.

df[df.columns[df.columns.str.contains(pat = 'WORDABC9')]]
   WORDABC9N123  WORDABC99N123
0            13             16
1            14             17
2            15             18

df[df.columns[df.columns.str.contains(pat = 'WORDABC9\\b')]]
Empty DataFrame
Columns: []
Index: [0, 1, 2]

I only want the "WORDABC9N123" column, and I cannot just remove the other column. I have considered using df[df.columns[df.columns.str.contains(pat = 'WORDABC9')][0]] to get the series I want, but that creates another issue.

I have also been using expressions such as (df.columns.str.contains(pat = 'WORDABC1\\b')).sum() to build boolean checks, so the [0] indexing approach above doesn't get me past the issue.

Is there a better method to use instead of str.contains? Or is my regex just incorrect? Thank you!

1
  • It's unclear what the rules are. Is it just that there can be no additional numbers at the end of the pattern? Commented Aug 20, 2021 at 21:23

2 Answers

6

Try .filter with regex= parameter:

print(df.filter(regex=r"WORDABC9(?=[^\d]|$)"))

Prints:

   WORDABC9N123
0            13
1            14
2            15
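
A minimal sketch of the same idea wrapped in a reusable function. The helper name `columns_with_prefix` is hypothetical (it is not part of pandas), and the sketch assumes a negative lookahead `(?!\d)` in place of the positive lookahead above; both say "not followed by a digit". `re.escape` is only there in case the prefix ever contains regex metacharacters.

```python
import re

import pandas as pd


def columns_with_prefix(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Hypothetical helper: keep columns whose name contains `prefix`
    where the prefix is not immediately followed by another digit."""
    pattern = re.escape(prefix) + r"(?!\d)"
    # DataFrame.filter(regex=...) keeps labels for which re.search matches.
    return df.filter(regex=pattern)


df = pd.DataFrame({
    "WORDABC1": [1, 2, 3],
    "WORDABC11": [4, 5, 6],
    "WORDABC9N123": [13, 14, 15],
    "WORDABC99N123": [16, 17, 18],
})

print(list(columns_with_prefix(df, "WORDABC9").columns))  # ['WORDABC9N123']
print(list(columns_with_prefix(df, "WORDABC1").columns))  # ['WORDABC1']
```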

1 Comment

Thank you! This worked for me. So for this expression, after finding WORDABC9, we do a positive lookahead that looks for either one character that is not a digit or the end of the string. That makes a lot of sense. I will continue to practice more regex!
1

pat = 'WORDABC1\\b' works when matching 'WORDABC1' because \\b matches word boundaries, and the end of a string is a word boundary.

If you want to match 'WORDABC9N123' but not 'WORDABC99N123', the similar pattern 'WORDABC9\\b' will not work because there is no word boundary in either case.
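
The word-boundary behaviour can be checked with the standard `re` module alone. This is a small sketch on a plain list of names (the variable names are just for illustration): `\b` only matches between a word character (letter, digit or underscore) and a non-word character or the edge of the string.

```python
import re

names = ["WORDABC1", "WORDABC11", "WORDABC9N123", "WORDABC99N123"]

# "WORDABC1" ends the string right after the 1, so \b matches there.
boundary_hits = [n for n in names if re.search(r"WORDABC1\b", n)]
print(boundary_hits)  # ['WORDABC1']

# After "WORDABC9" comes "N" or "9": both are word characters, so no \b.
no_hits = [n for n in names if re.search(r"WORDABC9\b", n)]
print(no_hits)  # []
```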

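I think you want to match WORDABC9 followed by either a non-digit or the end of the string, in which case you can try pat = 'WORDABC9(?:\\D|$)'. The alternation has to go in a group rather than a character class, because inside [...] \\b means a backspace character and | is a literal pipe. That will match either WORDABC9 or WORDABC9N..., but not WORDABC99N123.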
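
Since the question also uses .sum() on the result of str.contains for boolean checks, here is a short sketch of that usage. It assumes a negative lookahead, `(?!\d)`, as another way to write "not followed by a digit"; it has no groups, so str.contains does not warn about match groups.

```python
import pandas as pd

columns = pd.Index(["WORDABC1", "WORDABC11", "WORDABC9N123", "WORDABC99N123"])

# Boolean mask: True where the name contains WORDABC9 not followed by a digit.
mask = columns.str.contains(r"WORDABC9(?!\d)")

print(columns[mask].tolist())  # ['WORDABC9N123']
print(int(mask.sum()))         # 1

# The count can feed a truth check directly.
has_exactly_one = int(mask.sum()) == 1
```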
