Filtering DataFrames in Pandas for multiple columns where a column name contains a pattern

Question

I have seen examples that filter rows on multiple columns using something like df[df['A'].str.contains("string") | df['B'].str.contains("string")].

I have multiple files. For each file, I want to keep only the rows that contain 'gmail.com' in any column whose name contains the string 'email'.

An example header looks like: 'firstname' 'lastname' 'companyname' 'address' 'emailid1' 'emailid2' 'emailid3' ...

The columns emailid1, emailid2, emailid3 hold email addresses, some of which contain gmail.com. I want to fetch the rows where gmail.com occurs in any one of them.

for file in files:
    pdf = pd.read_csv('Reduced/'+file,delimiter = '\t')
    emailids = [col for col in pdf.columns if 'email' in col]
    #  pdf['gmail' in pdf[emailids]]

Jon Clements · Accepted Answer · 2016-09-06 11:56:18Z


Given example input of:

df = pd.DataFrame({'email': ['[email protected]', '[email protected]'], 'somethingelse': [1, 2], 'another_email': ['[email protected]', '[email protected]']})

eg:

           another_email              email  somethingelse
0   [email protected]   [email protected]              1
1  [email protected]  [email protected]              2

You can select the columns whose names contain email, test each value for gmail.com (or whatever text you wish), then keep the rows where any of those columns matches, eg:

df[df.filter(like='email').applymap(lambda L: 'gmail.com' in L).any(axis=1)]

Which gives you:

           another_email              email  somethingelse
1  [email protected]  [email protected]              2
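
As a hedged variant of the one-liner above: in newer pandas releases (2.1 and later) applymap is deprecated in favour of DataFrame.map, and the `in` test fails on missing values. The sketch below uses a per-column str.contains instead. The sample addresses are made up for illustration and are not the ones from the answer.

```python
import pandas as pd

# Hypothetical sample data; the addresses are invented for this sketch.
df = pd.DataFrame({
    'email': ['alice@example.org', 'bob@gmail.com'],
    'somethingelse': [1, 2],
    'another_email': ['carol@example.net', None],
})

# Select the columns whose names contain 'email'.
email_cols = df.filter(like='email')

# Literal substring test per column:
#   regex=False stops '.' from acting as a regex wildcard,
#   na=False counts missing values as "no match".
mask = email_cols.apply(
    lambda s: s.str.contains('gmail.com', regex=False, na=False)
).any(axis=1)

result = df[mask]
print(result)
```

Only the second row is kept here, since it is the only one with gmail.com in an email column.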


jezrael · Accepted Answer · 2016-09-06 11:53:18Z

You can use any with boolean indexing:

pdf = pd.DataFrame({'A':[1,2,3],
                   'email1':['gmail.com','t','f'],
                   'email2':['u','gmail.com','t'],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (pdf)
   A  D  E  F     email1     email2
0  1  1  5  7  gmail.com          u
1  2  3  3  4          t  gmail.com
2  3  5  6  3          f          t

#filter column names
emailids = [col for col in pdf.columns if 'email' in col]
print (emailids)
['email1', 'email2']

#apply string function for each filtered column
#regex=False matches 'gmail.com' literally, otherwise '.' matches any character
df = pd.concat([pdf[col].str.contains('gmail.com', regex=False) for col in emailids], axis=1)

print (df)
  email1 email2
0   True  False
1  False   True
2  False  False

#filter at least one True by any
print (pdf[df.any(axis=1)])
   A  D  E  F     email1     email2
0  1  1  5  7  gmail.com          u
1  2  3  3  4          t  gmail.com
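
The steps above can be wrapped in a small helper and applied to each file in the question's loop. This is a minimal sketch, not part of the original answer: the function name rows_with_pattern and its parameters are hypothetical, and an in-memory io.StringIO buffer stands in for one tab-delimited file so the example runs on its own.

```python
import io
import pandas as pd

def rows_with_pattern(frame, column_like='email', pattern='gmail.com'):
    """Return rows where any column whose name contains `column_like`
    holds `pattern` as a literal substring. (Hypothetical helper.)"""
    cols = [col for col in frame.columns if column_like in col]
    if not cols:
        # No matching columns: return an empty frame with the same header.
        return frame.iloc[0:0]
    # astype(str) lets the .str accessor work even on all-empty columns,
    # which read_csv loads as float NaN.
    hits = pd.concat(
        [frame[col].astype(str).str.contains(pattern, regex=False) for col in cols],
        axis=1,
    )
    return frame[hits.any(axis=1)]

# Stand-in for one tab-delimited file; real code would pass a path to read_csv.
sample = io.StringIO(
    "firstname\temailid1\temailid2\n"
    "Ann\tann@gmail.com\t\n"
    "Bob\tbob@example.org\tbob@gmail.com\n"
    "Cy\tcy@example.org\t\n"
)
pdf = pd.read_csv(sample, delimiter='\t')
gmail_rows = rows_with_pattern(pdf)
print(gmail_rows)
```

Inside the question's loop, the same call would follow each pd.read_csv, for example gmail_rows = rows_with_pattern(pdf).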
