When filtering on multiple columns, I have seen examples that select rows with something like df[df['A'].str.contains("string") | df['B'].str.contains("string")].

I have multiple files. For each file, I want to keep only the rows that contain 'gmail.com' in any column whose name contains the string 'email'.

So an example header can be like: 'firstname' 'lastname' 'companyname' 'address' 'emailid1' 'emailid2' 'emailid3' ...

The columns emailid1, emailid2, emailid3 hold email addresses, some of which contain gmail.com. I want to fetch the rows where gmail.com occurs in any one of them.

for file in files:
    pdf = pd.read_csv('Reduced/' + file, delimiter='\t')
    emailids = [col for col in pdf.columns if 'email' in col]
    # attempted, but this does not work:
    # pdf['gmail' in pdf[emailids]]
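
A note on the commented-out line, as a small sketch on made-up data: `in` applied to a DataFrame tests the column labels rather than the cell values, so the expression inside the brackets evaluates to a single Python bool instead of a row mask, and indexing with that bool is treated as a column lookup. The frame below and the names `label_check` and `lookup_failed` are invented for illustration.

```python
import pandas as pd

# Made-up frame with the same kind of header as in the question.
pdf = pd.DataFrame({
    'firstname': ['Ann', 'Ben'],
    'emailid1': ['ann@gmail.com', 'ben@example.com'],
    'emailid2': ['ann@example.org', 'ben@example.net'],
})
emailids = [col for col in pdf.columns if 'email' in col]

# `in` looks at column labels, so this is a plain bool, not a per-row mask.
label_check = 'gmail' in pdf[emailids]
print(label_check)

# Indexing with that bool is handled as a column lookup and raises KeyError.
try:
    pdf[label_check]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

What is needed instead is a boolean Series with one entry per row, built from the cell contents of the email columns.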

2 Answers

Given example input of:

df = pd.DataFrame({'email': ['[email protected]', '[email protected]'], 'somethingelse': [1, 2], 'another_email': ['[email protected]', '[email protected]']})

which displays as:

           another_email              email  somethingelse
0   [email protected]   [email protected]              1
1  [email protected]  [email protected]              2

You can select the columns whose names contain email, test each cell for gmail.com (or whatever text you wish), then keep the rows where any of those cells matches, e.g.:

df[df.filter(like='email').applymap(lambda L: 'gmail.com' in L).any(axis=1)]

Which gives you:

           another_email              email  somethingelse
1  [email protected]  [email protected]              2
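
A possible variation on the one-liner above, sketched under the assumption that some email cells may be missing (NaN), in which case `'gmail.com' in L` would raise a TypeError on a float. Newer pandas releases (2.1 and later, as far as I know) also emit a deprecation warning for `applymap`. Here each selected column is tested with `Series.str.contains`, using `regex=False` so the dot is matched literally and `na=False` so missing values count as non-matches. The sample addresses and the helper name `rows_with_text` are made up for this sketch, and it assumes the selected columns hold strings.

```python
import numpy as np
import pandas as pd


def rows_with_text(df, column_like, text):
    """Return rows where any column whose name contains `column_like` has `text` in it."""
    selected = df.filter(like=column_like)
    # Test each selected column; literal match, NaN treated as "no match".
    mask = selected.apply(lambda col: col.str.contains(text, regex=False, na=False))
    return df[mask.any(axis=1)]


# Hypothetical sample data (addresses invented for this sketch).
df = pd.DataFrame({
    'email': ['alice@example.com', 'bob@gmail.com', np.nan],
    'somethingelse': [1, 2, 3],
    'another_email': ['carol@example.org', np.nan, 'dave@gmail.com'],
})

result = rows_with_text(df, 'email', 'gmail.com')
print(result)
```

With this sample, the rows at index 1 and 2 are kept, since each has gmail.com in one of its two email columns.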

You can use any with boolean indexing:

pdf = pd.DataFrame({'A':[1,2,3],
                   'email1':['gmail.com','t','f'],
                   'email2':['u','gmail.com','t'],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (pdf)
   A  D  E  F     email1     email2
0  1  1  5  7  gmail.com          u
1  2  3  3  4          t  gmail.com
2  3  5  6  3          f          t

#filter column names                   
emailids = [col for col in pdf.columns if 'email' in col]
print (emailids)
['email1', 'email2']

#apply string function for each filtered column
#(regex=False matches the dot in 'gmail.com' literally instead of as "any character")
df = pd.concat([pdf[col].str.contains('gmail.com', regex=False) for col in emailids], axis=1)

print (df)
  email1 email2
0   True  False
1  False   True
2  False  False

#keep rows with at least one True by any (axis passed by keyword; the positional form any(1) fails on pandas 2.0+)
print (pdf[df.any(axis=1)])
   A  D  E  F     email1     email2
0  1  1  5  7  gmail.com          u
1  2  3  3  4          t  gmail.com
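
To tie this back to the loop over files in the question, here is a minimal sketch of how the same steps might be wrapped in a function and applied per file. The function name `filter_gmail_rows`, the temporary folder, and the two sample files are assumptions made for the sake of a runnable example; in practice the paths would point into 'Reduced/'.

```python
import os
import tempfile

import pandas as pd


def filter_gmail_rows(path, needle='gmail.com'):
    """Read one tab-separated file and keep rows with `needle` in any email column."""
    pdf = pd.read_csv(path, delimiter='\t')
    emailids = [col for col in pdf.columns if 'email' in col]
    if not emailids:
        # No email columns at all: return an empty frame with the same header.
        return pdf.iloc[0:0]
    # One boolean column per email column; literal match, missing cells count as False.
    mask = pd.concat(
        [pdf[col].astype(str).str.contains(needle, regex=False, na=False)
         for col in emailids],
        axis=1,
    )
    return pdf[mask.any(axis=1)]


with tempfile.TemporaryDirectory() as folder:
    # Two made-up files standing in for the contents of 'Reduced/'.
    samples = {
        'a.tsv': ('firstname\temailid1\temailid2\n'
                  'Ann\tann@gmail.com\t\n'
                  'Ben\tben@example.com\tben@example.org\n'),
        'b.tsv': ('firstname\temailid1\temailid2\n'
                  'Cy\tcy@example.com\tcy@gmail.com\n'),
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), 'w') as handle:
            handle.write(text)

    files = sorted(os.listdir(folder))
    filtered = {file: filter_gmail_rows(os.path.join(folder, file)) for file in files}

for file, rows in filtered.items():
    print(file, len(rows))
```

Casting each column with `astype(str)` is a design choice here so that a column read as all-empty (and therefore numeric NaN) does not break the `.str` accessor.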
