How can one filter a DataFrame to the rows containing a specific value (in any of the columns)?

Question

I need to limit a dataset so that it returns only the rows that contain a specific string; however, that string can appear in any of several (8) columns.

How can I do this? I've seen the isin methods, but they work on a single column and return a single Series. How can I keep only the rows that contain the string in ANY of the columns?

Example code: if I had the DataFrame df generated by

    import pandas as pd
    data = {'year': [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
            'year2': [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
            'reports': [52, 20, 43, 33, 41, 11, 43, 72]}
    df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    df

   year  year2  reports
a  2011   2012       52
b  2012   2016       20
c  2013   2015       43
d  2014   2015       33
e  2014   2012       41
f  2011   2013       11
g  2012   2019       43
h  2015   2016       72

I want the code to remove all rows that do not contain the value 2012. Note that in my actual dataset the value is a string, not an int (the values are people's names). In the example above, the code would remove rows c, d, f, and h.

Edited the post to be more specific: no, I am not trying to drop known rows. The actual dataset is almost 80,000 rows, and I need to filter it to find only the data involving a single person, whose name may appear in any of 8 possible columns.
– AlbinoRhino, Commented Jan 10, 2020 at 17:57
Take a look at stackoverflow.com/a/35682788/12411517. You can remove rows using ~ or by comparing for inequality.
– sd191028, Commented Jan 10, 2020 at 18:02

anky · Accepted Answer · 2020-01-10 18:01:52Z

9

You can use df.eq with df.any on axis=1 (pass axis as a keyword, since pandas 2.0 no longer accepts it positionally):

df[df.eq('2012').any(axis=1)]  # for year as string

Or:

df[df.eq(2012).any(axis=1)]  # for year as int

   year  year2  reports
a  2011   2012       52
b  2012   2016       20
e  2014   2012       41
g  2012   2019       43
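
For reference, a minimal self-contained sketch of the exact-match filter, run against the example data from the question (integer years). The names mask and filtered are illustrative only.

```python
import pandas as pd

data = {
    "year": [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
    "year2": [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
    "reports": [52, 20, 43, 33, 41, 11, 43, 72],
}
df = pd.DataFrame(data, index=list("abcdefgh"))

# df.eq(2012) gives a boolean frame; any(axis=1) collapses it to one
# boolean per row that is True when at least one column matched.
mask = df.eq(2012).any(axis=1)
filtered = df[mask]
print(filtered)
```

Building the mask as a separate step makes it easy to inspect which rows matched before filtering.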

edited Jan 10, 2020 at 18:01

answered Jan 10, 2020 at 17:59


5 Comments

AlbinoRhino Over a year ago

This code works, ALTHOUGH future readers should note that the answer snippet searches for a string (which is what my real dataset has), so it will not return the correct results on my example code, where 2012 is an int.
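
The type mismatch described in this comment can be reproduced with a short sketch. It assumes the integer example data from the question; converting with astype(str) is one possible workaround for mixed or unknown column types, not something the answer itself proposes.

```python
import pandas as pd

data = {
    "year": [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
    "year2": [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
    "reports": [52, 20, 43, 33, 41, 11, 43, 72],
}
df = pd.DataFrame(data, index=list("abcdefgh"))

# Comparing integer columns with the string '2012' matches nothing.
no_match = df[df.eq("2012").any(axis=1)]

# Assumed workaround: compare a string view of the frame instead,
# then use the resulting mask to index the original frame.
as_text = df.astype(str)
match = df[as_text.eq("2012").any(axis=1)]

print(len(no_match), len(match))
```

Indexing the original df with the mask keeps the original dtypes in the result.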

AlbinoRhino Over a year ago

I'll be accepting this answer when the time allows. Thank you so much for the quick response and the edits.

jfaccioni Over a year ago

To search for '2012' in specific columns only, use: df[df.loc[:, columns].eq('2012').any(axis=1)], where columns is a list of the columns to search (e.g. columns = ['year', 'year2'])
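
A small sketch of the column-restricted search, using the question's integer data. The choice of columns = ['year'] is an assumption made only so that the result differs from searching the whole frame.

```python
import pandas as pd

data = {
    "year": [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
    "year2": [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
    "reports": [52, 20, 43, 33, 41, 11, 43, 72],
}
df = pd.DataFrame(data, index=list("abcdefgh"))

# Only these columns are searched; matches elsewhere are ignored.
columns = ["year"]
subset_match = df[df.loc[:, columns].eq(2012).any(axis=1)]
print(subset_match)
```

Rows a and e contain 2012 only in year2, so they drop out when the search is limited to year.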

AlbinoRhino Over a year ago

How would you modify this to drop any rows that do not contain a substring (rows in which no value satisfies 'string' in value)? For example, say I have a large dataset of names and want to return all rows that contain the name george, with any last name (column 3 may be george foreman or george brazil, and I want both returned).

anky Over a year ago

@AlbinoRhino maybe df.apply(lambda x: x.str.contains('george')).any(axis=1)? If not, that calls for a different question, since it asks for a substring match and not an exact match.
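
A sketch of the substring variant suggested above. The people frame and its column names are hypothetical. Because the .str accessor raises an error on non-string columns, this version first converts everything with astype(str); that conversion is an assumption added here, not part of the original suggestion.

```python
import pandas as pd

# Hypothetical data: names spread over several columns plus a numeric id.
people = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "boxer": ["george foreman", "muhammad ali", "mike tyson", "joe frazier"],
        "guest": ["oprah winfrey", "george brazil", "david letterman", "jay leno"],
    }
)

# Convert to strings so .str works on every column, then test each
# column for the literal substring (regex=False disables regex parsing).
as_text = people.astype(str)
mask = as_text.apply(lambda col: col.str.contains("george", regex=False)).any(axis=1)
matches = people[mask]
print(matches)
```

The match is case-sensitive as written, so a value such as "George" would not be found without further normalization.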

Zakiul Fahmi Jailani · Answer · 2020-01-10 18:01:14Z

0

Try simple code like this:

import pandas as pd
data = {'year': [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
        'year2': [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
        'reports': [52, 20, 43, 33, 41, 11, 43, 72]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df = df.drop(['c', 'd', 'f', 'h'])  # drops rows by explicit, known labels

df

It will give you a DataFrame like this:

   year  year2  reports
a  2011   2012       52
b  2012   2016       20
e  2014   2012       41
g  2012   2019       43

answered Jan 10, 2020 at 18:01


1 Comment

AlbinoRhino Over a year ago

I am not dropping known rows, but instead finding rows which lack a certain value in any of their columns, then removing them
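
If drop is still preferred, the labels can be computed rather than hard-coded. This is a sketch on the question's integer data that combines drop with a row mask; labels_to_drop and result are illustrative names.

```python
import pandas as pd

data = {
    "year": [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
    "year2": [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
    "reports": [52, 20, 43, 33, 41, 11, 43, 72],
}
df = pd.DataFrame(data, index=list("abcdefgh"))

# True for rows where at least one column equals 2012.
mask = df.eq(2012).any(axis=1)

# Labels of the rows that lack the value, found from the data itself.
labels_to_drop = df.index[~mask]
result = df.drop(labels_to_drop)
print(list(labels_to_drop))
```

The outcome is the same as df[mask]; the drop form is useful mainly when the list of removed labels is needed for logging or review.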

sd191028 · Answer · 2020-01-14 15:02:28Z

0

To get the rows that contain the value in at least one column (note any, not all; all would require every column to equal the value):

df[(df == '2012').any(axis=1)]

To get the rows that do not contain the value in any column:

df[~(df == '2012').any(axis=1)]

or

df[(df != '2012').all(axis=1)]

See the related answer at https://stackoverflow.com/a/35682788/12411517.
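
The two ways of selecting rows without the value can be checked against each other. This sketch uses the question's integer data, so it compares with the int 2012 rather than the string '2012'.

```python
import pandas as pd

data = {
    "year": [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
    "year2": [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
    "reports": [52, 20, 43, 33, 41, 11, 43, 72],
}
df = pd.DataFrame(data, index=list("abcdefgh"))

# "Not (any column equals 2012)" ...
without_negated = df[~df.eq(2012).any(axis=1)]

# ... is the same as "every column differs from 2012".
without_all_ne = df[df.ne(2012).all(axis=1)]

print(without_negated.equals(without_all_ne))
```

Both forms follow from De Morgan's law, so which one to use is a matter of readability.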

edited Jan 14, 2020 at 15:02

answered Jan 10, 2020 at 18:13


2 Comments

AlbinoRhino Over a year ago

DataFrame objects cannot use contains. If you edit this post so that it doesn't have that, I'll remove the downvote, of course. Also, if you can find a working method, I'd give the upvote.

sd191028 Over a year ago

@AlbinoRhino Sorry about that. Removed.
