Filtering rows from pandas dataframe using concatenated strings

Question

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:

1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):

df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids

where

acids

is the series containing the identifiers.

However, this gives me a

TypeError: unhashable type

2) I tried filtering using the apply function:

df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]

This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.

3) I added a new column containing the concatenated strings/identifiers, and then tried to filter on it (see "Filter dataframe rows if value in column is in a set list of values"):

df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]

But again, the dataframe doesn't change.

I hope this makes sense...

Any suggestions where I might be going wrong? Thanks, Anne

Could you post a small sample of your dataframe and series, and what you expect your results to look like?
– TomAugspurger, Commented Jul 11, 2013 at 15:21
These operations are not inplace, so the dataframe won't change unless you explicitly tell it to (this is a good thing).
– Andy Hayden, Commented Jul 11, 2013 at 15:30
Hey Andy, thanks a lot, if I add df = ... in the third solution it works.
– Anne, Commented Jul 11, 2013 at 15:40

TomAugspurger · Accepted Answer · 2013-07-11 15:38:40Z

3

I think you're asking for something like the following:

In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])

In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})

In [3]: df
Out[3]: 
  ids  vals
0   a     1
1   b     2
2   c     3
3   f     4

In [4]: other_ids
Out[4]: 
0    a
1    b
2    c
3    c
dtype: object

In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the .isin() method of the column df.ids, which is itself a series.

In [5]: df.ids.isin(other_ids)
Out[5]: 
0     True
1     True
2     True
3    False
Name: ids, dtype: bool

This gives a boolean series that we can use to index the dataframe:

In [6]: df[df.ids.isin(other_ids)]
Out[6]: 
  ids  vals
0   a     1
1   b     2
2   c     3

This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
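Applied to the question's own columns, the same idea might look like the sketch below. The sample values for AcNo, Sortcode and acids, and the Balance column, are invented for illustration; only the names AcNo, Sortcode and acids come from the question. The result is assigned to a name because indexing returns a new dataframe rather than changing df in place.

```python
import pandas as pd

# Hypothetical sample data; only the names AcNo, Sortcode and acids come from the question.
df = pd.DataFrame({
    "AcNo": ["001", "002", "003"],
    "Sortcode": ["111111", "222222", "333333"],
    "Balance": [10, 20, 30],
})
acids = pd.Series(["001111111", "003333333"])

# Build the identifier on the fly and test membership against the values in acids.
mask = (df["AcNo"] + df["Sortcode"]).isin(acids)

# Indexing returns a new dataframe, so keep the result by assigning it.
filtered = df[mask]
print(filtered)
```

Writing df = df[mask] instead would overwrite the original dataframe with the filtered one.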

Reading a bit more, you may be having trouble because you have two columns in df that hold your ids? DataFrame didn't have an isin method when this was written, but we can get around that with something like:

In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'], 
'ids2': ['e', 'f', 'c', 'f']})

In [27]: df
Out[27]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

In [28]: df.ids.isin(other_ids) | df.ids2.isin(other_ids)
Out[28]: 
0     True
1     True
2     True
3    False
dtype: bool

The | operator combines the two boolean series from the two isin() calls element-wise, giving an OR: a row is True if either column holds one of the ids. Then like before we can index with this boolean series:

In [29]: new = df[df.ids.isin(other_ids) | df.ids2.isin(other_ids)]

In [30]: new
Out[30]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
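
With a pandas version that provides DataFrame.isin, the same OR across columns can probably be written in one step by testing both columns at once and reducing with any(axis=1). This is a sketch of an alternative, not what the session above ran. One caveat: given a Series, DataFrame.isin matches on the index as well as the value, so the ids are passed as a plain list here.

```python
import pandas as pd

other_ids = pd.Series(["a", "b", "c", "c"])
df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "f", "f"],
    "ids2": ["e", "f", "c", "f"],
})

# Pass a plain list: given a Series, DataFrame.isin aligns on the index instead.
in_either = df[["ids", "ids2"]].isin(other_ids.tolist()).any(axis=1)

# Keep the rows where at least one of the two id columns matched.
new = df[in_either]
print(new)
```

Replacing any(axis=1) with all(axis=1) would give the AND variant, keeping only rows where both columns hold one of the ids.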

edited Jul 11, 2013 at 15:38

answered Jul 11, 2013 at 15:28

TomAugspurger


8 Comments

Andy Hayden Over a year ago

I think the main confusion from this question is that these operations are not inplace (ie you need to set df = df[df.ids.isin(other_ids)])

TomAugspurger Over a year ago

I think that's right. The other issue might be that she has two columns for ids, where I have one. Is there a reason that dataframe doesn't have an isin() method with some options like or and and?

Andy Hayden Over a year ago

Nah that should be ok. I think it's more efficient to do the or/and afterwards, so pandas takes an executive decision that you should do that.

TomAugspurger Over a year ago

I'll open a GH issue. No promise on a PR though. I really need to get back to work :)

Anne Over a year ago

Hey Tom, thanks a lot for the example. I have extended it to what I need: Try new_ids = pd.Series(["a1,c3"]) and df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f'], 'stuff':["insert", "whatever","you","fancy"]}). I would like to filter the rows by appending the string in columns "ids" and "vals" to match what is in new_ids... Preferably without creating an additional column first, as I am doing currently.
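
The extended example in this comment can likely be handled without storing an extra column. A minimal sketch, assuming new_ids was meant to hold the two identifiers "a1" and "c3" rather than the single string "a1,c3", and using the hypothetical names keys and result:

```python
import pandas as pd

# Assumption: two separate identifiers, not the single string "a1,c3".
new_ids = pd.Series(["a1", "c3"])
df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "c", "f"],
    "stuff": ["insert", "whatever", "you", "fancy"],
})

# vals is numeric, so convert it to str before concatenating with ids.
keys = df["ids"] + df["vals"].astype(str)

# keys is a temporary series; it is never stored as a column of df.
result = df[keys.isin(new_ids)]
print(result)
```

The concatenated series only exists as a local variable, so df keeps its original three columns.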
