I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:

1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):

df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids

where

acids

is the series containing the identifiers.

However, this gives me a

TypeError: unhashable type

2) I tried filtering using the apply function:

df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]

This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.

3) I added a new column containing the concatenated strings/identifiers, and then tried to filter on it (see Filter dataframe rows if value in column is in a set list of values):

df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]

But again, the dataframe doesn't change.

I hope this makes sense...

Any suggestions where I might be going wrong? Thanks, Anne

3 Comments
  • Could you post a small sample of your dataframe and series, and what you expect your results to look like? Commented Jul 11, 2013 at 15:21
  • These operations are not inplace, so the dataframe won't just change unless you explicitly tell it to (this is a good thing). Commented Jul 11, 2013 at 15:30
  • Hey Andy, thanks a lot, if I add df = ... in the third solution it works. Commented Jul 11, 2013 at 15:40

1 Answer

I think you're asking for something like the following:

In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])

In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})

In [3]: df
Out[3]: 
  ids  vals
0   a     1
1   b     2
2   c     3
3   f     4

In [4]: other_ids
Out[4]: 
0    a
1    b
2    c
3    c
dtype: object

In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the .isin() method of the ids column (a series).

In [5]: df.ids.isin(other_ids)
Out[5]: 
0     True
1     True
2     True
3    False
Name: ids, dtype: bool

This gives a boolean series that we can use to index the dataframe:

In [6]: df[df.ids.isin(other_ids)]
Out[6]: 
  ids  vals
0   a     1
1   b     2
2   c     3

This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
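
For the two-column case in the question, here is a minimal sketch of the same idea. The column names AcNo and Sortcode and the series name acids come from the question; the values, the extra amount column, and the names mask and filtered are made up for illustration. It also shows why the `in acids` attempts don't behave as hoped: `in` on a series tests the index labels, not the values.

```python
import pandas as pd

# Hypothetical data: column names are from the question, values are invented.
df = pd.DataFrame({
    "AcNo": ["001", "002", "003"],
    "Sortcode": ["11", "22", "33"],
    "amount": [10, 20, 30],
})
acids = pd.Series(["00111", "00333"])

# `in` on a Series checks the index labels, not the values.
value_found_by_in = "00111" in acids   # "00111" is a value, not a label
label_found_by_in = 0 in acids         # 0 is an index label of acids

# Build the identifiers on the fly (no extra column needed) and test them
# against the values of acids.
mask = (df["AcNo"] + df["Sortcode"]).isin(acids)

# Indexing returns a new dataframe; df keeps all its rows until reassigned.
filtered = df[mask]
print(filtered)
```

Writing df = df[mask] instead of filtered = df[mask] would overwrite the original dataframe with the filtered rows.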

Reading a bit more, you may be having trouble because you have two columns in df that are your ids? At the time of writing, DataFrame doesn't have an isin method (later pandas releases added one), but we can get around that with something like:

In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'], 
'ids2': ['e', 'f', 'c', 'f']})

In [27]: df
Out[27]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]: 
0     True
1     True
2     True
3    False
dtype: bool

True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls gives something like an OR operation (the | operator does the same thing explicitly). Then, like before, we can use this boolean series to index the dataframe:

In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]

In [30]: new
Out[30]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
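
As an alternative sketch, assuming a pandas version in which DataFrame.isin is available, the two per-column checks can be collapsed into one call followed by any(axis=1). The name hits is hypothetical; the data is the same as in the example above.

```python
import pandas as pd

other_ids = pd.Series(["a", "b", "c", "c"])
df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "f", "f"],
    "ids2": ["e", "f", "c", "f"],
})

# Element-wise membership test on both id columns at once.
# A plain list is passed because DataFrame.isin aligns a Series argument
# on its index instead of treating it as a bag of values.
hits = df[["ids", "ids2"]].isin(other_ids.tolist())

# Keep rows where at least one of the two columns matched (logical OR).
new = df[hits.any(axis=1)]
print(new)
```

Swapping any(axis=1) for all(axis=1) would give the AND version, keeping only rows where both columns match.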

8 Comments

I think the main confusion from this question is that these operations are not inplace (ie you need to set df = df[df.ids.isin(other_ids)])
I think that's right. The other issue might be that she has two columns for ids, where I have one. Is there a reason that dataframe doesn't have an isin() method with some options like or and and?
Nah that should be ok. I think it's more efficient to do the or/and afterwards, so pandas takes an executive decision that you should do that.
I'll open a GH issue. No promise on a PR though. I really need to get back to work :)
Hey Tom, thanks a lot for the example. I have extended it to what I need: Try new_ids = pd.Series(["a1,c3"]) and df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f'], 'stuff':["insert", "whatever","you","fancy"]}). I would like to filter the rows by appending the string in columns "ids" and "vals" to match what is in new_ids... Preferably without creating an additional column first, as I am doing currently.