Pandas seems to promote an int to a float when filtering. I've provided a simple snippet below, but I have a much more complex example in which I believe this promotion leads to incorrect filtering because it compares floats. Is there a way around this? I read that this behaviour changed between pandas versions; it certainly didn't use to be the case.

As you can see below, it changes [4 13] and [5 14] to [4.0 13.0] and [5.0 14.0].

In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})  
    ...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})                                                                                             

In [54]: df1                                                                                                                                                                
Out[54]: 
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

In [55]: df2                                                                                                                                                                
Out[55]: 
   col1  col2
0     1    10
1     2    11
2     3    12

In [56]: df1[~df1.isin(df2)]                                                                                                                                                
Out[56]: 
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   4.0  13.0
4   5.0  14.0

In [57]: df1[~df1.isin(df2)].dropna()                                                                                                                                       
Out[57]: 
   col1  col2
3   4.0  13.0
4   5.0  14.0

In [58]: df1[~df1.isin(df2)].dtypes                                                                                                                                         
Out[58]: 
col1    float64
col2    float64
dtype: object

In [59]: df1.dtypes                                                                                                                                                         
Out[59]: 
col1    int64
col2    int64
dtype: object

In [60]: df2.dtypes                                                                                                                                                         
Out[60]: 
col1    int64
col2    int64
dtype: object
It's not because of float comparison, it's because of the NaNs. You could use the Int64 dtype, which can hold missing values in an integer column, if you wish. Commented Nov 1, 2019 at 15:55

1 Answer

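There is no float comparison happening here. isin itself returns a boolean DataFrame; the NaNs appear when you index df1 with that mask, because every cell where the mask is False is replaced with NaN. Since numpy's int64 cannot represent NaN, the affected columns are cast to float64.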
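To see where the promotion comes from, here is a minimal sketch that rebuilds the frames from the question and inspects each step separately. The variable names (mask, masked) are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# isin compares cell by cell (aligned on index and column labels)
# and returns booleans; no floats are involved at this point.
mask = df1.isin(df2)
print(mask.dtypes)

# Indexing with a boolean DataFrame keeps the shape of df1 and
# puts NaN wherever the mask is False, like DataFrame.where.
masked = df1[~mask]
print(masked.dtypes)

# The two spellings give the same frame.
print(masked.equals(df1.where(~mask)))
```

The comparison is done on the original integers; the float64 dtype only shows up in the final frame because it has to hold NaN.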

In 0.24, pandas added a nullable integer dtype, which you can use here.


df1 = df1.astype('Int64')
df2 = df2.astype('Int64')

df1[~df1.isin(df2)]

   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3     4    13
4     5    14

Just be aware that if you want to use numpy operations on the result, numpy will treat it as an array with dtype object.
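If the goal is to drop whole rows rather than blank out individual cells, another option is to reduce the mask to one boolean per row before indexing. This is a sketch that is not part of the answer above, and it assumes a row should be dropped only when every column matches df2 at the same index label. Because no NaN is ever inserted, the columns stay int64.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Collapse the cell-wise mask to a boolean Series: True where
# every column of the row was found in df2.
row_mask = df1.isin(df2).all(axis=1)

# Indexing with a boolean Series selects rows, so nothing is
# replaced with NaN and the dtypes are left alone.
result = df1[~row_mask]
print(result)
print(result.dtypes)
```

This also makes the dropna() call from the question unnecessary, since the unwanted rows are never included in the first place.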
