Pandas promotes int to float when filtering

Question

Pandas seems to promote int to float when filtering. The snippet below is a simplified example; in my real, more complex case I believe this promotion leads to incorrect filtering, because floats end up being compared. Is there a way around this? I read that this behaviour changed between pandas versions; it certainly didn't use to be the case.

As you can see below, the rows [4 13] and [5 14] become [4.0 13.0] and [5.0 14.0].

In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})  
    ...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})                                                                                             

In [54]: df1                                                                                                                                                                
Out[54]: 
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

In [55]: df2                                                                                                                                                                
Out[55]: 
   col1  col2
0     1    10
1     2    11
2     3    12

In [56]: df1[~df1.isin(df2)]                                                                                                                                                
Out[56]: 
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   4.0  13.0
4   5.0  14.0

In [57]: df1[~df1.isin(df2)].dropna()                                                                                                                                       
Out[57]: 
   col1  col2
3   4.0  13.0
4   5.0  14.0

In [58]: df1[~df1.isin(df2)].dtypes                                                                                                                                         
Out[58]: 
col1    float64
col2    float64
dtype: object

In [59]: df1.dtypes                                                                                                                                                         
Out[59]: 
col1    int64
col2    int64
dtype: object

In [60]: df2.dtypes                                                                                                                                                         
Out[60]: 
col1    int64
col2    int64
dtype: object

It's not because of float comparison, it's because of the NaNs. You could use the Int64 dtype, which can hold missing values in an integer column, if you wish.
– user3483203, Commented Nov 1, 2019 at 15:55

user3483203 · Accepted Answer · 2019-11-01 15:57:53Z

There is no float comparison happening here. isin returns a boolean mask; indexing df1 with that mask replaces every cell where the mask is False with NaN. Since you are using numpy's int64, which cannot represent NaN, the result gets cast to float64.
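To make the mechanism concrete, here is a minimal sketch using the frames from the question. It checks the dtypes at each step, and it also shows one possible alternative that is not part of the original answer: reducing the mask to one boolean per row, so that whole rows are selected and no NaN is ever introduced. The names mask, masked and row_filtered are illustrative, and the row-wise variant assumes you want to keep only rows in which no cell matched.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# isin compares cell by cell (aligned on index and columns) and returns booleans.
mask = ~df1.isin(df2)

# Indexing with a boolean DataFrame keeps the shape and fills the
# cells where the mask is False with NaN, which forces int64 -> float64.
masked = df1[mask]

# Assumed alternative: collapse the mask to one boolean per row.
# Indexing with a boolean Series drops rows instead of inserting NaN,
# so the original int64 dtype is preserved.
row_filtered = df1[mask.all(axis=1)]

print(mask.dtypes)
print(masked.dtypes)
print(row_filtered.dtypes)
```

Whether the row-wise filter is appropriate depends on the real data: it drops a row as soon as any cell in it matches, which is not the same as blanking individual cells.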

In 0.24, pandas added a nullable integer dtype, which you can use here.

df1 = df1.astype('Int64')
df2 = df2.astype('Int64')

df1[~df1.isin(df2)]

   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3     4    13
4     5    14

Just be aware that if you wanted to use numpy operations on the result, numpy would treat the above as an array with dtype object.
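If the filtered result does need to go to numpy, one way to avoid relying on the default conversion is to request the target dtype explicitly. The sketch below is an illustration under assumptions, not part of the original answer: it assumes a pandas version where to_numpy accepts the na_value argument (1.0 or later), and it builds the mask from the original int64 frames before casting df1 to Int64, so the mask itself is plain bool.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Build the boolean mask on the plain int64 frames.
mask = ~df1.isin(df2)

# Apply it to a nullable-integer copy so the kept values stay integers.
result = df1.astype('Int64')[mask]
col = result['col1']

# Option 1: accept floats, mapping missing values to NaN explicitly.
as_float = col.to_numpy(dtype='float64', na_value=np.nan)

# Option 2: drop the missing values first, then convert to a true int64 array.
as_int = col.dropna().to_numpy(dtype='int64')

print(col.dtype, as_float.dtype, as_int.dtype)
```

Option 2 keeps exact integers but changes the length of the array, so it only fits cases where the missing rows are not needed afterwards.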

