Is there a better way to determine whether a variable in Pandas and/or NumPy is numeric or not? I have a self-defined dictionary with dtypes as keys and numeric/not-numeric as values.
In pandas 0.20.2 you can do:
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1.0, 2.0, 3.0]})
is_string_dtype(df['A'])
>>> True
is_numeric_dtype(df['B'])
>>> True
Note that is_numeric_dtype returns True for the boolean dtype as well. is_integer_dtype is also useful.

You can use np.issubdtype to check whether the dtype is a subdtype of np.number. Examples:
np.issubdtype(arr.dtype, np.number) # where arr is a numpy array
np.issubdtype(df['X'].dtype, np.number) # where df['X'] is a pandas Series
This works for NumPy's dtypes but fails for pandas-specific types like pd.Categorical, as Thomas noted. If you are using categoricals, the is_numeric_dtype function from pandas is a better alternative than np.issubdtype.
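To illustrate the difference, here is a minimal sketch; the exact behavior of np.issubdtype on extension dtypes can vary by NumPy/pandas version, so the call is wrapped defensively:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A categorical Series backed by integer categories.
cat = pd.Series([1, 2, 3], dtype="category")

# np.issubdtype generally cannot interpret pandas' CategoricalDtype
# as a NumPy dtype and raises TypeError.
try:
    result = np.issubdtype(cat.dtype, np.number)
    print("np.issubdtype returned:", result)
except TypeError:
    print("np.issubdtype raised TypeError on CategoricalDtype")

# The pandas helper handles extension dtypes gracefully: a Categorical
# is not considered numeric, even when its categories are numbers.
print(is_numeric_dtype(cat))  # False
```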
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
'C': [1j, 2j, 3j], 'D': ['a', 'b', 'c']})
df
Out:
A B C D
0 1 1.0 1j a
1 2 2.0 2j b
2 3 3.0 3j c
df.dtypes
Out:
A int64
B float64
C complex128
D object
dtype: object
np.issubdtype(df['A'].dtype, np.number)
Out: True
np.issubdtype(df['B'].dtype, np.number)
Out: True
np.issubdtype(df['C'].dtype, np.number)
Out: True
np.issubdtype(df['D'].dtype, np.number)
Out: False
For multiple columns you can use np.vectorize:
is_number = np.vectorize(lambda x: np.issubdtype(x, np.number))
is_number(df.dtypes)
Out: array([ True, True, True, False], dtype=bool)
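If you prefer to stay within pandas, the same per-column check works without np.vectorize, because df.dtypes is itself a Series you can .apply over (a sketch using the example frame above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
                   'C': [1j, 2j, 3j], 'D': ['a', 'b', 'c']})

# df.dtypes is a Series of dtype objects, so .apply runs column by column.
is_number = df.dtypes.apply(lambda d: np.issubdtype(d, np.number))
print(is_number.tolist())  # [True, True, True, False]
```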
And for selection, pandas now has select_dtypes:
df.select_dtypes(include=[np.number])
Out:
A B C
0 1 1.0 1j
1 2 2.0 2j
2 3 3.0 3j
Based on @jaime's answer in the comments, you need to check .dtype.kind for the column of interest. For example:
>>> import pandas as pd
>>> df = pd.DataFrame({'numeric': [1, 2, 3], 'not_numeric': ['A', 'B', 'C']})
>>> df['numeric'].dtype.kind in 'biufc'
True
>>> df['not_numeric'].dtype.kind in 'biufc'
False
NB The meaning of biufc: b bool, i int (signed), u unsigned int, f float, c complex. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.kind.html#numpy.dtype.kind
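The kind check is easy to wrap in a small helper and apply across a whole frame (a sketch; the helper name is made up, and note that it counts bool as numeric since 'b' is in the set):

```python
import pandas as pd

def is_numeric_kind(series: pd.Series) -> bool:
    """True if the Series' dtype.kind is one of b/i/u/f/c."""
    return series.dtype.kind in 'biufc'

df = pd.DataFrame({'numeric': [1, 2, 3], 'not_numeric': ['A', 'B', 'C']})
print({col: is_numeric_kind(df[col]) for col in df.columns})
# {'numeric': True, 'not_numeric': False}
```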
Note that u is for unsigned integer; uppercase U is for unicode.

DataFrames have the select_dtypes method. This will return a subset of the DataFrame containing only the "numeric" columns (columns of dtype int64/float64):
df.select_dtypes(include=['int64', 'float64'])
You can also pass include=np.number (with import numpy as np) to include all numeric dtypes.

_get_numeric_data() is a pseudo-internal method that returns only the numeric-type data (the example below assumes from pandas import DataFrame, Timestamp and import numpy as np):
In [27]: df = DataFrame(dict(A = np.arange(3),
    ...:                     B = np.random.randn(3),
    ...:                     C = ['foo','bar','bah'],
    ...:                     D = Timestamp('20130101')))
In [28]: df
Out[28]:
A B C D
0 0 -0.667672 foo 2013-01-01 00:00:00
1 1 0.811300 bar 2013-01-01 00:00:00
2 2 2.020402 bah 2013-01-01 00:00:00
In [29]: df.dtypes
Out[29]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
In [30]: df._get_numeric_data()
Out[30]:
A B
0 0 -0.667672
1 1 0.811300
2 2 2.020402
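Since _get_numeric_data is a private method whose behavior may change between pandas versions, the public equivalent is select_dtypes with np.number (a sketch of the same selection):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(3),
                   'B': np.random.randn(3),
                   'C': ['foo', 'bar', 'bah'],
                   'D': pd.Timestamp('20130101')})

# Public API: keep only columns whose dtype is a subtype of np.number.
numeric = df.select_dtypes(include=np.number)
print(numeric.columns.tolist())  # ['A', 'B']
```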
How about just checking the type of one of the values in the column? We've always had something like this (note that long exists only in Python 2; on Python 3 it is just int):
isinstance(x, (int, long, float, complex))
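A quick sketch of this value-based check on Python 3 (without long). Because it inspects a single element rather than the dtype, it works even when a column's dtype is object:

```python
import pandas as pd

# dtype=object keeps plain Python values in both columns.
df = pd.DataFrame({'nums': [1, 2, 3], 'words': ['a', 'b', 'c']}, dtype=object)

# Check the first value of each column rather than the column's dtype.
print(isinstance(df['nums'].iloc[0], (int, float, complex)))   # True
print(isinstance(df['words'].iloc[0], (int, float, complex)))  # False
```

The caveat is that this only describes one element; a mixed object column could give a different answer for another row.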
When I try to check the datatypes for the columns in the dataframe below, I get them as 'object' and not the numerical type I'm expecting:
from datetime import datetime, timedelta
import pandas as pd

df = pd.DataFrame(columns=('time', 'test1', 'test2'))
for i in range(20):
    df.loc[i] = [datetime.now() - timedelta(hours=i*1000), i*10, i*100]
df.dtypes
df.dtypes
time datetime64[ns]
test1 object
test2 object
dtype: object
When I do the following, it seems to give me an accurate result (long is Python 2 only; drop it on Python 3):
isinstance(df['test1'][len(df['test1'])-1], (int, long, float, complex))
returns
True
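Rather than inspecting individual values, the object columns produced by row-by-row .loc assignment can be converted to proper numeric dtypes, e.g. with infer_objects (a sketch; pd.to_numeric with errors='coerce' is another option):

```python
import pandas as pd

df = pd.DataFrame(columns=('test1', 'test2'))
for i in range(5):
    df.loc[i] = [i * 10, i * 100]  # row-wise assignment leaves dtype object

print(df.dtypes.tolist())   # both columns are object

# Let pandas re-infer better dtypes for object columns.
fixed = df.infer_objects()
print(all(fixed[c].dtype.kind in 'biufc' for c in fixed.columns))  # True
```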
Just to add to all the other answers: one can also use df.info() to see what the data type of each column is.
df.dtypes works too.

Assuming you want to keep your data in the same type, I found the following works similarly to df._get_numeric_data():
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1.0, 2.0, 3.0],
                   'C': [4.0, 'x2', 6], 'D': [np.nan]*3})
test_dtype_df = df.loc[:, df.apply(lambda s: s.dtype.kind in 'biufc')]
test_dtype_df.shape == df._get_numeric_data().shape
Out[1]: True
However, if you want to test whether a series converts properly, you can use errors='ignore':
df_ = df.copy().apply(pd.to_numeric, errors='ignore')
test_nmr_ignore = df_.loc[:, df_.apply(lambda s: s.dtype.kind in 'biufc')]
display(test_nmr_ignore)
test_nmr_ignore.shape == df._get_numeric_data().shape,\
test_nmr_ignore.shape == df_._get_numeric_data().shape,\
test_nmr_ignore.shape
B D
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
Out[2]: (True, True, (3, 2))
Finally, in the case where some data is mixed, you can use errors='coerce' with the pd.to_numeric function, and then drop columns that are filled entirely with np.nan values.
df_ = df.copy().apply(pd.to_numeric, errors='coerce')
test_nmr_coerce = df_.dropna(axis=1, how='all')
display(test_nmr_coerce)
B C
0 1.0 4.0
1 2.0 NaN
2 3.0 6.0
For accuracy, you may have to determine which columns were entirely np.nan in the original data. I merged the original np.nan columns back in with the converted data, df_:
nacols = [c for c in df.columns if c not in df.dropna(axis=1, how='all').columns]
display(pd.merge(test_nmr_coerce,
df[nacols],
right_index=True, left_index=True))
B C D
0 1.0 4.0 NaN
1 2.0 NaN NaN
2 3.0 6.0 NaN
If you want to check for numeric types in Pandas but exclude Booleans and complex numbers, you can use pandas.api.types.is_any_real_numeric_dtype()
which was introduced in Pandas 2.0.0 (April 2023).
import pandas as pd
from pandas.api.types import is_any_real_numeric_dtype
df = pd.DataFrame(
{
"A": [1, 2, 3],
"B": [1.0, 2.0, 3.0],
"C": [1j, 2j, 3j],
"D": ["a", "b", "c"],
"E": [True, False, True],
}
)
is_any_real_numeric_dtype(df["A"])
>>> True
is_any_real_numeric_dtype(df["B"])
>>> True
is_any_real_numeric_dtype(df["C"])
>>> False
is_any_real_numeric_dtype(df["D"])
>>> False
is_any_real_numeric_dtype(df["E"])
>>> False