import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_str': ["a", "b", "c"],
    'col_lst_str': [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
    'col_lst_int': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col_arr_int': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
})

print(df.dtypes)
print(pd.api.types.is_object_dtype(df['col_lst_int'].dtype))              # True, as expected
print(pd.api.types.is_object_dtype(df['col_arr_int'].dtype))              # True, as expected
print(pd.api.types.is_string_dtype(df['col_lst_int'].dtype))              # True, confusing!
print(pd.api.types.is_string_dtype(df['col_arr_int'].dtype))              # True, confusing!
print(df['col_lst_int'].apply(lambda x: isinstance(x, list)).all())       # True, as expected
print(df['col_arr_int'].apply(lambda x: isinstance(x, np.ndarray)).all()) # True, as expected

When a pandas DataFrame column contains lists or NumPy arrays of integers (column dtype object), both pd.api.types.is_object_dtype() and pd.api.types.is_string_dtype() return True, which is completely misleading. I was expecting pd.api.types.is_string_dtype() to return False. My column now seems to have two valid dtypes, object and string, which can cause serious problems in conditional logic. Even the official API doc is misleading, since it claims the elements must be inferred as strings. How can the elements 1, 2, 3 be inferred as strings in my example? It seems to work as expected with a pandas Series, though. Is it a bug with DataFrames?


  • df['col_lst_int'] => dtype: object, not int, and is_string_dtype(object) => True. But is_string_dtype(pd.Series([1, 2])) is False. Why did you expect a list of ints to be typed int? Commented Aug 1 at 14:23
  • If you think it is a bug then maybe you should report it to the pandas authors. But first you could check the source code. Commented Aug 1 at 14:28
  • @furas It is difficult to imagine is_string_dtype(object) => True being a bug, because it is stated explicitly in the doc. And how could a list (of ints or anything else) not be an object? Commented Aug 1 at 14:33
  • @furas But as expected, with df2 = pd.DataFrame({'col_int': [1, 2, 3]}), df2['col_int'].dtype => int64 and pd.api.types.is_string_dtype(df2['col_int']) => False. Commented Aug 1 at 14:38
  • @bruno: is_string_dtype(object) => True is indeed the answer (thanks for pointing that out). I was focused on "the elements must be inferred as strings". Despite the one example in the doc, is_string_dtype(object) => True makes things very confusing. This loophole should be stated more clearly, because a column now seems to have two dtypes (object, the real one, when using is_object_dtype, and string, the wrong one, when using is_string_dtype). I know pandas has historical reasons, since string columns are stored as object by default, but you can't expect everybody to know this level of detail. Commented Aug 1 at 17:32

1 Answer


A pandas Series with dtype=object is a generic container. It doesn't store a specific, uniform data type the way a Series with dtype=int64 does. Instead, it stores Python objects of any type (integers, strings, lists, dictionaries, etc.).

The function pd.api.types.is_string_dtype() is designed to check for explicit string dtypes (like 'string') and also to handle the legacy object dtype, which was historically the only way to store text data in pandas.

For backward-compatibility and performance reasons, pandas doesn't always deeply inspect every single element of an object-dtype column to determine its content. When you pass it a bare object dtype (as you do with df['col_lst_int'].dtype, rather than the Series itself), there are no elements to inspect, so it falls back to a conservative heuristic that returns True for any object dtype. The documentation for is_string_dtype is a bit ambiguous here, stating that for an object-dtype array "the elements must be inferred as strings", which doesn't match what happens when only the dtype is available.
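A quick way to see this in action, assuming pandas 2.0 or newer (since 2.0, is_string_dtype inspects the elements when given an array-like, but a bare object dtype carries no elements to inspect):

```python
import pandas as pd
from pandas.api.types import is_string_dtype

s = pd.Series([[1, 2], [3, 4]])  # a column of lists; dtype is object

# Passing the bare dtype: there are no elements to inspect,
# so the conservative object-dtype check returns True.
print(is_string_dtype(s.dtype))  # True

# Passing the Series itself: pandas 2.0+ infers the element type,
# and lists are not strings.
print(is_string_dtype(s))        # False
```

This matches your observation that passing the Series "works as expected" while passing its dtype does not.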

This behavior is not a bug in the traditional sense, but rather an initial design choice. The ambiguity of dtype=object is precisely why the dedicated StringDtype (accessible via dtype='string') was introduced in pandas 1.0.
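A minimal sketch of the dedicated dtype: with StringDtype, the object and string checks no longer overlap, so the "two dtypes at once" confusion from the question goes away.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

s_str = pd.Series(["a", "b", "c"], dtype="string")  # dedicated StringDtype

print(s_str.dtype)             # string
print(is_string_dtype(s_str))  # True
print(is_object_dtype(s_str))  # False: no longer conflated with object
```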

The is_string_dtype behavior is a remnant of a time when dtype=object was the only way to store strings, and the function needed to be broad enough to catch that case.
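If you need to branch on what an object column actually contains, pandas.api.types.infer_dtype inspects the values directly; a sketch, reusing columns from the question:

```python
import pandas as pd
from pandas.api.types import infer_dtype

df = pd.DataFrame({
    'col_str': ["a", "b", "c"],
    'col_lst_int': [[1, 2], [3, 4], [5, 6]],
})

# infer_dtype looks at the elements, not just the container dtype.
print(infer_dtype(df['col_str']))      # 'string'
print(infer_dtype(df['col_lst_int']))  # not 'string' (lists are reported as a non-string kind)
```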
