This question is similar to "Pandas selecting by label sometimes return Series, sometimes returns DataFrame", but I didn't find a solution there. I have two DataFrames read from CSV, each with a (str, int) MultiIndex.
data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
idx = data1.index[0] # first index: ('0147S00044', 0)
data1.loc[idx] # returns Series, as I would expect
data2.loc[idx] # returns 1xN DataFrame
data2['col1'].loc[idx] # returns Series with 1 value
data2.loc[idx[0]].loc[idx[1]] # returns Series -- how is this different from above???
data2['col1'].loc[idx[0]].loc[idx[1]] # returns actual value
The docs describe the behaviour I see with data1, which also makes sense to me. What is happening with data2, and why does it behave in this rather weird way?
EDIT: working example:
import pandas as pd
from io import StringIO
file1 = StringIO("pdf_name;page;col1;col2\npdf1;0;val1;val2\npdf2;0;asdf;ffff")
file2 = StringIO("pdf_name;page;col1;col2\npdf1;0;;\npdf2;0;;\npdf2;0;;")
data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
data2 = data2.sort_index() # avoid performance warning
idx = data1.index[0]
print(idx) # ('pdf1', 0)
print("data1.loc[idx]", type(data1.loc[idx])) # data1.loc[idx] <class 'pandas.core.series.Series'>
print("data2.loc[idx]", type(data2.loc[idx])) # data2.loc[idx] <class 'pandas.core.frame.DataFrame'>
print("data2.loc[idx].shape", data2.loc[idx].shape) # (1, 2) -- single row
print("data2['col1'].loc[idx]", type(data2['col1'].loc[idx])) # data2['col1'].loc[idx] <class 'pandas.core.series.Series'>
I figured out that this happens whenever the DataFrame has at least two rows with an identical index, even if the queried index itself does not have any duplicates. Is this intended behaviour?
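To check that hypothesis in isolation, here is a minimal sketch that builds one frame with a duplicated index entry and a de-duplicated copy of it, then queries both with a key that is not itself duplicated. The names `dup`, `uniq` and `key` and the numeric values are made up for illustration; the only assumption is that the return type of `.loc` follows `index.is_unique` rather than the number of rows matching the key.

```python
import pandas as pd

# Hypothetical data: ("pdf2", 0) appears twice, ("pdf1", 0) only once.
index = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)], names=["pdf_name", "page"]
)
dup = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]}, index=index).sort_index()

# Same data with repeated index entries dropped (keeps the first occurrence).
uniq = dup[~dup.index.duplicated(keep="first")]

key = ("pdf1", 0)  # this key is not duplicated in either frame
dup_result = dup.loc[key]
uniq_result = uniq.loc[key]

# Compare index uniqueness with the type returned by .loc for the same key.
print(dup.index.is_unique, type(dup_result).__name__)
print(uniq.index.is_unique, type(uniq_result).__name__)
```

If the hypothesis holds, checking `df.index.is_unique` up front (or de-duplicating as above) would be one way to get a predictable return type, though whether dropping rows is acceptable depends on the data.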