Why does Pandas loc with a MultiIndex return a matrix with a single row?

This question is similar to "Pandas selecting by label sometimes return Series, sometimes returns DataFrame", but I didn't find a solution there. I have two DataFrames read from CSV, each with a MultiIndex of (str, int):

data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
idx = data1.index[0]  # first index: ('0147S00044', 0)
data1.loc[idx]  # returns Series, as I would expect
data2.loc[idx]  # returns 1xN DataFrame
data2['col1'].loc[idx]  # returns Series with 1 value
data2.loc[idx[0]].loc[idx[1]]  # returns Series -- how is this different from above???
data2['col1'].loc[idx[0]].loc[idx[1]]  # returns actual value

The docs describe the behaviour I see with data1, which also makes sense to me. What is happening with data2, and why does it behave in this rather weird way?

EDIT: working example:

import pandas as pd
from io import StringIO

file1 = StringIO("pdf_name;page;col1;col2\npdf1;0;val1;val2\npdf2;0;asdf;ffff")
file2 = StringIO("pdf_name;page;col1;col2\npdf1;0;;\npdf2;0;;\npdf2;0;;")
data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
data2 = data2.sort_index()  # avoid performance warning
idx = data1.index[0]
print(idx)  # ('pdf1', 0)
print("data1.loc[idx]", type(data1.loc[idx]))  # data1.loc[idx] <class 'pandas.core.series.Series'>
print("data2.loc[idx]", type(data2.loc[idx]))  # data2.loc[idx] <class 'pandas.core.frame.DataFrame'>
print("data2.loc[idx].shape", data2.loc[idx].shape)  # (1, 2)  -- single row
print("data2['col1'].loc[idx]", type(data2['col1'].loc[idx]))  # data2['col1'].loc[idx] <class 'pandas.core.series.Series'>

I figured out that this happens whenever the dataset has at least two rows with an identical index, even if the queried index value itself has no duplicates. Is this the intended behaviour?
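
A minimal sketch of that observation, reusing the CSV data from the working example. It is not an explanation of pandas internals; it only shows that the return type tracks `index.is_unique` on the whole MultiIndex, and one way to get a predictable type regardless. The names `unique_df`, `dup_df` and the helper `single_row` are made up for illustration and are not part of pandas.

```python
import pandas as pd
from io import StringIO

csv_unique = "pdf_name;page;col1;col2\npdf1;0;val1;val2\npdf2;0;asdf;ffff"
csv_dup = "pdf_name;page;col1;col2\npdf1;0;;\npdf2;0;;\npdf2;0;;"

unique_df = pd.read_csv(StringIO(csv_unique), sep=";", index_col=["pdf_name", "page"])
dup_df = pd.read_csv(StringIO(csv_dup), sep=";", index_col=["pdf_name", "page"]).sort_index()

idx = ("pdf1", 0)

# The whole index is checked for uniqueness, not just the queried key.
print(unique_df.index.is_unique)            # True
print(dup_df.index.is_unique)               # False
print(dup_df.index.duplicated(keep=False))  # marks the two ('pdf2', 0) rows

# As reported in the question: Series for the unique index,
# single-row DataFrame for the index that contains duplicates elsewhere.
print(type(unique_df.loc[idx]).__name__)
print(type(dup_df.loc[idx]).__name__)


def single_row(df, key):
    """Hypothetical helper: return exactly one row as a Series, or raise."""
    rows = df.loc[[key]]  # a list of keys always yields a DataFrame
    if len(rows) != 1:
        raise ValueError(f"expected 1 row for {key!r}, found {len(rows)}")
    return rows.iloc[0]


print(type(single_row(unique_df, idx)).__name__)
print(type(single_row(dup_df, idx)).__name__)
```

Wrapping the key in a list and then taking `.iloc[0]` makes the "exactly one row" assumption explicit, so the calling code does not depend on whether other keys in the file happen to be duplicated.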

edited Apr 2, 2024 at 17:19

asked Apr 2, 2024 at 16:23

N4ppeL


This is interesting; I couldn't figure out what's happening behind the scenes. You could open an issue on the pandas GitHub repo so the developers can review it.

– Ynjxsjmh, Apr 3, 2024 at 15:37

If the index has duplicates, as in data2, how else could/should pandas treat this? Isn't the only way to still read those rows to take them as a DataFrame? The index ('pdf1', 0) in this case points to a DataFrame, so I believe this behavior is both expected and wanted.

– mudskipper, Apr 9, 2024 at 14:16
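
To make the point in the comment above concrete, here is a small sketch with a hand-built frame (the integer values are invented for illustration). A key that really is duplicated has to come back as several rows, and in the question's setup the non-duplicated key in the same frame comes back as a one-row DataFrame too. Dropping the duplicated rows first is one possible way, under that assumption, to get the Series behaviour back.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)],
    names=["pdf_name", "page"],
)
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]}, index=index)

# A duplicated key matches two rows, so only a DataFrame can hold the result.
dup_rows = df.loc[("pdf2", 0)]
print(type(dup_rows).__name__, dup_rows.shape)

# The non-duplicated key in the same frame: one row, still a DataFrame
# (the behaviour described in the question).
single = df.loc[("pdf1", 0)]
print(type(single).__name__, single.shape)

# Keep only rows whose full key occurs once; the remaining index is unique.
mask = df.index.duplicated(keep=False)
unique_part = df[~mask]
print(unique_part.index.is_unique)
print(type(unique_part.loc[("pdf1", 0)]).__name__)
```

Whether discarding the duplicated rows is acceptable depends on the data; the alternative is to keep them and always treat the result of `.loc` as a DataFrame.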
