Context

I'm iterating through a set of files, one per subject. Each file has 3 columns holding the x, y and z axis readings at each point (the files are not all the same length). I want to put all of them into a multi-index pandas DataFrame.

What I've tried

I found this post, and following it seems to work:

import os

import numpy as np
import pandas as pd

d_ = dict()
DATA_ROOT = "../sample_data/chest_mounted/"
cutoff_min = 0
for fileName in os.listdir(DATA_ROOT):
    if ".csv" in fileName and '.swp' not in fileName:
        with open(DATA_ROOT + fileName) as f:
            # Drop the first and last field of every line; use the builtin int
            # as dtype, since the np.int alias was removed in NumPy 1.24
            data = np.asarray([line.strip().split(",")[1:-1] for line in f], dtype=int)
            subj_key = "Subject_" + str(fileName.split(".")[0])
            d_[subj_key] = pd.DataFrame(data, columns=['x_acc', 'y_acc', 'z_acc'])
df = pd.concat(d_.values(), keys=d_.keys())

When I do df.head() it looks exactly like what I want (I think?)

                x_acc   y_acc   z_acc
Subject_1   0   1502    2215    2153
            1   1667    2072    2047
            2   1611    1957    1906
            3   1601    1939    1831
            4   1643    1965    1879

The Problem

However, when I try to index by Subject_x I get an error. Instead, I have to first do something like

df["x_acc"]["Subject_1"] 

where I access the x_acc first then the Subject_1.

Questions

1) I had the impression that I was creating a multi-index, but having to write df["x_acc"]["Subject_1"] suggests that is not the case. How do I turn it into one?

2) Is there any way to change the index so that I access by Subject first?

1 Answer

Use loc for selecting, first by the level of the MultiIndex and then by column name, or use xs, which is implemented for simple selections:

x_acc = df.loc['Subject_1', 'x_acc']
print(x_acc)
0    1502
1    1667
2    1611
3    1601
4    1643
Name: x_acc, dtype: int64

subject_1 = df.xs('Subject_1')
print(subject_1)
   x_acc  y_acc  z_acc
0   1502   2215   2153
1   1667   2072   2047
2   1611   1957   1906
3   1601   1939   1831
4   1643   1965   1879
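
As a minimal sketch of why indexing by Subject_x fails while the selections above work: the concatenated frame already has a two-level MultiIndex on its rows, but plain df[...] looks the key up in the columns. The two small frames below are made-up stand-ins for the subject files, not the real data.

```python
import pandas as pd

# Synthetic stand-ins for two subject files of different lengths (assumed data).
frames = {
    "Subject_1": pd.DataFrame(
        {"x_acc": [1502, 1667, 1611], "y_acc": [2215, 2072, 1957], "z_acc": [2153, 2047, 1906]}
    ),
    "Subject_2": pd.DataFrame(
        {"x_acc": [1400, 1450], "y_acc": [2000, 2010], "z_acc": [2100, 2110]}
    ),
}
df = pd.concat(frames)

# The row index has two levels: (subject key, row number within the file).
print(df.index.nlevels)

# Plain [] searches the column labels, so a subject key is not found there.
try:
    df["Subject_1"]
except KeyError:
    print("'Subject_1' is not a column label")

# Row-based selection on the outer level goes through loc or xs.
subject_1 = df.loc["Subject_1"]
same_rows = df.xs("Subject_1")
x_only = df.loc["Subject_1", "x_acc"]
print(x_only)
```

So the frame is already a MultiIndex frame; only the accessor needs to change to select by subject first.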

And for more complicated selections use slicers:

idx = pd.IndexSlice

xy = df.loc['Subject_1', idx['x_acc','y_acc']]
print(xy)
   x_acc  y_acc
0   1502   2215
1   1667   2072
2   1611   1957
3   1601   1939
4   1643   1965
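
Slicers become more useful once the selection spans several subjects. The following is a rough sketch on made-up data; the level names "subject" and "sample" are my own labels and do not appear in the question. It takes the first two samples of every subject in one call.

```python
import pandas as pd

# Synthetic stand-ins for the subject files (assumed data).
frames = {
    "Subject_1": pd.DataFrame(
        {"x_acc": [1502, 1667, 1611], "y_acc": [2215, 2072, 1957], "z_acc": [2153, 2047, 1906]}
    ),
    "Subject_2": pd.DataFrame(
        {"x_acc": [1400, 1450, 1475], "y_acc": [2000, 2010, 2020], "z_acc": [2100, 2110, 2120]}
    ),
}
# names labels the two index levels; these labels are hypothetical.
df = pd.concat(frames, names=["subject", "sample"])

# Label slicing on a MultiIndex needs a sorted index.
df = df.sort_index()

idx = pd.IndexSlice

# Every subject (:), sample labels 0 through 1 (label slices are inclusive),
# restricted to two of the three columns.
first_two = df.loc[idx[:, 0:1], ["x_acc", "y_acc"]]
print(first_two)

# Cross-section on the inner level: sample 0 of every subject.
sample_0 = df.xs(0, level="sample")
print(sample_0)
```

Unlike selecting a single subject, a slice keeps both index levels in the result.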

It also seems your code could be simplified with read_csv:

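d_ = dict()
DATA_ROOT = "../sample_data/chest_mounted/"
cutoff_min = 0
for fileName in os.listdir(DATA_ROOT):
    if ".csv" in fileName and '.swp' not in fileName:
        subj_key = "Subject_" + str(fileName.split(".")[0])
        # os.listdir returns bare file names, so prepend the directory
        d_[subj_key] = pd.read_csv(os.path.join(DATA_ROOT, fileName), names=['x_acc', 'y_acc', 'z_acc'])

# Passing the dict directly uses its keys as the outer index level
df = pd.concat(d_)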
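
To try the read_csv approach without the real data, here is a self-contained sketch that writes two tiny files to a temporary directory first. The file names and values are invented, and it assumes each file holds exactly the three axis columns with no header row.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp)
    # Hypothetical subject files of different lengths, no header row.
    (data_root / "1.csv").write_text("1502,2215,2153\n1667,2072,2047\n")
    (data_root / "2.csv").write_text("1400,2000,2100\n")

    frames = {}
    for path in sorted(data_root.glob("*.csv")):
        # path.stem is the file name without its extension, e.g. "1".
        frames["Subject_" + path.stem] = pd.read_csv(
            path, names=["x_acc", "y_acc", "z_acc"]
        )

# The dict keys become the outer level of the row MultiIndex.
df = pd.concat(frames)
print(df)
```

Globbing for "*.csv" also skips editor swap files such as .swp, which the loop above filters out by hand.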

3 Comments

awesome! Thank you so much it worked. Also nice one re: read_csv it was way faster than what I was doing
I noticed that OP doesn't use all items in the rows of the file, only slicing [1:-1], so you need to modify the pd.read_csv a little bit.
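@StayFoolish - Yes, if you need to remove the first and last row, d_[subj_key] = pd.read_csv(fileName, names=['x_acc', 'y_acc', 'z_acc']).iloc[1:-1] should work. Note that this drops rows, whereas the [1:-1] in the question slices the fields of each line, i.e. columns.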
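
If the goal is to mirror the [1:-1] from the question, which drops the first and last field of every line, one possible sketch is to read all columns and slice them by position. The five-field layout below (a leading counter and a trailing label around the three axes) is purely an assumption for illustration; the real file layout is not shown in the question.

```python
import io

import pandas as pd

# Hypothetical file content: counter, x, y, z, label (layout is assumed).
text = "1,1502,2215,2153,1\n2,1667,2072,2047,1\n"

# Read every field without treating the first line as a header.
raw = pd.read_csv(io.StringIO(text), header=None)

# Keep all rows, drop the first and last column, then name the remaining ones.
trimmed = raw.iloc[:, 1:-1].set_axis(["x_acc", "y_acc", "z_acc"], axis=1)
print(trimmed)
```

By contrast, .iloc[1:-1] without the leading colon slices rows, so it would discard the first and last sample instead.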
