24

In the data I am working with the index is compound - i.e. it has both item name and a timestamp, e.g. [email protected]|2013-05-07 05:52:51 +0200.

I want to do hierarchical indexing, so that the same e-mails are grouped together, so I need to convert a DataFrame Index into a MultiIndex (e.g. for the entry above - ([email protected], 2013-05-07 05:52:51 +0200)).

What is the most convenient method to do so?

3 Answers

28

Once we have a DataFrame

import pandas as pd
df = pd.read_csv("input.csv", index_col=0)  # or from another source

and a function mapping each index to a tuple (below, it is for the example from this question)

def process_index(k):
    return tuple(k.split("|"))

we can create a hierarchical index in the following way:

df.index = pd.MultiIndex.from_tuples([process_index(k) for k in df.index])
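
To try this without an input file, the same idea can be run on a small in-memory frame. This is only a minimal sketch: the addresses, the values, and the level names "e-mail" and "date" are made up for illustration.

```python
import pandas as pd

def process_index(k):
    # Split "name|timestamp" into a (name, timestamp) tuple.
    return tuple(k.split("|"))

# Hypothetical sample data standing in for the real input.
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=[
        "alice@example.com|2013-05-07 05:52:51 +0200",
        "alice@example.com|2013-05-08 10:00:00 +0200",
        "bob@example.com|2013-05-07 06:00:00 +0200",
    ],
)

df.index = pd.MultiIndex.from_tuples(
    [process_index(k) for k in df.index],
    names=["e-mail", "date"],  # level names are an illustrative choice
)

# All rows for one address can now be selected through the first level.
alice = df.loc["alice@example.com"]
```

Naming the levels is optional, but it makes later calls such as df.groupby(level="e-mail") easier to read.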

An alternative approach is to create two columns then set them as the index (the original index will be dropped):

df['e-mail'] = [x.split("|")[0] for x in df.index] 
df['date'] = [x.split("|")[1] for x in df.index]
df = df.set_index(['e-mail', 'date'])

or even shorter

df['e-mail'], df['date'] = zip(*map(process_index, df.index))
df = df.set_index(['e-mail', 'date'])
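
Note that the approaches above leave the second level as plain strings. If real timestamps are wanted, the date part can be parsed before it becomes an index level. The following is a sketch, assuming every timestamp follows the "YYYY-MM-DD HH:MM:SS +ZZZZ" pattern from the question; the sample rows are hypothetical, and normalizing to UTC is just one possible choice.

```python
import pandas as pd

# Hypothetical sample data in the question's "name|timestamp" format.
df = pd.DataFrame(
    {"value": [1, 2]},
    index=[
        "alice@example.com|2013-05-07 05:52:51 +0200",
        "bob@example.com|2013-05-07 06:00:00 +0200",
    ],
)

# Separate the two parts of every index entry.
emails, dates = zip(*(k.split("|") for k in df.index))

# Parse the strings; utc=True converts the +0200 offset to UTC.
parsed = pd.to_datetime(list(dates), format="%Y-%m-%d %H:%M:%S %z", utc=True)

df.index = pd.MultiIndex.from_arrays(
    [list(emails), parsed], names=["e-mail", "date"]
)
```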

3 Comments

This was very helpful. But, as far as I can see, when calling set_index() the default is inplace=False, so one has to use inplace=True or else assign df back to itself.
@Moot Thanks, updated. Either it was a typo, or back then (4 years ago) it was inplace by default.
Thanks! I was too fast and careless.
14

In pandas>=0.16.0, we can use the .str accessor on indices. This makes the following possible:

df.index = pd.MultiIndex.from_tuples(df.index.str.split('|').tolist())

(Note: I tried the more intuitive pd.MultiIndex.from_arrays(df.index.str.split('|')), but for some reason that gives me errors.)
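
A likely reason for those errors (an assumption, not something confirmed in this answer) is that from_arrays expects one sequence per level, whereas str.split yields one list per row, so the result has to be transposed first. Recent pandas versions also accept expand=True on Index.str.split, which returns a MultiIndex directly. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical index in the question's "name|timestamp" format.
idx = pd.Index([
    "alice@example.com|2013-05-07 05:52:51 +0200",
    "bob@example.com|2013-05-07 06:00:00 +0200",
])

# Option 1: transpose the per-row lists into one list per level.
by_level = [list(level) for level in zip(*idx.str.split("|"))]
m1 = pd.MultiIndex.from_arrays(by_level)

# Option 2: let pandas expand the split parts into levels.
m2 = idx.str.split("|", expand=True)
```

Either result can be assigned to df.index in the same way as the from_tuples version.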

5

My preference would be to initially read this in as a column (i.e. not as an index), then you can use the str split method:

from io import StringIO
import pandas as pd
csv = '\n'.join(['[email protected]|2013-05-07 05:52:51 +0200, 42'] * 3)
df = pd.read_csv(StringIO(csv), header=None)

In [13]: df[0].str.split('|')
Out[13]:
0    [[email protected], 2013-05-07 05:52:51 +0200]
1    [[email protected], 2013-05-07 05:52:51 +0200]
2    [[email protected], 2013-05-07 05:52:51 +0200]
Name: 0, dtype: object

And then feed this into a MultiIndex (perhaps this can be done more cleanly?):

# list() materializes the zip iterator, which some pandas versions on Python 3 need
m = pd.MultiIndex.from_arrays(list(zip(*df[0].str.split('|'))))

Delete the 0th column and set the index to the new MultiIndex:

del df[0]
df.index = m

In [17]: df
Out[17]:
                                            1
[email protected] 2013-05-07 05:52:51 +0200  42
                2013-05-07 05:52:51 +0200  42
                2013-05-07 05:52:51 +0200  42
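
One possibly cleaner variant of the steps above is to let str.split expand the parts into columns and then promote them with set_index. This is only a sketch: the address is a made-up placeholder, and the column names "e-mail" and "date" are illustrative choices.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row input in the same "name|timestamp,value" layout.
csv = "\n".join(["alice@example.com|2013-05-07 05:52:51 +0200,42"] * 3)
df = pd.read_csv(StringIO(csv), header=None)

# Split column 0 into two new columns, then promote them to index levels.
parts = df[0].str.split("|", expand=True)
df["e-mail"] = parts[0]
df["date"] = parts[1]
df = df.drop(columns=0).set_index(["e-mail", "date"])
```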
