
I have a pandas Series of dictionaries, and I want to convert it to a DataFrame with the same index.

The only way I have found is to go through the Series' to_dict method, which is not very efficient because it drops back into pure Python instead of staying in numpy/pandas/Cython.

Do you have suggestions for a better approach?

Thanks a lot.

>>> import pandas as pd
>>> flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
>>> flagInfoSeries
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object
>>> pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20

3 Answers


I think you can use a list comprehension:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
print(flagInfoSeries)
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
0   1   2
1  10  20

print(pd.DataFrame([x for x in flagInfoSeries]))
    a   b
0   1   2
1  10  20

Timing:

In [203]: %timeit pd.DataFrame(flagInfoSeries.to_dict()).T
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 554 µs per loop

In [204]: %timeit pd.DataFrame([x for x in flagInfoSeries])
The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 361 µs per loop

In [209]: %timeit flagInfoSeries.apply(lambda d: pd.Series(d))
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 751 µs per loop

EDIT:

If you need to keep the index, add index=flagInfoSeries.index to the DataFrame constructor:

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))

Timings:

In [257]: %timeit pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
1000 loops, best of 3: 350 µs per loop

Sample:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
flagInfoSeries.index = [2,8]
print(flagInfoSeries)
2      {'a': 1, 'b': 2}
8    {'a': 10, 'b': 20}

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
2   1   2
8  10  20

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))
    a   b
2   1   2
8  10  20
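A further variation (my own sketch, not part of the answer): Series.tolist() materializes the same list of dicts as the comprehension and reads a bit more directly, and the index can be passed along the same way:

```python
import pandas as pd

flagInfoSeries = pd.Series([{'a': 1, 'b': 2}, {'a': 10, 'b': 20}])
flagInfoSeries.index = [2, 8]

# tolist() yields the same list of dicts as [x for x in flagInfoSeries];
# passing the Series' own index keeps the original labels on the result.
df = pd.DataFrame(flagInfoSeries.tolist(), index=flagInfoSeries.index)
print(df)
```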

8 Comments

Yep, so your computer is faster, but your code still wins :)
Yes, you are right. I wanted to add a comparison on my PC. :)
Thanks for those suggestions. Indeed, there are performance improvements ... but the indexes are not kept: the list comprehension gives a list [{mydict}, ...] without the index, while to_dict gives a dictionary of {index: {mydict}, ...}. I think I'll keep it like this for now.
Solution was modified, please check it.
It's even faster with the index!

You can use pd.json_normalize(flagInfoSeries).
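One caveat worth noting (my addition, not stated in the answer): json_normalize treats the Series as a list of records and returns a DataFrame with a fresh RangeIndex, so the original index has to be reattached by hand. A minimal sketch:

```python
import pandas as pd

flagInfoSeries = pd.Series([{'a': 1, 'b': 2}, {'a': 10, 'b': 20}], index=[2, 8])

# json_normalize flattens the dicts into columns but discards the Series'
# index, so restore it explicitly afterwards.
df = pd.json_normalize(flagInfoSeries)
df.index = flagInfoSeries.index
print(df)
```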


This avoids to_dict, but apply could be slow too:

flagInfoSeries.apply(lambda d: pd.Series(d))  # avoid naming the argument "dict" (shadows the builtin)

Edit: I see that jezrael has added timing comparisons. Here is mine:

%timeit flagInfoSeries.apply(lambda d: pd.Series(d))
1000 loops, best of 3: 935 µs per loop
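For comparison (my own sketch, not part of this answer): DataFrame.from_records also accepts the dicts directly, and its index argument can take the Series' labels, which avoids both apply and the to_dict round-trip:

```python
import pandas as pd

flagInfoSeries = pd.Series([{'a': 1, 'b': 2}, {'a': 10, 'b': 20}], index=[2, 8])

# from_records consumes a sequence of dicts; passing the Series' index as
# the array-like index argument reattaches the original labels.
df = pd.DataFrame.from_records(list(flagInfoSeries), index=flagInfoSeries.index)
print(df)
```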

1 Comment

Thanks. I've tried this, but indeed, apply is slow.
