import pandas as pd

my_df = pd.DataFrame({'ID': ['12345', '23456', '34567'],
         'Info': [[['Rob Kardashian', '00052369', '1987-03-17', 'Reality Star'], ['Brooke Barry', '00213658', '2001-03-30', 'TikTok Star']],
                [['Bae De Leon', '00896351', '1997-08-02', 'Volleyball Player'], ['Jonas Blue', '02369785', '1990-08-02', 'Music Producer'], ['Albert Einstein', '65231478', '1879-03-14', 'Scientist']],
                [['Robert Downey Jr', '23897410', '1965-04-04', 'Actor'], ['Stan Lee', '35239856', '1922-12-28', 'Publisher & Producer']]]})


Hi folks, I have the dataframe above and want to expand each inner list in the 'Info' column into its own row. I tried

[[pd.DataFrame(i) for i in k] for k in my_df['Info'].tolist()]

but the output is not what I expected.

Expected output: (image not available)

Thanks in advance for the help!

  • Why not have the ID repeated for the nested lists? Commented Jul 12, 2019 at 17:16
  • Yes, that also works for me. I'm just not sure how to create the table above. Commented Jul 12, 2019 at 17:18

2 Answers


You could use grouping:

my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))

This aggregates the returned dataframes for you:

>>> my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
                        0         1           2                     3
ID
12345 0    Rob Kardashian  00052369  1987-03-17          Reality Star
      1      Brooke Barry  00213658  2001-03-30           TikTok Star
23456 0       Bae De Leon  00896351  1997-08-02     Volleyball Player
      1        Jonas Blue  02369785  1990-08-02        Music Producer
      2   Albert Einstein  65231478  1879-03-14             Scientist
34567 0  Robert Downey Jr  23897410  1965-04-04                 Actor
      1          Stan Lee  35239856  1922-12-28  Publisher & Producer
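
To make the shape of that result a bit more concrete, here is a minimal sketch on a trimmed copy of the question's data (two IDs only, my own choice for brevity). It assumes a reasonably recent pandas; the names index_names and flat_columns are just illustrative. The outer index level takes its name from the grouping key, while the inner level (the row number inside each nested list) has no name, so reset_index() falls back to the generic label level_1 for it.

```python
import pandas as pd

# Trimmed copy of the question's data (two IDs only, for brevity).
my_df = pd.DataFrame({
    "ID": ["12345", "34567"],
    "Info": [
        [["Rob Kardashian", "00052369", "1987-03-17", "Reality Star"],
         ["Brooke Barry", "00213658", "2001-03-30", "TikTok Star"]],
        [["Stan Lee", "35239856", "1922-12-28", "Publisher & Producer"]],
    ],
})

# Each group holds exactly one row, so g.iloc[0] is that row's nested list.
expanded = my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))

# Outer level: the grouping key. Inner level: unnamed position in the nested list.
index_names = list(expanded.index.names)
print(index_names)

# Flattening the index turns both levels into ordinary columns.
flat_columns = list(expanded.reset_index().columns)
print(flat_columns)
```

The unnamed inner level is only a positional counter, which is why it is safe to discard once the ID has been moved into a column.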

You can then choose to reset the index and drop the level_1 column:

expanded = my_df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
expanded.reset_index().drop("level_1", axis=1)

which gives you

      ID                 0         1           2                     3
0  12345    Rob Kardashian  00052369  1987-03-17          Reality Star
1  12345      Brooke Barry  00213658  2001-03-30           TikTok Star
2  23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
3  23456        Jonas Blue  02369785  1990-08-02        Music Producer
4  23456   Albert Einstein  65231478  1879-03-14             Scientist
5  34567  Robert Downey Jr  23897410  1965-04-04                 Actor
6  34567          Stan Lee  35239856  1922-12-28  Publisher & Producer

However, because this uses GroupBy.apply(), I don't expect it to be all that fast.

Wrapping Andy's version and mine in functions and timing them confirms that my version is the slower option:

In [99]: def np_concat(df):
    ...:     df = df.set_index('ID')
    ...:     return pd.DataFrame(np.concatenate(df.Info), index=df.index.repeat(df.Info.str.len()))
    ...:

In [100]: def groupby(df):
     ...:    df = df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
     ...:    return df.reset_index().drop("level_1", axis=1)
     ...:

In [101]: %timeit np_concat(my_df)
1.08 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [102]: %timeit groupby(my_df)
6.33 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
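
If you want to repeat the comparison outside IPython, something along these lines should work. It is only a sketch: it uses the standard-library timeit module instead of %timeit, a trimmed copy of the question's data, and the function name groupby_apply, the run count of 50, and the .tolist() / .map(len) calls are my own choices rather than part of either answer. Absolute numbers will vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Trimmed copy of the question's data (two IDs only, for brevity).
my_df = pd.DataFrame({
    "ID": ["12345", "34567"],
    "Info": [
        [["Rob Kardashian", "00052369", "1987-03-17", "Reality Star"],
         ["Brooke Barry", "00213658", "2001-03-30", "TikTok Star"]],
        [["Stan Lee", "35239856", "1922-12-28", "Publisher & Producer"]],
    ],
})


def np_concat(df):
    """Stack all inner lists with NumPy and repeat each ID once per inner list."""
    df = df.set_index("ID")
    # .tolist() hands NumPy a plain list of nested lists (a defensive choice).
    values = np.concatenate(df["Info"].tolist())
    counts = df["Info"].map(len).to_numpy()
    return pd.DataFrame(values, index=df.index.repeat(counts)).reset_index()


def groupby_apply(df):
    """Build one small DataFrame per ID and let groupby stitch them together."""
    out = df.groupby("ID").Info.apply(lambda g: pd.DataFrame(g.iloc[0]))
    return out.reset_index().drop("level_1", axis=1)


# Sanity check: both approaches should produce the same rows.
a = np_concat(my_df)
b = groupby_apply(my_df)
same_rows = a.values.tolist() == b.values.tolist()
print("same rows:", same_rows)

runs = 50  # arbitrary; raise it for steadier numbers
for fn in (np_concat, groupby_apply):
    seconds = timeit.timeit(lambda: fn(my_df), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```

Both functions return their result here, so the sanity check can confirm they agree before any timing is taken.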

1 Comment

Thank you for walking through the answers!

Is this what you want:

my_df = my_df.set_index('ID')
pd.DataFrame(np.concatenate(my_df.Info),
             index=my_df.index.repeat(my_df.Info.str.len()))

Out[1102]:
                      0         1           2                     3
ID
12345    Rob Kardashian  00052369  1987-03-17          Reality Star
12345      Brooke Barry  00213658  2001-03-30           TikTok Star
23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
23456        Jonas Blue  02369785  1990-08-02        Music Producer
23456   Albert Einstein  65231478  1879-03-14             Scientist
34567  Robert Downey Jr  23897410  1965-04-04                 Actor
34567          Stan Lee  35239856  1922-12-28  Publisher & Producer

Note: I leave ID as the index of the output dataframe. If you need it as a column, chain an additional .reset_index() call as follows:

pd.DataFrame(np.concatenate(my_df.Info),
             index=my_df.index.repeat(my_df.Info.str.len())).reset_index()

Out[1106]:
      ID                 0         1           2                     3
0  12345    Rob Kardashian  00052369  1987-03-17          Reality Star
1  12345      Brooke Barry  00213658  2001-03-30           TikTok Star
2  23456       Bae De Leon  00896351  1997-08-02     Volleyball Player
3  23456        Jonas Blue  02369785  1990-08-02        Music Producer
4  23456   Albert Einstein  65231478  1879-03-14             Scientist
5  34567  Robert Downey Jr  23897410  1965-04-04                 Actor
6  34567          Stan Lee  35239856  1922-12-28  Publisher & Producer
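
In case it helps to see why the two pieces line up, here is a small step-by-step sketch on a trimmed copy of the question's data. The intermediate names (counts, repeated_ids, stacked) are only for illustration, and I use .map(len) and .tolist() here as plain-Python stand-ins for the .str.len() and direct Series calls above. The idea is that the number of inner lists per row is used twice: implicitly when the lists are stacked, and explicitly when each ID is repeated.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data (two IDs only, for brevity).
my_df = pd.DataFrame({
    "ID": ["12345", "34567"],
    "Info": [
        [["Rob Kardashian", "00052369", "1987-03-17", "Reality Star"],
         ["Brooke Barry", "00213658", "2001-03-30", "TikTok Star"]],
        [["Stan Lee", "35239856", "1922-12-28", "Publisher & Producer"]],
    ],
}).set_index("ID")

# Step 1: how many inner lists does each ID have?
counts = my_df["Info"].map(len).to_numpy()

# Step 2: repeat each ID that many times, giving one label per output row.
repeated_ids = my_df.index.repeat(counts)

# Step 3: stack all inner lists into one 2-D array, in the same row order.
stacked = np.concatenate(my_df["Info"].tolist())

# Step 4: the labels and the stacked rows have the same length, so they pair up.
result = pd.DataFrame(stacked, index=repeated_ids)
print(result)
```

This relies on every inner list having the same number of fields (four in the question's data); ragged inner lists would need a different approach.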

1 Comment

Yes. This solution perfectly solved my problem! Thanks!
