Pandas expand rows from list data available in column [duplicate]

Question

I have a data frame like this in pandas:

 column1      column2
 [a,b,c]        1
 [d,e,f]        2
 [g,h,i]        3

Expected output:

column1      column2
  a              1
  b              1
  c              1
  d              2
  e              2
  f              2
  g              3
  h              3
  i              3

How can I process this data?

What is print(type(df.ix[0, 'column1']))?

– jezrael, Commented Aug 18, 2016 at 6:43
print(type(df.ix[0, 'column1'])) returns list.

– Sanjay Yadav, Commented Aug 18, 2016 at 6:49

Erfan · Accepted Answer · 2020-11-10 14:06:14Z

82

`DataFrame.explode`

As of pandas 0.25.0 there is an explode method for this. It expands each list into one row per element and repeats the values of the other columns:

df.explode('column1').reset_index(drop=True)

Output


  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

As of pandas 1.1.0 explode also accepts an ignore_index argument, so there is no need to chain reset_index:

df.explode('column1', ignore_index=True)

Output

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
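
For reference, here is a self-contained sketch of the same call on a frame whose lists differ in length. The sample data is made up for illustration and is not from the question. It uses reset_index so that it should also run on pandas 0.25; an empty list is expected to produce a single row holding NaN, which can be dropped afterwards if it is not wanted.

```python
import pandas as pd

# Hypothetical sample data: lists of different lengths, including an empty one
df = pd.DataFrame({
    'column1': [['a', 'b', 'c'], ['d'], []],
    'column2': [1, 2, 3],
})

# One row per list element; the other columns are repeated
exploded = df.explode('column1').reset_index(drop=True)
print(exploded)

# The empty list becomes a NaN row; drop it if it should not be kept
cleaned = exploded.dropna(subset=['column1']).reset_index(drop=True)
print(cleaned)
```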

edited Nov 10, 2020 at 14:06

answered Mar 9, 2019 at 16:35

Erfan


6 Comments

Shiva Rama Krishna Over a year ago

If you are using pandas < 0.25.0, I made a patch that makes it work; see the gist: gist.github.com/BurakaKrishna/538cdad998247b95f9b2898015360a8e

Erfan Over a year ago

I see yours uses a lot of for loops. I would not advise people to use that approach; there are better vectorized alternatives for pandas < 0.25.0 @ShivaRamaKrishna

topher217 Over a year ago

Is there a good way to do this without lists as your index? For example, say I have two dataframes, one with Timestamps at second accuracy and another with only minute accuracy. I want to expand the one with minute accuracy by duplicating all values 60 times so I can merge them. I guess I can create a new index with a list of length 60 in each row and use this explode method, but I wondered if there is a more pandas way of doing this.

Erfan Over a year ago

That looks like a resample from minute to second problem, not explode per se @topher217.

topher217 Over a year ago

@Erfan perfect! Yes, I knew there had to be something. resample with pad or bfill looks like a great way to get this done. Thanks!
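
A minimal sketch of the resample idea from this exchange, assuming a frame indexed by minute-level timestamps. The names minute_df and second_df and the values are hypothetical. Note that upsampling only fills between the first and last timestamps, so the final minute is not expanded to 60 rows here.

```python
import pandas as pd

# Hypothetical minute-level data
minute_df = pd.DataFrame(
    {'value': [10, 20, 30]},
    index=pd.date_range('2020-01-01 00:00', periods=3, freq='min'),
)

# Upsample to one row per second and forward-fill each minute's value
second_df = minute_df.resample(pd.Timedelta(seconds=1)).ffill()
print(second_df.head())
```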


jezrael · Accepted Answer · 2016-08-18 07:11:12Z

You can build a DataFrame from the lists with the constructor and then stack it:

# wrap the chain in parentheses so it can span several lines
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='column1')[['column1','column2']])
print(df2)

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

Because the final subset [['column1','column2']] selects and reorders the columns anyway, you can also omit the first reset_index:

df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(name='column1')[['column1','column2']])
print(df2)
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

Another solution is to use DataFrame.from_records to create a DataFrame from the first column, turn it into a Series with stack, and join that Series to the original DataFrame:

df = pd.DataFrame({'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
                   'column2':[1,2,3]})


a = (pd.DataFrame.from_records(df.column1.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('column1'))

print (a)
0    a
0    b
0    c
1    d
1    e
1    f
2    g
2    h
2    i
Name: column1, dtype: object

print (df.drop('column1', axis=1)
         .join(a)
         .reset_index(drop=True)[['column1','column2']])

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

In typical pandas fashion, this fails if the column consists of empty lists. Perfect.
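
One possible way around the empty-list problem reported in the comment above on older pandas versions is to flatten the lists with itertools.chain and repeat the other column with numpy.repeat. This is a different technique from the stack-based code in this answer, and the sample data below is made up. Rows with empty lists simply disappear from the result.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical sample data that includes an empty list
df = pd.DataFrame({'column1': [['a', 'b'], [], ['c']],
                   'column2': [1, 2, 3]})

# Number of elements in each list
lengths = df['column1'].map(len).values

result = pd.DataFrame({
    # flatten all lists into one sequence
    'column1': list(chain.from_iterable(df['column1'])),
    # repeat each column2 value once per list element
    'column2': np.repeat(df['column2'].values, lengths),
})
print(result)
```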

bencekd · Accepted Answer · 2019-05-31 21:33:03Z

8

Another solution is to use the result_type='expand' argument of DataFrame.apply, available since pandas 0.23. In answer to @splinter's question, this method can be generalized to frames with more columns, as the second example below shows:

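import pandas as pd

df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3]}
)

# every list has the same length here, so take it from the first row
n = len(df['column1'].iloc[0])

# expand each list into columns 0..n-1; drop the original column1 first,
# because recent pandas versions reject a value_name that matches an existing column
wide = df.drop(columns='column1').join(
    df.apply(lambda row: row['column1'], axis=1, result_type='expand')
)

# melt the expanded columns back into rows, keeping column2 as the identifier
(pd.melt(wide, id_vars=['column2'], value_vars=list(range(n)), value_name='column1')
   .sort_values(['column2', 'variable'])
   .reset_index(drop=True)[['column1', 'column2']])

# can be generalized

df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3],
     'column3': [[1,2],[2,3],[3,4]],
     'column4': [42,23,321],
     'column5': ['a','b','c']}
)

n = len(df['column1'].iloc[0])
wide = df.drop(columns='column1').join(
    df.apply(lambda row: row['column1'], axis=1, result_type='expand')
)

# all remaining columns act as identifiers and are repeated for each element
(pd.melt(wide, id_vars=list(df.columns[1:]), value_vars=list(range(n)), value_name='column1')
   .drop(columns=['variable'])[list(df.columns)]
   .sort_values(by=['column1']))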

UPDATE (for Jwely's comment): if you have lists with varying length, you can do:

df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','f'],['g','h','i']],
     'column2': [1,2,3]}
)

# length of the longest list
longest = max(len(x) for x in df['column1'])

# pad shorter lists with None so every row expands to the same number of columns
wide = df.drop(columns='column1').join(
    df.apply(lambda row: row['column1'] + [None] * (longest - len(row['column1'])),
             axis=1, result_type='expand')
)

# melt, then drop the padding rows
(pd.melt(wide, id_vars=['column2'], value_vars=list(range(longest)), value_name='column1')
   .dropna(subset=['column1'])
   .sort_values(['column2', 'variable'])
   .reset_index(drop=True)[['column1', 'column2']])

edited May 31, 2019 at 21:33

answered Dec 1, 2018 at 12:47

bencekd


2 Comments

Jwely Over a year ago

I believe this solution requires every list in "column1" to be of the same length, 3 in this case.

bencekd Over a year ago

I think the question was about lists with same length in the first column, but with slight modifications you can do different list lengths -- see my edit
