0

I have a df.

df=pd.DataFrame(data=[[301,301,302,303],[['a'],['b','c'],['e','f',33,'Z'],42],index=['id','foo']).T

I would like to get to a second dataframe, with only scalar values in foo. If and only if the original values was a list, I would like to spread it over multiple new rows, with the other values duplicated.

e.g from:

 id     foo
0  301     [a]
1  301  [b, c]
2  302  [e, f,33,'Z']
3  303   42

to:

 id     foo
0  301     a
1  301     b
1  301     c 
2  302     e
2  302     f
2  302     33
2  302     Z
3  303     42

In Split set values from Pandas dataframe cell over multiple rows, I have learned how to do this for one columns, but how to handle the case where df has multiple columns which need to be duplicated as id?

2 Answers 2

1

If you want avoid use apply(pd.Series) because slow, here is another solution - convert non lists values to one element list first and then apply solution:

df['foo']  = [x if isinstance(x, list) else [x] for x in df['foo']]

from itertools import chain

df = pd.DataFrame({
    'id' : df['id'].values.repeat(df['foo'].str.len()),
    'foo' : list(chain.from_iterable(df['foo'].tolist()))

})

Or:

L  = [x if isinstance(x, list) else [x] for x in df['foo']]

from itertools import chain

df = pd.DataFrame({
    'id' : df['id'].values.repeat([len(x) for x in L]),
    'foo' : list(chain.from_iterable(L))

})
print (df)
    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

If small data or performance is not important - solution with pop for extract column foo:

s = df.pop('foo').apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
df = df.join(s).reset_index(drop=True)

Or solution with drop:

s = df['foo'].apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
df = df.drop('foo', axis=1).join(s).reset_index(drop=True)

print (df)

    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

df=pd.DataFrame(data=[[301,301,302,303],[['a'],['b','c'],['e','f',33,'Z'],42]],index=['id','foo']).T

df = pd.concat([df] * 1000, ignore_index=True)

def f(df):
    s = df['foo'].apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
    return df.drop('foo', axis=1).join(s).reset_index(drop=True)


In [241]: %timeit (f(df))
814 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [242]: %%timeit
     ...: L  = [x if isinstance(x, list) else [x] for x in df['foo']]
     ...: 
     ...: from itertools import chain
     ...: 
     ...: pd.DataFrame({
     ...:     'id' : df['id'].values.repeat([len(x) for x in L]),
     ...:     'foo' : list(chain.from_iterable(L))
     ...: 
     ...: })
     ...: 
2.6 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sign up to request clarification or add additional context in comments.

Comments

1
df.set_index('id')['foo'].apply(pd.Series).stack().reset_index(name='foo').drop('level_1', axis=1)

Output

    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

Multiple Columns Scenario

    id test            foo
0  301    1            [a]
1  301    2         [b, c]
2  302    3  [e, f, 33, Z]
3  303    4             42

using pd.set_index

df.set_index(['id','test'])['foo'].apply(pd.Series).stack().reset_index(name='foo').drop('level_2', axis=1)

Output

    id  test foo
0  301     1   a
1  301     2   b
2  301     2   c
3  302     3   e
4  302     3   f
5  302     3  33
6  302     3   Z
7  303     4  42

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.