
I have a data frame like this

    col1    col2 
    [A, B]   1
    [A, C]   2

I would like to separate col1 into two columns, and I would like the output in this form:

    col1_A  col1_B  col2
      A       B       1
      A       C       2

I have tried df['col1'].str.rsplit(',', n=2, expand=True), but it raised TypeError: list indices must be integers or slices, not str.

1 Comment
Try df[['col_A','col_b']] = pd.DataFrame(df['col1'].tolist())
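
A minimal runnable sketch of that suggestion (the explicit column names follow the question's desired output, and index=df.index is an assumption added here so the assignment lines up):

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

# expand the lists into their own frame; reusing df's index keeps the assignment aligned
expanded = pd.DataFrame(df['col1'].tolist(), index=df.index,
                        columns=['col1_A', 'col1_B'])
df[['col1_A', 'col1_B']] = expanded
print(df)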

4 Answers


join + pop

# pop removes col1 from df; its list values become two new columns, joined back on the index
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
                          columns=['col1_A', 'col1_B']))

print(df)

   col2 col1_A col1_B
0     1      A      B
1     2      A      C

It's good practice to avoid pd.Series.apply, which often amounts to a Python-level loop with additional overhead.
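
As a rough, hypothetical micro-benchmark of that point (sizes and timings will vary by machine and pandas version):

import timeit

import pandas as pd

# a larger frame, just to make the difference visible
big = pd.DataFrame({'col1': [['A', 'B']] * 100_000, 'col2': 1})

# apply: a Python-level lambda call per row
t_apply = timeit.timeit(lambda: big['col1'].apply(lambda x: x[0]), number=10)

# tolist + DataFrame: one pass over a plain Python list
t_tolist = timeit.timeit(
    lambda: pd.DataFrame(big['col1'].tolist(), columns=['col1_A', 'col1_B']),
    number=10)

print(f'apply : {t_apply:.3f} s')
print(f'tolist: {t_tolist:.3f} s')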


5 Comments

@AntonvBR, I haven't looked at the source, but I did test it: %timeit df['col1'].tolist() vs %timeit df['col1'].values.tolist(). In general, (I feel) the Pandas API seems to add non-O(1) overheads to simple methods :S
@AntonvBR, What worries me is that it's not a fixed factor slower. It's something O(n). Really the Pandas method should be able to short-circuit straight to the NumPy method when applicable.
Maybe by setting some global option on the DataFrame class? Or decorating it?
OK, I'm looking at the source now, and yes, it is the datetime check it performs. It is the difference between pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')]).values.tolist() and pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')]).tolist(). From what I can see, the series values are sent to the function instead of doing a dtype check. To handle mixed types? Source: github.com/pandas-dev/pandas/blob/…
Sorry for posting a lot here... OK, so the function checks whether the values are datetimelike: this function github.com/pandas-dev/pandas/blob/… with multiple or tests. Why can't it simply do a plain dtype check instead?
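
For reference, a minimal sketch of the difference those comments describe (the exact return types depend on the NumPy/pandas versions):

import pandas as pd

s = pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')])

# pandas boxes the values back into Timestamp objects
print(s.tolist())

# going through the NumPy datetime64 array skips that boxing,
# so the elements come back as NumPy's own conversion instead
print(s.values.tolist())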

You can use apply:

import pandas as pd
df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})
# take the first / second element of each list
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
# reorder so the new columns come before col2
df = df[df.columns[[1, 2, 0]]]
print(df)

  col1_A col1_B  col2
0      A      B     1
1      A      C     2

1 Comment

You could also use df['col1'].str[0]
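
For reference, a minimal sketch of that suggestion; the .str accessor indexes element-wise into lists as well as strings:

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

# .str[i] pulls the i-th element out of each list
df['col1_A'] = df['col1'].str[0]
df['col1_B'] = df['col1'].str[1]
df = df.drop(columns='col1')[['col1_A', 'col1_B', 'col2']]
print(df)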

You can do this:

>> df_expanded = df['col1'].apply(pd.Series).rename(
     columns = lambda x : 'col1_' + str(x))

>> df_expanded

  col1_0 col1_1
0      A      B
1      A      C

Adding these columns to the original dataframe:

>> pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)

  col1_0 col1_1  col2
0      A      B     1
1      A      C     2

If the columns need to be named after the values in the first row:

df_expanded.columns =  ['col1_' + value
                        for value in df_expanded.iloc[0,:].values.tolist()]

  col1_A col1_B
0      A      B
1      A      C

2 Comments

There is an add_prefix method on the DataFrame class.
Well, the important thing is that it can be done by applying pandas.Series to the column.
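
A small sketch of the add_prefix suggestion, assuming the same df as in the answer:

df_expanded = df['col1'].apply(pd.Series).add_prefix('col1_')

This produces the same col1_0 / col1_1 names in a single step.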

Zip the values and the column names, and use insert to get the right position.

# pair each unzipped column of values with its new name, and insert at the front
for ind, (k, v) in enumerate(zip(zip(*df.pop('col1').tolist()), ['col1_A', 'col1_B'])):
    df.insert(ind, v, k)

Full example

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)

print(df)

Returns:

  col1_A col1_B  col2
0      A      B     1
1      A      C     2
