
I have a data frame like this

    col1    col2 
    [A, B]   1
    [A, C]   2

I would like to separate col1 into two columns, and I would like the output in this form:

    col1_A  col1_B  col2
      A       B       1
      A       C       2

I have tried df['col1'].str.rsplit(',', n=2, expand=True), but it raised TypeError: list indices must be integers or slices, not str.

1 Comment
Try df[['col_A','col_b']] = pd.DataFrame(df['col1'].tolist())
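
A minimal runnable sketch of that suggestion (the explicit column names follow the question's desired output, and index=df.index is an assumption added here so the assignment lines up):

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

# expand the lists into their own frame; reusing df's index keeps the assignment aligned
expanded = pd.DataFrame(df['col1'].tolist(), index=df.index,
                        columns=['col1_A', 'col1_B'])
df[['col1_A', 'col1_B']] = expanded
print(df)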

4 Answers


join + pop

# pop removes col1 from df; its list values become two new columns, joined back on the index
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
                          columns=['col1_A', 'col1_B']))

print(df)

   col2 col1_A col1_B
0     1      A      B
1     2      A      C

It's good practice to avoid pd.Series.apply, which often amounts to a Python-level loop with additional overhead.
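
As a rough, hypothetical micro-benchmark of that point (sizes and timings will vary by machine and pandas version):

import timeit

import pandas as pd

# a larger frame, just to make the difference visible
big = pd.DataFrame({'col1': [['A', 'B']] * 100_000, 'col2': 1})

# apply: a Python-level lambda call per row
t_apply = timeit.timeit(lambda: big['col1'].apply(lambda x: x[0]), number=10)

# tolist + DataFrame: one pass over a plain Python list
t_tolist = timeit.timeit(
    lambda: pd.DataFrame(big['col1'].tolist(), columns=['col1_A', 'col1_B']),
    number=10)

print(f'apply : {t_apply:.3f} s')
print(f'tolist: {t_tolist:.3f} s')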


5 Comments

@AntonvBR, I haven't looked at the source, but I did test it: %timeit df['col1'].tolist() vs %timeit df['col1'].values.tolist(). In general, (I feel) the Pandas API seems to add non-O(1) overheads to simple methods :S
@AntonvBR, What worries me is that it's not a fixed factor slower. It's something O(n). Really the Pandas method should be able to short-circuit straight to the NumPy method when applicable.
Maybe by setting some global option on the DataFrame class? Or decorating it?
OK, I'm looking at the source now, and yes, it is the datetime check it performs. It is the difference between pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')]).values.tolist() and pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')]).tolist(). From what I can see, the series values are sent to the function instead of doing a dtype check. To handle mixed types? Source: github.com/pandas-dev/pandas/blob/…
Sorry for posting a lot here... OK, so the function checks whether the values are datetimelike: this function github.com/pandas-dev/pandas/blob/… with multiple or tests. Why can't it simply do a plain dtype check instead?
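
For reference, a minimal sketch of the difference those comments describe (the exact return types depend on the NumPy/pandas versions):

import pandas as pd

s = pd.Series([pd.Timestamp('2018'), pd.Timestamp('2019')])

# pandas boxes the values back into Timestamp objects
print(s.tolist())

# going through the NumPy datetime64 array skips that boxing,
# so the elements come back as NumPy's own conversion instead
print(s.values.tolist())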

You can use apply:

import pandas as pd
df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})
# take the first / second element of each list
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
# reorder so the new columns come before col2
df = df[df.columns[[1, 2, 0]]]
print(df)

  col1_A col1_B  col2
0      A      B     1
1      A      C     2

1 Comment

You could also use df['col1'].str[0]
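
For reference, a minimal sketch of that suggestion; the .str accessor indexes element-wise into lists as well as strings:

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

# .str[i] pulls the i-th element out of each list
df['col1_A'] = df['col1'].str[0]
df['col1_B'] = df['col1'].str[1]
df = df.drop(columns='col1')[['col1_A', 'col1_B', 'col2']]
print(df)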

You can do this:

>> df_expanded = df['col1'].apply(pd.Series).rename(
     columns = lambda x : 'col1_' + str(x))

>> df_expanded

  col1_0 col1_1
0      A      B
1      A      C

Adding these columns to the original dataframe:

>> pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)

  col1_0 col1_1  col2
0      A      B     1
1      A      C     2

If the columns need to be named after the values in the first row:

df_expanded.columns =  ['col1_' + value
                        for value in df_expanded.iloc[0,:].values.tolist()]

  col1_A col1_B
0      A      B
1      A      C

2 Comments

There is an add_prefix method on the DataFrame class.
Well, the important thing is that it can be done by applying pandas.Series to the column.
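
A small sketch of the add_prefix suggestion, assuming the same df as in the answer:

df_expanded = df['col1'].apply(pd.Series).add_prefix('col1_')

This produces the same col1_0 / col1_1 names in a single step.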

Zip the values and the column names, and use insert to get the right position.

# pair each unzipped column of values with its new name, and insert at the front
for ind, (k, v) in enumerate(zip(zip(*df.pop('col1').tolist()), ['col1_A', 'col1_B'])):
    df.insert(ind, v, k)

Full example

import pandas as pd

df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})

for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)

print(df)

Returns:

  col1_A col1_B  col2
0      A      B     1
1      A      C     2
