Merge multiple column values into one column in python pandas

Question

I have a pandas data frame like this:

   Column1  Column2  Column3  Column4  Column5
 0    a        1        2        3        4
 1    a        3        4        5
 2    b        6        7        8
 3    c        7        7

What I want to do now is get a new dataframe containing Column1 and a new column, ColumnA. ColumnA should contain all values from Column2 through Column n (where n is the number of columns from Column2 to the end of the row), like this:

  Column1  ColumnA
0   a      1,2,3,4
1   a      3,4,5
2   b      6,7,8
3   c      7,7

How could I best approach this issue?

EdChum · Accepted Answer · 2019-07-26 19:09:38Z

161

You can call apply and pass axis=1 to apply it row-wise, then convert the dtype to str and join:

In [153]:
df['ColumnA'] = df[df.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(int).astype(str)),
    axis=1
)
df

Out[153]:
  Column1  Column2  Column3  Column4  Column5  ColumnA
0       a        1        2        3        4  1,2,3,4
1       a        3        4        5      NaN    3,4,5
2       b        6        7        8      NaN    6,7,8
3       c        7        7      NaN      NaN      7,7

Here I call dropna to get rid of the NaN values. However, the NaN values force the row to float, so we need to cast back to int before converting to str; otherwise we end up with floats as str (e.g. '3.0').
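For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt by hand from the question's table (assuming the blank cells are NaN), and join_row is just an illustrative helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    "Column1": ["a", "a", "b", "c"],
    "Column2": [1, 3, 6, 7],
    "Column3": [2, 4, 7, 7],
    "Column4": [3, 5, 8, np.nan],
    "Column5": [4, np.nan, np.nan, np.nan],
})

def join_row(row):
    # Drop NaN, cast the float row back to int, then join the values as strings
    return ",".join(row.dropna().astype(int).astype(str))

# Apply row-wise to every column except the first
df["ColumnA"] = df[df.columns[1:]].apply(join_row, axis=1)

# Keep only the two columns the question asks for
result = df[["Column1", "ColumnA"]]
print(result)
```

Note that astype(int) truncates, so this only makes sense when the values really are whole numbers.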

edited Jul 26, 2019 at 19:09

answered Oct 13, 2015 at 9:05


4 Comments

Sade Over a year ago

For some reason this doesn't work for me. I get duplicates: row 0 of ColumnA is 1,2,3,4,1,2,3,4

Sade Over a year ago

It seems like using iloc works for me. There are no duplicates. df['ColumnA'] = df.iloc[:,source_col_loc+1:source_col_loc+4].apply( lambda x: ",".join(x.astype(str)), axis=1)

Kaustuv Over a year ago

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

yeliabsalohcin Over a year ago

For future visitors wanting to combine just some columns (and control their order), you can replace df[df.columns[1:]] with df[['Column4','Column1']]

Amin Salgado (edited by Derlin) · Accepted Answer · 2019-09-12 08:36:03Z

24

I suggest using .assign:

df2 = df.assign(ColumnA=df.Column2.astype(str) + ', ' +
                df.Column3.astype(str) + ', ' +
                df.Column4.astype(str) + ', ' +
                df.Column5.astype(str))

It's simple, if a bit long, but it worked for me.
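If typing out every column gets tedious, the same concatenation can be built in a loop. This is only a sketch: the small frame and the cols list are made up for illustration, and it assumes there are no missing values, since astype(str) would turn NaN into the string "nan".

```python
import pandas as pd

# Small illustrative frame with no missing values (an assumption of this sketch)
df = pd.DataFrame({
    "Column1": ["a", "b"],
    "Column2": [1, 6],
    "Column3": [2, 7],
    "Column4": [3, 8],
})

cols = ["Column2", "Column3", "Column4"]  # columns to merge, in order

# Start with the first column as strings, then append ", " + each next column
joined = df[cols[0]].astype(str)
for col in cols[1:]:
    joined = joined + ", " + df[col].astype(str)

# assign returns a new frame and leaves df untouched
df2 = df.assign(ColumnA=joined)
print(df2)
```

Each + here works on whole columns at once, so there is no per-row Python function call.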

edited Sep 12, 2019 at 8:36

Derlin


answered Apr 12, 2018 at 8:27

Amin Salgado


1 Comment

Amin Salgado Over a year ago

Also, if you are doing it for tonnes of data, it is much faster than lambda

Om Prakash · Accepted Answer · 2018-12-14 06:45:28Z

If you have a lot of columns in the dataframe (say 1000) and you want to merge only a few of them, starting at a particular column name (e.g. Column2 in the question) and continuing for an arbitrary number of columns after it (here, 3 columns after Column2, inclusive of Column2 as the OP asked), you can select them by position.

We can get the position of a column using .get_loc():

source_col_loc = df.columns.get_loc('Column2')  # column positions start from 0

# Slice from Column2 itself through the three columns after it
df['ColumnA'] = df.iloc[:, source_col_loc:source_col_loc + 4].apply(
    lambda x: ",".join(x.dropna().astype(int).astype(str)), axis=1)

df

  Column1  Column2  Column3  Column4  Column5  ColumnA
0       a        1        2        3        4  1,2,3,4
1       a        3        4        5      NaN    3,4,5
2       b        6        7        8      NaN    6,7,8
3       c        7        7      NaN      NaN      7,7

The .dropna() call removes the NaN values before joining; alternatively, use .fillna() to replace them with a placeholder.

Hope it helps!
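The get_loc idea can be wrapped in a small helper. This is a rough sketch rather than a definitive implementation: merge_from and its parameters are hypothetical names, the frame is rebuilt from the question, and it assumes column names are unique (so get_loc returns a single integer) and that the values are whole numbers.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    "Column1": ["a", "a", "b", "c"],
    "Column2": [1, 3, 6, 7],
    "Column3": [2, 4, 7, 7],
    "Column4": [3, 5, 8, np.nan],
    "Column5": [4, np.nan, np.nan, np.nan],
})

def merge_from(frame, start_col, n_cols, sep=","):
    """Join n_cols columns starting at start_col (inclusive), skipping NaN."""
    start = frame.columns.get_loc(start_col)      # integer position of start_col
    block = frame.iloc[:, start:start + n_cols]   # positional slice of columns
    return block.apply(
        lambda row: sep.join(row.dropna().astype(int).astype(str)), axis=1
    )

# Column2 plus the three columns after it
merged = merge_from(df, "Column2", 4)
print(merged)
```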

trey hannam · Accepted Answer · 2022-10-27 17:07:09Z

10

apply() is 100X slower than agg()

Do NOT use apply; it does not scale well. Use df.agg() instead. In the benchmark below, apply() took seconds while agg() took milliseconds (ms).

Here's an example:

import numpy as np
import pandas as pd

def createList(r1, r2):
    return np.arange(r1, r2+1, 1)

sample_data = createList(1, 100_000)  # an array of 100,000 values (1 to 100,000)

test_df = pd.DataFrame(
    [sample_data]
)

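# Default axis=0: the lambda runs once per column (100,000 calls on this 1-row frame)
test_df.apply(lambda x: ','.join(x.astype(str)))
# 3.47 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# axis=1: the whole row is joined in a single call
test_df.astype(str).agg(', '.join, axis=1)
# 34.8 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)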

As you can see from this sample, apply() took an average time of 3.47 seconds whereas agg() took an average time of 34.8 milliseconds. The gap in performance will become bigger as more data is added too.

Note: I used %%timeit in a Jupyter notebook to get the run time for each method.
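To apply the agg() pattern to specific columns, select them first. A minimal sketch, assuming a small made-up frame with no missing values (the cols list is just an illustration):

```python
import pandas as pd

# Small illustrative frame with no missing values (an assumption of this sketch)
df = pd.DataFrame({
    "Column1": ["a", "b"],
    "Column2": [1, 6],
    "Column3": [2, 7],
    "Column4": [3, 8],
})

cols = ["Column2", "Column3", "Column4"]  # pick the columns and their order here

# Convert the selected columns to str, then join each row's values
df["ColumnA"] = df[cols].astype(str).agg(",".join, axis=1)
print(df)
```

Changing the order of cols changes the order of the joined values.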

answered Oct 27, 2022 at 17:07


5 Comments

Teamothy Over a year ago

Already tested that and I can confirm that it is much faster.

bonCodigo Over a year ago

Could you share reproducible code that applies your suggestion to aggregating multiple ad-hoc columns? Where do I specify the column names in your answer?

trey hannam Over a year ago

@bonCodigo could you provide an example dataframe?

bonCodigo Over a year ago

Try, happy to take a look at this? stackoverflow.com/q/75482636/1389394

fanbyprinciple Over a year ago

For those looking for a simpler syntax for the aggregate function: df['FullName'] = df[['First_Name', 'Last_Name']].agg('-'.join, axis=1)
