How to concatenate multiple column values into a single column in Pandas dataframe

Question

This question is the same as one posted earlier, except that I want to concatenate three columns instead of two.

Here is the code that combines two columns:

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)

df
    bar foo new combined
0   1   a   apple   a_1
1   2   b   banana  b_2
2   3   c   pear    c_3

I want to combine three columns with this command, but it is not working. Any idea why?

df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

if you want to concat 3 columns you need 3 %s. (%s_%s_%s) like df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1) — user2652620
– user2652620, Commented Nov 9, 2017 at 14:33
Possible duplicate of String concatenation of two pandas columns — MrFun
– MrFun, Commented Mar 18, 2019 at 3:10
A more comprehensive answer showing timings for multiple approaches is Combine two columns of text in pandas dataframe — smci
– smci, Commented Mar 13, 2021 at 4:16
Your reference post later has df.astype(str).agg('_'.join, axis=1). — Ynjxsjmh
– Ynjxsjmh, Commented Apr 20, 2022 at 7:33

Allen · Accepted Answer · 2018-09-11 06:53:44Z

201

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:

cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

answered Sep 11, 2018 at 6:53

Allen


4 Comments

M_Idk392845 Over a year ago

This is the best solution when the column list is saved as a variable and can hold a different amount of columns every time

grofte Over a year ago

Tiny gotcha I ran into was that .values.astype(str) converts None into the string 'None' rather than an empty string. Apparently.

Pierre D Over a year ago

without lambda (faster and more concise): df[cols].astype(str).apply('_'.join, axis=1). That said, using .str.cat(...).str.cat(...)... is faster still.

Jordan Ryder Feb 25 at 17:51

As someone relatively new to python, it still amazes me how many problems can be solved by just adding in the axis argument.
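
Pierre D's comment above can be made concrete with a short sketch. It assumes the three-column frame from the question and the cols list from the answer; the names no_lambda and chained are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]})
cols = ["foo", "bar", "new"]

# Convert everything to str once, then pass the bound method "_".join directly
no_lambda = df[cols].astype(str).apply("_".join, axis=1)

# Chained Series.str.cat calls; only the non-string column needs astype(str)
chained = df["foo"].str.cat(df["bar"].astype(str), sep="_").str.cat(df["new"], sep="_")
```

Both variants should produce the same strings as the lambda version; which one is faster depends on the data and the pandas version, so it is worth timing on your own frame.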

Bill the Lizard · Accepted Answer · 2022-07-19 16:40:10Z

118

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.

In [17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']

In [18]: df
Out[18]: 
   bar foo     new    combined
0    1   a   apple   1_a_apple
1    2   b  banana  2_b_banana
2    3   c    pear    3_c_pear

edited Jul 19, 2022 at 16:40

Bill the Lizard


answered Sep 2, 2016 at 11:43

shivsn


7 Comments

MaxU - stand with Ukraine Over a year ago

this solution will be much faster compared to the .apply(, axis=1) one on bigger DFs

shivsn Over a year ago

@MaxU yeah, and it's very easy.

Nate Over a year ago

I'm getting a SettingWithCopyWarning when I use this solution - how could I avoid triggering that warning?

derchambers Over a year ago

This gets annoying when you need to join many columns, however.

Avantika Banerjee Over a year ago

If any of the columns are None, df['combined'] becomes nan. Example: if df.new.iloc[0] == None, then df.combined.iloc[0] becomes nan, instead of 1_a_
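
Two of the comments above raise practical issues: the SettingWithCopyWarning and missing values turning the whole result into NaN. The following is a minimal sketch of one way to handle both. It assumes the warning comes from assigning into a filtered slice of another frame (the common cause), and the name subset is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": [None, "banana", "pear"]})

# An explicit .copy() makes the filtered frame independent of df, which is the
# usual way to avoid SettingWithCopyWarning when adding a column to a subset.
subset = df[df["bar"] < 3].copy()

# fillna("") keeps the row instead of letting the missing value propagate
subset["combined"] = (
    subset["bar"].astype(str) + "_" + subset["foo"].fillna("") + "_" + subset["new"].fillna("")
)
```

With the fill in place, the first row keeps its trailing separator and becomes 1_a_ rather than NaN.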


cbrnr · Accepted Answer · 2018-05-24 08:39:07Z

31

If you have even more columns you want to combine, using the Series method str.cat might be handy:

df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")

Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).
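
As a small sketch of the case mentioned in parentheses, here is the same call starting from the numeric column. It assumes the frame from the question; only the order of the columns differs from the line above.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]})

# "bar" is numeric, so it is converted before the .str accessor is used;
# the remaining columns are already strings and can be passed as a DataFrame.
combined = df["bar"].astype(str).str.cat(df[["foo", "new"]], sep="_")
```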

answered May 24, 2018 at 8:39

cbrnr


4 Comments

Corey Levinson Over a year ago

Clever, but this caused a huge memory error for me. Tedious as it may be, writing df[col].map(str) + '_' + df[col2].map(str) + ... + df[col9].map(str) is way more efficient.

MaxU - stand with Ukraine Over a year ago

It's interesting! I didn't know we can use DataFrame as an argument in Series.str.cat()

avirr Over a year ago

This is by far the easiest for me, and I like the sep parameter

citynorman Over a year ago

No memory issues for me. Has to add df["foo"].fillna('').
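
The fillna('') mentioned in the last comment can be sketched as follows, next to the na_rep parameter of Series.str.cat, which serves the same purpose. The frame with a missing value in foo is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["a", None, "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]})
others = df[["bar", "new"]].astype(str)

# Option 1: let str.cat substitute an empty string for missing values
with_na_rep = df["foo"].str.cat(others, sep="_", na_rep="")

# Option 2: fill the missing values before concatenating
with_fillna = df["foo"].fillna("").str.cat(others, sep="_")
```

Without either option, a row with a missing value in any of the joined columns ends up missing in the result.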

MaxU - stand with Ukraine · Accepted Answer · 2016-09-02 13:24:17Z

18

Just wanted to make a time comparison for both solutions (for 30K rows DF):

In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

In [2]: big = pd.concat([df] * 10**4, ignore_index=True)

In [3]: big.shape
Out[3]: (30000, 3)

In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop

In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop

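A few more options (note that .ix was removed in pandas 1.0; on current versions, .iloc[:, :-1] selects the same columns here):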

In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop

In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
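
For readers who want to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names with_apply and with_plus are hypothetical, and the timings will differ by machine and pandas version, so no numbers are shown.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]})
big = pd.concat([df] * 10**4, ignore_index=True)  # 30,000 rows, as in the answer


def with_apply():
    # Row-wise formatting through DataFrame.apply
    return big.apply(lambda x: "%s_%s_%s" % (x["bar"], x["foo"], x["new"]), axis=1)


def with_plus():
    # Vectorised string concatenation
    return big["bar"].astype(str) + "_" + big["foo"] + "_" + big["new"]


# Both approaches should build identical strings
assert (with_apply() == with_plus()).all()

for func in (with_apply, with_plus):
    seconds = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{func.__name__}: {seconds * 1000:.1f} ms")
```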

answered Sep 2, 2016 at 13:24

MaxU - stand with Ukraine


1 Comment

shivsn Over a year ago

Very nice with additional options.

krassowski · Accepted Answer · 2020-06-01 15:42:46Z

Possibly the fastest solution is to operate in plain Python:

Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)

Comparison against @MaxU answer (using the big data frame which has both numeric and string columns):

%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparison against @derchambers answer (using their df data frame where all columns are strings):

from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )

%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
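
A self-contained version of the plain-Python approach, as a minimal sketch. It assumes the mixed-type frame from the question and uses a non-default index (chosen only for illustration) to show why index=df.index is passed.

```python
import pandas as pd

df = pd.DataFrame(
    {"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]},
    index=[10, 20, 30],
)

# astype(str) handles the numeric column; tolist() yields one list of strings per row
rows = df.astype(str).values.tolist()

# Reusing df.index keeps the result aligned with the original frame
combined = pd.Series(map("_".join, rows), index=df.index)
```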

derchambers · Accepted Answer · 2020-04-17 20:45:49Z

9

The answer given by @Allen is reasonably generic but can perform poorly on larger dataframes.

Using reduce does a lot better:

from functools import reduce

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'


def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])


def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)

# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)

# profile
%timeit df1 = reduce_join(df, list('1234'))  # 733 ms
%timeit df2 = apply_join(df, list('1234'))   # 8.84 s

edited Apr 17, 2020 at 20:45

answered Apr 17, 2020 at 20:32

derchambers


2 Comments

Yang Over a year ago

Is there a way to not abandon the empty cells, without adding a separator, for example, the strings to join is "", "a" and "b", the expected result is "_a_b", but is it possible to have "a_b". I couldn't find a way to do this efficiently, because it requires row wise operation, since the length of each row is different.

derchambers Over a year ago

I am not sure what you mean @Yang, maybe post a new question with a workable example?
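
One possible reading of Yang's question is that empty strings should be skipped so that no stray separator appears. A minimal sketch under that assumption follows; it also assumes every cell is a string (no NaN), and the sample frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"1": ["", "x"], "2": ["a", ""], "3": ["b", "z"]})
cols = ["1", "2", "3"]

# filter(None, row) drops empty strings, so no leading or doubled separator appears
joined = pd.Series(
    ["_".join(filter(None, row)) for row in df[cols].values.tolist()],
    index=df.index,
)
```

This works row by row in plain Python, so it avoids DataFrame.apply but is not a vectorised operation.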

milos.ai · Accepted Answer · 2016-09-02 11:43:28Z

8

I think you are missing one %s:

df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
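
The underlying rule is plain Python string formatting and can be checked without pandas. A small sketch, with the tuple values taken from the first row of the question's frame:

```python
values = (1, "a", "apple")

message = None
try:
    # Two placeholders but three values: Python refuses to format
    "%s_%s" % values
except TypeError as exc:
    message = str(exc)

# One placeholder per value works
fixed = "%s_%s_%s" % values
```

Here message holds the text of the TypeError and fixed is the combined string for that row.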

answered Sep 2, 2016 at 11:43

milos.ai


Jane Kathambi · Accepted Answer · 2022-02-28 08:32:59Z

7

First convert the columns to str, then use .T.agg('_'.join) to concatenate them.

# Initialize columns
cols_concat = ['first_name', 'second_name']

# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')

# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)

answered Feb 28, 2022 at 8:32

Jane Kathambi


1 Comment

Bowen Liu Over a year ago

Great comment. But did you have to use transpose at the end? Is it the same as .agg(''.join, axis=1)? Thanks
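
Regarding the question in the comment, a quick sketch suggests the transpose and the axis=1 form give the same strings. The sample names are invented, and the separator is '_' as in the answer rather than the empty string from the comment.

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada", "Alan"], "second_name": ["Lovelace", "Turing"]})
cols_concat = ["first_name", "second_name"]

# Transpose so each original row becomes a column, then aggregate column-wise
via_transpose = df[cols_concat].T.agg("_".join)

# Aggregate row-wise directly, without transposing
via_axis = df[cols_concat].agg("_".join, axis=1)
```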

Daniil Balabanov · Accepted Answer · 2020-11-27 14:53:34Z

3

If you have a list of columns you want to concatenate, and maybe you'd like to use some separator, here's what you can do:

def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)

This should be faster than apply and takes an arbitrary number of columns to concatenate.
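
A usage sketch for the function above, assuming the frame from the question. Note that the function modifies df in place and returns None, so the result is read from the new column.

```python
import pandas as pd


def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    # Same logic as in the answer: start from the first column, then append the rest
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)


df = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3], "new": ["apple", "banana", "pear"]})

# Adds the column "combined" to df in place
concat_columns(df, ["foo", "bar", "new"], "combined", sep="_")
```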

answered Nov 27, 2020 at 14:53

Daniil Balabanov


Manivannan Murugavel · Accepted Answer · 2018-04-19 07:59:28Z

2

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)

To concatenate with a string separator such as '_', first convert each column you want to join to string with .astype(str); after that you can concatenate the columns.

edited Apr 19, 2018 at 7:59

answered Apr 18, 2018 at 10:10

Manivannan Murugavel


Papershine · Accepted Answer · 2018-10-12 13:10:05Z

2

df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']

Here 'X' stands for any delimiter (e.g. a space) that you want between the two merged columns.

edited Oct 12, 2018 at 13:10

Papershine


answered Oct 12, 2018 at 13:06

Nipun Kumar Goel


Grzegorz · Accepted Answer · 2021-04-26 07:46:14Z

2

@derchambers I found one more solution:

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def eval_join(df, columns):

    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)

    return eval(to_eval)


#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

edited Apr 26, 2021 at 7:46

answered Apr 22, 2020 at 12:44

Grzegorz


StephenOK · Accepted Answer · 2021-12-02 13:03:07Z

2

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):

def concat_cols(df, cols_to_concat, new_col_name, separator):  
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df

Sample usage:

test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

answered Dec 2, 2021 at 13:03

StephenOK


Gonçalo Peres · Accepted Answer · 2022-09-20 09:49:26Z

Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two, '%s_%s'. The following will do the work:

df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.

columns = ['foo', 'bar', 'new']

df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

This last one is more convenient, as one can simply change or add the column names in the list, so it requires fewer changes.

3UqU57GnaX · Accepted Answer · 2024-08-07 11:23:29Z

If you want to join many columns in a large Dataframe, the fastest option is to write out a tedious statement:

df['new_col'] = df['col1'] + df['col2'] + ... + df['coln']

Here is a function that writes the statement for you.

def create_eval_statement(df_variable_name, columns, separator="_"):
    columns_strings = [f"{df_variable_name}['{c}']" for c in columns]
    return f" + '{separator}' + ".join(columns_strings)


stmt = create_eval_statement("df", ["col1", "col2"])
df["new_col"] = eval(stmt)

Code runs in 4.4 seconds for a Dataframe with 3 million rows and 17 columns.

Fast alternative with little code (6.0 seconds):

df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]

I listed and timed several options with the script below:

import timeit

import pandas as pd

# Create dataframe with 17 columns and 3 million rows, all strings
df = pd.DataFrame({chr(i + 65): [chr(i + 97)] * 3_000_000 for i in range(17)})

columns = list(df.columns)
sep = "_"
new_col = "new"


def create_exec_statement(
    df_variable_name="df",
    columns_variable_name="columns",
    new_column_name="new",
    separator="_",
):
    columns_strings = [
        f"{df_variable_name}[{columns_variable_name}[{i}]]"
        for i in range(len(eval(columns_variable_name)))
    ]
    separator = f" + '{separator}' + "
    statement = (
        f'{df_variable_name}["{new_column_name}"] = {separator.join(columns_strings)}'
    )
    return statement


def f1():
    exec(
        create_exec_statement(
            df_variable_name="df",
            columns_variable_name="columns",
            new_column_name=new_col,
            separator=sep,
        )
    )


def f2():
    df[new_col] = df[columns[0]].str.cat(df[columns[1:]], sep=sep)


def f3():
    df[new_col] = df[columns].T.add(sep).sum().str[:-1]


def f4():
    df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]


def f5():
    df[new_col] = df[columns].apply(lambda x: sep.join(x), axis=1)


def f6():
    df[new_col] = df[columns].agg(sep.join, axis=1)


def f7():
    df[new_col] = df[columns].T.agg(sep.join)


if __name__ == "__main__":
    for func in [f1, f2, f3, f4, f5, f6, f7]:
        print(f"{func.__name__}: {timeit.repeat(func, number=1, repeat=3)}")

# Results
# f1: [4.366812400025083, 4.43233589999727, 4.370704000000842]
# f2: [5.970817499997793, 5.898356199992122, 5.80382699999609]
# f3: [5.981191200000467, 5.959296400018502, 5.963758500001859]
# f4: [5.967713599995477, 6.032882600004086, 6.010665400011931]
# f5: [11.023198500013677, 10.792945499997586, 10.91107919998467]
# f6: [10.698224400024628, 10.668694899999537, 10.707435600023018]
# f7: [31.499697799998103, 31.31905089999782, 31.4950811000017]

Antiez · Accepted Answer · 2022-09-02 16:25:21Z

0

Following up on @Allen's response: if you need to chain this operation with other DataFrame transformations, use assign:

df.assign(
    combined = lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)

answered Sep 2, 2022 at 16:25

Antiez
