136

This question is similar to one posted earlier. I want to concatenate three columns instead of two:

Here is the code for combining two columns:

from pandas import DataFrame

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)

df
   bar foo     new combined
0    1   a   apple      a_1
1    2   b  banana      b_2
2    3   c    pear      c_3

I want to combine three columns with the following command, but it is not working. Any idea why?

df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
4 Comments
  • 6
If you want to concat 3 columns you need 3 %s (%s_%s_%s), like df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1) Commented Nov 9, 2017 at 14:33
  • 2
    Possible duplicate of String concatenation of two pandas columns Commented Mar 18, 2019 at 3:10
  • 2
    A more comprehensive answer showing timings for multiple approaches is Combine two columns of text in pandas dataframe Commented Mar 13, 2021 at 4:16
  • Your reference post later has df.astype(str).agg('_'.join, axis=1). Commented Apr 20, 2022 at 7:33

16 Answers

201

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:

cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
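With the question's df, the order of cols determines the join order:

df
   bar foo     new    combined
0    1   a   apple   a_1_apple
1    2   b  banana  b_2_banana
2    3   c    pear    c_3_pear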

4 Comments

This is the best solution when the column list is saved as a variable and can hold a different number of columns every time.
A tiny gotcha I ran into: .values.astype(str) converts None into the string 'None' rather than an empty string.
Without lambda (faster and more concise): df[cols].astype(str).apply('_'.join, axis=1). That said, using .str.cat(...).str.cat(...)... is faster still (see the sketch below).
As someone relatively new to python, it still amazes me how many problems can be solved by just adding in the axis argument.
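For reference, a minimal sketch of the chained .str.cat approach mentioned in the comments above, assuming the question's df (non-string columns still need astype(str)):

# chain one str.cat per extra column; each call appends with the separator
df['combined'] = df['foo'].str.cat(df['bar'].astype(str), sep='_').str.cat(df['new'], sep='_')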
118

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.

In [17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']

In [18]: df
Out[18]: 
   bar foo     new    combined
0    1   a   apple   1_a_apple
1    2   b  banana  2_b_banana
2    3   c    pear    3_c_pear

7 Comments

This solution will be much faster than the .apply(..., axis=1) one on bigger DataFrames.
@MaxU yeah, and it's very easy.
I'm getting a SettingWithCopyWarning when I use this solution - how could I avoid triggering that warning?
This gets annoying when you need to join many columns, however.
If any of the columns are None, df['combined'] becomes nan. Example: if df.new.iloc[0] == None, then df.combined.iloc[0] becomes nan, instead of 1_a_
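As the last comment notes, missing values poison the + concatenation. A minimal sketch of one way to guard against that, assuming the string columns may contain None:

# fill missing strings first so a None in 'new' yields '1_a_' instead of NaN
df['combined'] = (
    df['bar'].astype(str) + '_'
    + df['foo'].fillna('') + '_'
    + df['new'].fillna('')
)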
31

If you have even more columns you want to combine, using the Series method str.cat might be handy:

df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")

Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).

4 Comments

Clever, but this caused a huge memory error for me. Tedious as it may be, writing df[col].map(str) + '_' + df[col2].map(str) + ... + df[col9].map(str) is way more efficient.
It's interesting! I didn't know we can use DataFrame as an argument in Series.str.cat()
This is by far the easiest for me, and I like the sep parameter
No memory issues for me, but I had to add df["foo"].fillna('') (see the sketch below).
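Following up on the fillna comment above: str.cat also accepts an na_rep parameter, so a minimal sketch that handles missing values in the calling Series could look like this:

# na_rep fills NaN in the calling Series; note that astype(str) turns NaN
# in the other columns into the literal string 'nan' (see the gotcha above)
df['combined'] = df['foo'].str.cat(df[['bar', 'new']].astype(str), sep='_', na_rep='')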
18

Just wanted to make a time comparison of both solutions (on a 30K-row DataFrame):

In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

In [2]: big = pd.concat([df] * 10**4, ignore_index=True)

In [3]: big.shape
Out[3]: (30000, 3)

In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop

In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop

a few more options:

In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop

In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
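Note that .ix has since been removed from pandas (deprecated in 0.20, removed in 1.0); a positional equivalent of option [6] today would be:

big.iloc[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)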

1 Comment

Very nice with additional options.
12

Possibly the fastest solution is to operate in plain Python:

from pandas import Series

Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)

Comparison against @MaxU's answer (using the big DataFrame, which has both numeric and string columns):

%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparison against @derchambers' answer (using their df DataFrame, where all columns are strings):

from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )

%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments

9

The answer given by @Allen is reasonably generic but can fall short on performance for larger DataFrames; functools.reduce does a lot better:

from functools import reduce

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'


def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])


def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)

# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)

# profile
%timeit df1 = reduce_join(df, list('1234'))  # 733 ms
%timeit df2 = apply_join(df, list('1234'))   # 8.84 s

2 Comments

Is there a way to skip empty cells without adding a separator? For example, if the strings to join are "", "a", and "b", the expected result is "_a_b", but is it possible to get "a_b"? I couldn't find a way to do this efficiently, because it requires a row-wise operation, since the number of non-empty values differs per row.
I am not sure what you mean @Yang, maybe post a new question with a workable example?
8

I think you are missing one %s

df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
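On Python 3.6+, the same row-wise apply reads more cleanly with an f-string; a minimal equivalent sketch:

df['combined'] = df.apply(lambda x: f"{x['bar']}_{x['foo']}_{x['new']}", axis=1)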

Comments

7

First convert the columns to str, then use .T.agg('_'.join) to concatenate them. More information is available in the pandas documentation for DataFrame.agg.

# Initialize columns
cols_concat = ['first_name', 'second_name']

# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')

# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)

1 Comment

Great answer. But did you have to use the transpose at the end? Is it the same as .agg('_'.join, axis=1)? Thanks
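To answer the comment above: yes, the transpose can be avoided by passing axis=1 to agg; a minimal sketch:

df['new_col'] = df[cols_concat].astype(str).agg('_'.join, axis=1)

The f6 vs f7 timings in a later answer suggest the axis=1 form is also considerably faster than transposing.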
3

If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do

def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)

This should be faster than apply and takes an arbitrary number of columns to concatenate.
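A usage sketch with the question's df:

import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3], 'new': ['apple', 'banana', 'pear']})
concat_columns(df, ['foo', 'bar', 'new'], 'combined', sep='_')
# df['combined'] is now ['a_1_apple', 'b_2_banana', 'c_3_pear']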

Comments

2
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)

When concatenating with a string separator ('_'), first convert the columns you want to str, and then you can concatenate them.

Comments

2
df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps'].map(str)

Here 'X' is any delimiter (e.g. a space) by which you want to separate the two merged columns.

Comments

2

@derchambers I found one more solution:

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def eval_join(df, columns):
    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)
    return eval(to_eval)


#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

Comments

2

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):

def concat_cols(df, cols_to_concat, new_col_name, separator):  
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df

Sample usage:

test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

Comments

2

Considering that one is combining three columns, one would need three format specifiers ('%s_%s_%s'), not just two ('%s_%s'). The following will do the job:

df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

Alternatively, if one wants to create a separate list to store the columns to combine, the following will do the job:

columns = ['foo', 'bar', 'new']

df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

This last one is more convenient, as one can simply change or add column names in the list, requiring fewer changes to the code.

Comments

1

If you want to join many columns in a large DataFrame, the fastest option is to write out a tedious statement:

df['new_col'] = df['col1'] + df['col2'] + ... + df['coln']

Here is a function that writes the statement for you.

def create_eval_statement(df_variable_name, columns, separator="_"):
    columns_strings = [f"{df_variable_name}['{c}']" for c in columns]
    return f" + '{separator}' + ".join(columns_strings)


stmt = create_eval_statement("df", ["col1", "col2"])
df["new_col"] = eval(stmt)

The code runs in 4.4 seconds for a DataFrame with 3 million rows and 17 columns.

Fast alternative with little code (6.0 seconds):

df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]

I listed and timed several options with the script below:

import timeit

import pandas as pd

# Create dataframe with 17 columns and 3 million rows, all strings
df = pd.DataFrame({chr(i + 65): [chr(i + 97)] * 3_000_000 for i in range(17)})

columns = list(df.columns)
sep = "_"
new_col = "new"


def create_exec_statement(
    df_variable_name="df",
    columns_variable_name="columns",
    new_column_name="new",
    separator="_",
):
    columns_strings = [
        f"{df_variable_name}[{columns_variable_name}[{i}]]"
        for i in range(len(eval(columns_variable_name)))
    ]
    separator = f" + '{separator}' + "
    statement = (
        f'{df_variable_name}["{new_column_name}"] = {separator.join(columns_strings)}'
    )
    return statement


def f1():
    exec(
        create_exec_statement(
            df_variable_name="df",
            columns_variable_name="columns",
            new_column_name=new_col,
            separator=sep,
        )
    )


def f2():
    df[new_col] = df[columns[0]].str.cat(df[columns[1:]], sep=sep)


def f3():
    df[new_col] = df[columns].T.add(sep).sum().str[:-1]


def f4():
    df[new_col] = df[columns].add(sep).sum(axis=1).str[:-1]


def f5():
    df[new_col] = df[columns].apply(lambda x: sep.join(x), axis=1)


def f6():
    df[new_col] = df[columns].agg(sep.join, axis=1)


def f7():
    df[new_col] = df[columns].T.agg(sep.join)


if __name__ == "__main__":
    for func in [f1, f2, f3, f4, f5, f6, f7]:
        print(f"{func.__name__}: {timeit.repeat(func, number=1, repeat=3)}")

# Results
# f1: [4.366812400025083, 4.43233589999727, 4.370704000000842]
# f2: [5.970817499997793, 5.898356199992122, 5.80382699999609]
# f3: [5.981191200000467, 5.959296400018502, 5.963758500001859]
# f4: [5.967713599995477, 6.032882600004086, 6.010665400011931]
# f5: [11.023198500013677, 10.792945499997586, 10.91107919998467]
# f6: [10.698224400024628, 10.668694899999537, 10.707435600023018]
# f7: [31.499697799998103, 31.31905089999782, 31.4950811000017]

Comments

0

Following up on @Allen's answer: if you need to chain such an operation with other DataFrame transformations, use assign:

cols = ['foo', 'bar', 'new']  # the column list from @Allen's answer

df.assign(
    combined=lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)

Comments
