
I am wondering whether I could build a function like this in Pandas:

    def concatenate(df, columnlist, newcolumn):
        # df is the dataframe
        # columnlist is the list of names of all the columns I want to concatenate
        # newcolumn is the name of the resulting new column

        for c in columnlist:
            ...  # some Pandas functions

        return df  # this one has the concatenated "newcolumn"

I am asking because len(columnlist) is going to be very large and dynamic. Thanks!

2 Answers


Try this:

    import numpy as np
    np.add.reduce(df[columnlist], axis=1)

This "adds" the values in each row, which for strings means concatenating them ("abc" + "de" == "abcde").
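
A minimal sketch of how the same "add the strings" idea could be wrapped into the `concatenate(df, columnlist, newcolumn)` function from the question. The `sep` parameter, the `astype(str)` conversion, and the choice to return a copy are assumptions added for illustration, and the sketch uses a plain loop with `+` on Series rather than `np.add.reduce`:

```python
import pandas as pd


def concatenate(df, columnlist, newcolumn, sep=""):
    """Return a copy of df with the columns in columnlist joined row-wise.

    `sep` is a hypothetical extra parameter (not in the question);
    the default "" reproduces plain concatenation.
    """
    if not columnlist:
        raise ValueError("columnlist must contain at least one column name")

    # Convert each column to str so that int/float columns can be joined too.
    result = df[columnlist[0]].astype(str)
    for c in columnlist[1:]:
        # Series + str + Series concatenates element-wise for strings.
        result = result + sep + df[c].astype(str)

    # assign() returns a new frame, leaving the original df untouched.
    return df.assign(**{newcolumn: result})


frame = pd.DataFrame({"A": ["ABC", "DEF"], "B": ["XYZ", "UVW"], "N": [1, 2]})
joined = concatenate(frame, ["A", "B"], "AB")
print(joined["AB"].tolist())   # ['ABCXYZ', 'DEFUVW']
spaced = concatenate(frame, ["A", "B", "N"], "ABN", sep=" ")
print(spaced["ABN"].tolist())  # ['ABC XYZ 1', 'DEF UVW 2']
```

Because `columnlist` is only iterated over, its length can change from call to call without any change to the function.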


Originally I thought you wanted to concatenate them lengthwise, into a single longer series of all the values. If anyone else wants to do that, here's the code:

    pd.concat(map(df.get, columnlist)).reset_index(drop=True)

6 Comments

Thanks John! I guess you misunderstood my original request @John Zwinck: if Column A is "ABC" and Column B is "XYZ" my newcolumn should be "ABCXYZ". The newcolumn has the exact length of the dataframe.
@LarryZ: I see. I've changed my answer.
Thanks, @John Zwinck. It worked! It seems this method requires all the columns to be str; when any column contains int or float, it gives the following error: "TypeError: must be str, not float"
@LarryZ: You can fix that by np.add.reduce(df[columnlist].astype(str), axis=1).
Thanks, man! This is the answer, period! A shameless follow-up question: what if I also want to add a "separator" between columns, i.e. "ABC XYZ" instead of "ABCXYZ"? A dumb way is to add a new column called "Space" that contains nothing but one space " ", then insert the column name "Space" into my columnlist where necessary. It worked fine. Is there a more Pythonic way to do this?

Given a dataframe like this:

    df

         A    B
    0  aaa  ddd
    1  bbb  eee
    2  ccc  fff

You can just use df.sum, provided every column is a string column:

    df.sum(1)

    0    aaaddd
    1    bbbeee
    2    cccfff
    dtype: object

If you need to perform a conversion, you can do so:

    df.astype(str).sum(1)

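If you need to select a subset of your data (only string columns?), you can use select_dtypes. String columns are stored with the object dtype, so pass 'object'; older pandas versions reject 'str' with a TypeError. Note that object columns can also hold mixed types:

    df.select_dtypes(include=['object']).sum(1)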

If you need to select by columns, this should do:

    df[['A', 'B']].sum(1)

In every case, the addition is not done in place, so if you want to keep the result, assign it:

    r = df.sum(1)
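
As a hedged illustration of assigning the result back, here is a small sketch that stores the row-wise concatenation as a new column. It swaps in a row-wise `str.join` via `apply` instead of `sum`, which also makes a separator easy to add. The column `N` and the new column names `AB` and `ABN` are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["aaa", "bbb", "ccc"],
    "B": ["ddd", "eee", "fff"],
    "N": [1, 2, 3],  # hypothetical non-string column
})

# Join the selected string columns row by row and keep the result.
df["AB"] = df[["A", "B"]].apply("".join, axis=1)

# Convert to str first when a column is not a string column,
# and use a different joiner to get a separator.
df["ABN"] = df[["A", "B", "N"]].astype(str).apply("-".join, axis=1)

print(df["AB"].tolist())   # ['aaaddd', 'bbbeee', 'cccfff']
print(df["ABN"].tolist())  # ['aaa-ddd-1', 'bbb-eee-2', 'ccc-fff-3']
```

Calling a Python function per row is typically slower than a vectorised operation on large frames, so this is a convenience sketch rather than a performance recommendation.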

5 Comments

Thanks, @COLDSPEED. Your solution appears promising. I tried "df.select_dtypes(include=['str']).sum(1)" but got the error below:

    File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2369, in select_dtypes
      invalidate_string_dtypes(dtypes)
    File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 497, in invalidate_string_dtypes
      raise TypeError("string dtypes are not allowed, use 'object' instead")
    TypeError: string dtypes are not allowed, use 'object' instead
Then when I change the code to df.select_dtypes(include=['object']).sum(1), it gave no error but the result is one column with all "0". Any idea why? Thanks!
@LarryZ what are your column types initially?
@COLDSPEED Thanks for the follow-up. A number of the columns contain mixed data types, both str and int. These columns are labeled as "object".
@LarryZ select_dtypes may not work, but everything else should.
