Concat dataframe having duplicate columns

Question

I have a series of dataframes that look like this:

   a    b    r
1  43  630  587    

   d    b    c
1  34  30  87

I want to create a new dataframe which looks like:

    a    b    r   d   c
   43  630  587   0   0
    0   30    0  34  87

I have used the code:

appended_data= pd.concat(appended_data, axis=0)

where the list appended_data contains the individual dataframes as elements. Earlier, when I used it with another dataset, it didn't throw any error, but with the new dataset it raises ValueError: Plan shapes are not aligned

Note: The earlier dataset also had duplicate columns and it worked fine then. I also updated pandas. These were the solutions I found online.

full code:

dir_list = [benign_freq_dir, malign_freq_dir]

appended_data = []

for l in dir_list:
    for root, dirs, files in os.walk(l):
        #print(root)
        for name in files:

            file = open(root+"/"+name,'r')
            print(name)
            print("\n")
            df = pd.read_csv(file,header=None,error_bad_lines=False)   #In windows and python3 always pass file object not the path directly in pd.read_csv
            #print(df)
            df = df.rename(columns={0: 'col'})
            #print(df)
            df = pd.DataFrame(df.col.str.split(' ',1).tolist(), columns = ['col1','col2']).T.reset_index(drop=True)
            df = df.rename(columns=df.iloc[0]).drop(df.index[0])
            print(df)


            appended_data.append(df)
            if l==benign_freq_dir:
                df['class']=0
            else:
                df['class']=1

#for l in appended_data:
#   print(l)
#   print(type(l))
appended_data= pd.concat(appended_data, axis=0,sort=False)

edit:

output for:

for dfx in appended_data:
    print(dfx.head(2).to_dict())

Those do not look like series though. Looks like you have two dataframes.
– Anton vBR, Commented Oct 21, 2018 at 8:30

Tom Wojcik · Accepted Answer · 2018-10-21 09:35:40Z

3

You will need an outer join for that.

import pandas as pd

df1 = pd.DataFrame({
    'a': [43],
    'b': [630],
    'r': [587]
})

df2 = pd.DataFrame({
    'd': [34],
    'b': [30],
    'c': [87]
})

df3 = df1.merge(df2, how='outer').fillna(0)
print(df3)

Yields what you need.

      a    b      r     d     c
0  43.0  630  587.0   0.0   0.0
1   0.0   30    0.0  34.0  87.0

Docs on pd.merge
Docs on outer join

EDIT: OP, pd.concat should work as expected and Anton has proven that.

Since pd.merge was my answer, I have to stick with that.

Some pseudocode if you want to merge a list of dataframes.

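def merge(lst, df=None):
    # Work on a copy so the caller's list is not emptied by pop().
    lst = list(lst)
    if df is None:
        df = lst.pop()
    if not lst:
        # Nothing left to merge (also covers a single-element list).
        return df.fillna(0)
    to_be_merged = lst.pop()
    # With no `on`, merge joins on the column names the two frames share.
    merged = df.merge(to_be_merged, how='outer')
    return merge(lst, merged)

df = merge(list_of_dfs)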

That way you will know instantly which df is at fault because clearly there's a problem with your data. Catch the exception and use .describe() and .info() to debug this issue.
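One way to narrow down the faulty frame before merging is to check each one for repeated column labels. This is only a sketch: find_duplicate_columns is a hypothetical helper, not a pandas function, and the sample frames are made up for illustration.

```python
import pandas as pd

def find_duplicate_columns(frames):
    """Map each frame's position in the list to the column labels it repeats."""
    problems = {}
    for i, frame in enumerate(frames):
        # Index.duplicated() marks every occurrence of a label after the first.
        dupes = frame.columns[frame.columns.duplicated()].unique().tolist()
        if dupes:
            problems[i] = dupes
    return problems

# Made-up sample data: the second frame repeats the '---' label.
good = pd.DataFrame({'a': [43], 'b': [630], 'r': [587]})
bad = pd.DataFrame([[1, 2, 3]], columns=['open', '---', '---'])

report = find_duplicate_columns([good, bad])
print(report)
```

Any position that shows up in the report is a frame worth inspecting with .info() before it goes into merge or concat.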

edited Oct 21, 2018 at 9:35

answered Oct 21, 2018 at 8:26


2 Comments

ubuntu_noob Over a year ago

I am storing the dataframes in a list and then I need to stack them up.

ubuntu_noob Over a year ago

This gives the error: pandas.errors.MergeError: Data columns not unique: Index(['getpid', 'msgget', 'ioctl', 'mmap2', 'mprotect', 'clone', 'recv', 'close', 'munmap', 'class', 'getuid32', 'semget', 'open', 'write', 'dup', 'access', 'stat64', 'fstat64', '_llseek', 'read', 'lseek', 'fcntl64', 'flock', 'pread', 'gettimeofday', 'brk', 'sigprocmask', 'getpriority', 'getdents64', 'writev', 'ipc_subcall', 'chmod', 'sched_yield', 'pipe', 'fork', '---', '---'], dtype='object')
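The index in that message ends with the label '---' twice, so at least one frame has a repeated column name. A minimal sketch of one possible workaround follows, assuming it is acceptable to keep only the first column for each repeated label. The frames and values here are invented for illustration, and whether dropping the later columns is right depends on what '---' means in the real files.

```python
import pandas as pd

# Invented frame with the same kind of problem: '---' appears twice.
df = pd.DataFrame([[5, 2, 7, 9]], columns=['open', 'close', '---', '---'])

# Keep only the first column for each label; later repeats are dropped.
deduped = df.loc[:, ~df.columns.duplicated()]

# A second invented frame with a partly different set of columns.
other = pd.DataFrame({'open': [1], 'read': [4]})

# With unique labels in every frame, the frames can be stacked row-wise.
stacked = pd.concat([deduped, other], axis=0, sort=False, ignore_index=True).fillna(0)
print(stacked)
```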

Anton vBR · Accepted Answer · 2018-10-21 08:32:28Z

2

You can use pd.concat. You should, however, pass both dataframes together in a list.

pd.concat([df1,df2], axis=0, sort=False).fillna(0) #.astype(int) for ints

#      a    b      r     d     c
#0  43.0  630  587.0   0.0   0.0
#0   0.0   30    0.0  34.0  87.0

Sample data from Tom Wojcik.
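The same call works on a list of any length. As a sketch, assuming a fresh 0..n-1 row index and integer values are wanted, ignore_index=True and astype(int) can be added (the name frames is just illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [43], 'b': [630], 'r': [587]})
df2 = pd.DataFrame({'d': [34], 'b': [30], 'c': [87]})

# Any list of dataframes can be passed; two are used here.
frames = [df1, df2]

# ignore_index=True renumbers the rows instead of repeating each frame's own index.
# fillna(0) replaces the gaps left by missing columns, so astype(int) is safe.
stacked = (
    pd.concat(frames, axis=0, sort=False, ignore_index=True)
    .fillna(0)
    .astype(int)
)
print(stacked)
```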

answered Oct 21, 2018 at 8:32


9 Comments

ubuntu_noob Over a year ago

As I have mentioned, I have done exactly the same but I still got the error ValueError: Plan shapes are not aligned

Anton vBR Over a year ago

@ubuntu_noob In that case I suggest you try to share some data you can play with. Just like Tom provided us. See minimal reproducible example for more info.

ubuntu_noob Over a year ago

That is the type of data I have... that's why it's confusing... it worked before, but now with the new dataset it does not.

Anton vBR Over a year ago

@ubuntu_noob Yes but you should share your data as a verifiable example. If you for instance look at Tom's code: it is runnable. If you create a runnable example of your problem and point out what is wrong you get help faster and it serves the community.

ubuntu_noob Over a year ago

Yes, I understand your points and they are valid ones too... the example provided by Tom is actually a correct representation of the data I have... could you provide some help in identifying where the problem is with my data?
