I have a list of CSV files. Each file has 5 columns, with ‘id’ as the only common column (primary key). The other 4 columns differ from file to file.

My point of interest is the 5th (last) column, which is different for each file. I want to merge them on ‘id’.

I have tried the following code, but it concatenates row-wise, producing many duplicate ‘id’ values as well as ‘NaN’ entries:

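import glob

import pandas as pd

# `path` is the directory that holds the CSV files.
filelist = glob.glob(path + "/*.csv")

li = []
for filename in filelist:
    # Columns are zero-indexed: 0 is 'id' and 4 is the 5th (last) column.
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[0, 4])
    li.append(df)

# axis=0 stacks the frames row-wise, which is what causes the duplicates.
frame = pd.concat(li, axis=0, ignore_index=True)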

I want to concatenate them column-wise, keeping only my column of interest (the 5th column) from each file.

For example:

My list of files: ['df1.csv', 'df2.csv', 'df3.csv', 'df4.csv']

df1.csv has the following structure:

   ID  No1 AA
0   1   0   4
1   2   1   5
2   3   0   6

df2.csv has this structure:

   ID  No2 BB
0   2   0   5
1   3   1   6
2   4   0   7

The list goes on. My desired output would be:

   ID   AA   BB  CC  DD
0   1  4.0  NaN   0   1
1   2  5.0  5.0   1   0
2   3  6.0  6.0   1   0
3   4  NaN  7.0   1   1

Any suggestions would be appreciated. Thank you.

  • When reading in the data, you could set id as the index column, then call DataFrame.join on all the dataframes. Alternatively, use the first dataframe as the left dataframe and the others as the right, and merge on the id index. Commented Jan 20, 2021 at 21:57
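
A minimal sketch of the index-and-join idea from the comment above, assuming small in-memory frames in place of the real CSV files (`df1`, `df2`, `df3`, `trimmed`, and `joined` are illustrative names, and the `CC` data is made up for the example). `DataFrame.join` accepts a list of frames and aligns them all on the index:

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the CSV files in the question.
df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})
df3 = pd.DataFrame({"ID": [1, 3, 4], "No3": [0, 1, 0], "CC": [2, 3, 4]})

# Index each frame by ID and keep only its last column (as a one-column frame).
trimmed = [df.set_index("ID").iloc[:, [-1]] for df in (df1, df2, df3)]

# Join the first frame with all the others at once; how="outer" keeps
# every ID that appears in any frame and fills the gaps with NaN.
joined = trimmed[0].join(trimmed[1:], how="outer").sort_index().reset_index()
print(joined)
```

Joining a list this way requires the column names to be distinct across frames, which holds here because the last column differs in every file.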

2 Answers

Starting from your example, the easiest approach seems to be setting 'ID' as the index and joining on it implicitly. The last column is selected by position, using the index -1:

import pandas as pd

filelist = [
    '/tmp/csvs/df1.csv',
    '/tmp/csvs/df2.csv',
]

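result = pd.DataFrame()

for f in filelist:
    # sep=r'\s+' parses whitespace-delimited files like the sample above;
    # drop the argument for comma-separated files.
    df = pd.read_csv(f, sep=r'\s+').set_index('ID')
    last_col = df.columns[-1]  # column of interest, selected by position
    # An outer join keeps every ID seen in any file.
    result = result.join(df[last_col], how='outer')
result.reset_index(inplace=True)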

result

Out[1]: 
   ID   AA   BB
0   1  4.0  NaN
1   2  5.0  5.0
2   3  6.0  6.0
3   4  NaN  7.0
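
For real comma-separated files, one possible variant reads only the key and the last column of each file and aligns everything with a single `pd.concat(axis=1)`. This is a sketch under assumptions: the helper `read_id_and_last` and the temporary sample files are hypothetical, and the key is assumed to be the first column of every file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_id_and_last(path):
    """Read only the first (key) and last columns of one CSV file."""
    # nrows=0 parses just the header, so the column names are known cheaply.
    cols = pd.read_csv(path, nrows=0).columns
    return pd.read_csv(path, usecols=[cols[0], cols[-1]], index_col=cols[0])


# Hypothetical sample files mirroring df1.csv and df2.csv from the question.
samples = {
    "df1.csv": "ID,No1,AA\n1,0,4\n2,1,5\n3,0,6\n",
    "df2.csv": "ID,No2,BB\n2,0,5\n3,1,6\n4,0,7\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        Path(tmp, name).write_text(text)
    filelist = sorted(Path(tmp).glob("*.csv"))
    frames = [read_id_and_last(f) for f in filelist]

# axis=1 places the frames side by side, aligned on the shared ID index;
# the default outer join keeps IDs that are missing from some files.
result = pd.concat(frames, axis=1).sort_index().reset_index()
print(result)
```

Reading the header first avoids hard-coding a column position such as `usecols=[0, 4]`, so the same helper works for files with any number of columns.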

Merge on ID using only the first and last columns:

df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")

After the first merge, the result already holds only ID and the columns of interest, so merge it directly with the trimmed next dataframe:

df = df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer")

In use:

import pandas as pd

data1 = {"ID":[1,2,3], "No1":[0,1,0], "AA":[4,5,6]}
data2 = {"ID":[2,3,4], "No2":[0,1,0], "BB":[5,6,7]}
data3 = {"ID":[1,3,4], "No3":[0,1,0], "CC":[2,3,4]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")
print(df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer"))

Output:

   ID   AA   BB   CC
0   1  4.0  NaN  2.0
1   2  5.0  5.0  NaN
2   3  6.0  6.0  3.0
3   4  NaN  7.0  4.0
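
With many files, chaining the merges by hand gets tedious. One way to generalize the pattern, sketched here with `functools.reduce` (the names `frames`, `trimmed`, and `merge_on_id` are illustrative, not part of the answer above):

```python
from functools import reduce

import pandas as pd

# Hypothetical frames standing in for the dataframes read from each file.
frames = [
    pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]}),
    pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]}),
    pd.DataFrame({"ID": [1, 3, 4], "No3": [0, 1, 0], "CC": [2, 3, 4]}),
]


def merge_on_id(left, right):
    """Outer-merge two frames on ID so no ID from either side is lost."""
    return left.merge(right, on="ID", how="outer")


# Keep only the first (ID) and last column of every frame.
trimmed = [df.iloc[:, [0, -1]] for df in frames]

# reduce applies merge_on_id pairwise: ((f1 merge f2) merge f3) and so on.
merged = reduce(merge_on_id, trimmed).sort_values("ID", ignore_index=True)
print(merged)
```

This performs the same sequence of merges as the manual version, just driven by the length of the list.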
