Loop through Multiple CSV Files and Merge with Specific Columns [Pandas]

Question

I have a list of CSV files. Each file has 5 columns, with ‘id’ as the only common column (primary key). The other 4 columns differ from file to file.

My point of interest is the 5th (last) column, which is different for each file. I want to merge them on ‘id’.

I have tried the following code, but it concatenates row-wise, giving me many duplicate ‘id’ values as well as ‘NaN’ values:

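import glob
import pandas as pd

# path: directory that holds the CSV files (defined elsewhere)
filelist = glob.glob(path + "/*.csv")
li = []
for filename in filelist:
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[0,5])
    li.append(df)

# axis=0 stacks the frames on top of each other (row-wise)
frame = pd.concat(li, axis=0, ignore_index=True)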

I want to concatenate them column-wise, keeping only my point-of-interest column (the 5th column) from each file.

For example:

My list of files: ['df1.csv', 'df2.csv', 'df3.csv', 'df4.csv']

df1.csv has the following structure:

   ID  No1 AA
0   1   0   4
1   2   1   5
2   3   0   6

df2.csv has this structure:

   ID  No2 BB
0   2   0   5
1   3   1   6
2   4   0   7

The list goes on. My desired output would be:

    ID  AA  BB  CC  DD
0   1   4.0 NaN 0   1
1   2   5.0 5.0 1   0
2   3   6.0 6.0 1   0
3   4   NaN 7.0 1   1

Any suggestions would be appreciated. Thank you.

When reading in the data, you could set id as the index column, then run DataFrame.join on all the dataframes. Alternatively, use the first dataframe as the left dataframe and each of the other dataframes as the right, and merge on the id index.
– sammywemmy, Commented Jan 20, 2021 at 21:57
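
The first suggestion in this comment could look roughly like the sketch below. It assumes the frames are built in memory from the question's example rather than read from disk, and that the value columns have distinct names. Passing a list to DataFrame.join aligns all frames on their index.

```python
import pandas as pd

# Example frames copied from the question (stand-ins for the CSV files).
df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})

# Index each frame by ID and keep only its last column as a one-column frame.
frames = [df.set_index("ID").iloc[:, [-1]] for df in (df1, df2)]

# Join the first frame with a list of the remaining frames on the index.
joined = frames[0].join(frames[1:], how="outer").sort_index().reset_index()
print(joined)
```

IDs missing from a file show up as NaN in that file's column, which is why the integer values become floats.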

apaolillo · Accepted Answer · 2021-01-20 22:49:02Z


Starting from your example, the easiest approach seems to be setting 'ID' as the index and joining on it implicitly. The last column can be retrieved by position with the numerical index -1:

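import pandas as pd

filelist = [
    '/tmp/csvs/df1.csv',
    '/tmp/csvs/df2.csv',
]

result = pd.DataFrame()

for f in filelist:
    # The example files are whitespace-delimited, hence sep=r'\s+';
    # drop the sep argument for ordinary comma-separated files.
    df = pd.read_csv(f, sep=r'\s+').set_index('ID')
    last_col = df.columns[-1]  # column of interest, selected by position
    # An outer join keeps every ID seen so far and fills the gaps with NaN
    result = result.join(df[last_col], how='outer')

# Make sure the index is named 'ID' before turning it back into a column
result = result.rename_axis('ID').reset_index()

result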

Out[1]: 
   ID   AA   BB
0   1  4.0  NaN
1   2  5.0  5.0
2   3  6.0  6.0
3   4  NaN  7.0
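
As a variant of the loop above, here is a minimal sketch that collects the last column of each file and calls pd.concat once with axis=1 instead of joining repeatedly. The temporary directory and the comma-separated file contents are assumptions made so the example runs on its own. It also assumes that IDs are unique within each file.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical file contents mirroring the question's example.
samples = {
    "df1.csv": "ID,No1,AA\n1,0,4\n2,1,5\n3,0,6\n",
    "df2.csv": "ID,No2,BB\n2,0,5\n3,1,6\n4,0,7\n",
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write the sample files so that glob has something to find.
    for name, text in samples.items():
        with open(os.path.join(tmpdir, name), "w") as fh:
            fh.write(text)

    filelist = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))

    # Keep only the last column of each file, indexed by ID.
    columns = [pd.read_csv(f, index_col="ID").iloc[:, -1] for f in filelist]

# axis=1 places the Series side by side, aligned on the union of all IDs.
result = pd.concat(columns, axis=1).sort_index().reset_index()
print(result)
```

The difference from the code in the question is axis=1 together with an ID index: the columns are aligned by ID rather than stacked row-wise.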


noah · Accepted Answer · 2021-01-20 22:29:49Z

Merge on ID using only the first and last columns:

df = df1.iloc[:, [0, -1]].merge(df2.iloc[:, [0, -1]], on="ID", how="outer")

After the first merge, each further file is merged into the running result in the same way:

df = df.merge(df3.iloc[:, [0, -1]], on="ID", how="outer")

In use:

import pandas as pd

data1 = {"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]}
data2 = {"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]}
data3 = {"ID": [1, 3, 4], "No2": [0, 1, 0], "CC": [2, 3, 4]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)

# iloc[:, [0, -1]] keeps only the first (ID) and last columns of each frame
df = df1.iloc[:, [0, -1]].merge(df2.iloc[:, [0, -1]], on="ID", how="outer")
print(df.merge(df3.iloc[:, [0, -1]], on="ID", how="outer"))

Output:

   ID   AA   BB   CC
0   1  4.0  NaN  2.0
1   2  5.0  5.0  NaN
2   3  6.0  6.0  3.0
3   4  NaN  7.0  4.0
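
For a longer list of files, the chained merges can be folded into a single expression. The sketch below uses functools.reduce over a list of frames. The helper name merge_on_id is made up for illustration, and the frames are the same in-memory examples as above.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})
df3 = pd.DataFrame({"ID": [1, 3, 4], "No2": [0, 1, 0], "CC": [2, 3, 4]})
frames = [df1, df2, df3]


def merge_on_id(left, right):
    """Outer-merge two frames on ID (hypothetical helper for this sketch)."""
    return left.merge(right, on="ID", how="outer")


# Trim each frame to its first and last columns, then merge pairwise.
trimmed = (frame.iloc[:, [0, -1]] for frame in frames)
merged = reduce(merge_on_id, trimmed).sort_values("ID", ignore_index=True)
print(merged)
```

In a real loop, frames would be built with pd.read_csv for each file name instead of being constructed by hand.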
