Join two csv files with pandas/python without duplicates

Question

I would like to concatenate 2 csv files. Each CSV file has the following structure:

File 1

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735

File 2

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034

I got a final CSV that looks like this:

Final file

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034

So I have done this:

import pandas as pd

df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")

full_df = pd.concat([df1,df2])

full_df = full_df.groupby(['id','category_id','lat','lng']).count()

full_df2 = full_df[['id','category_id']].groupby('id').agg('count')

full_df2.to_csv("final.csv",index=False)

I tried to group by id, category_id, lat and lng, because the name could change. After the first groupby I want to group again, this time by id and category_id only: as shown in my example, the first row changed in lng, probably because file2 is an update of file1.

I don't understand groupby, because when I print the result I get only the count values.

I edited the file @shivsn

– l4nd0, Commented Jul 3, 2016 at 19:36

shawnheide · Accepted Answer · 2016-07-03 20:03:31Z

5

One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.

full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')

Check the documentation for drop_duplicates if you need further help.
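As a minimal sketch of the approach above, the snippet below rebuilds the two files from the question as in-memory strings. That is an assumption made only so the example runs on its own, and the header is written as category_id to match the column name used in the code (the sample files show category-id). It suggests that a plain drop_duplicates() still keeps both Area51 rows, because their lng values differ, while combining subset with keep='last' keeps only the row from the second file.

```python
import io

import pandas as pd

# In-memory stand-ins for file1.csv and file2.csv (assumed layout, see above).
FILE1 = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
"""
FILE2 = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
"""

df1 = pd.read_csv(io.StringIO(FILE1))
df2 = pd.read_csv(io.StringIO(FILE2))

# Stack the two frames; ignore_index avoids repeated index labels.
full_df = pd.concat([df1, df2], ignore_index=True)

# Only rows identical in every column are dropped, so both Area51 rows remain.
exact_unique = full_df.drop_duplicates(keep="last")

# Rows sharing id and category_id count as duplicates; the last one (file 2) wins.
by_key = full_df.drop_duplicates(subset=["id", "category_id"], keep="last")

print(len(full_df), len(exact_unique), len(by_key))
```

Whether keep='last' or the default keep='first' is appropriate depends on which file should take precedence; here the second file is treated as the update.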

edited Jul 3, 2016 at 20:03

answered Jul 3, 2016 at 19:22

shawnheide


1 Comment

Stevoisiak May 16 at 18:43

The keep parameter is useful, although I don't think it's relevant to this question

l4nd0 · Accepted Answer · 2016-07-03 21:28:41Z

1

I was able to solve this problem with the following code:

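import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# Stack both files, then drop rows that match on id, category_id, lat and lng
df_final = pd.concat([df1, df2]).drop_duplicates(subset=['id', 'category_id', 'lat', 'lng']).reset_index(drop=True)
print(df_final.shape)

# Keep one row per (id, category_id); the default keep='first' retains the earliest row
df_final2 = df_final.drop_duplicates(subset=['id', 'category_id']).reset_index(drop=True)

df_final2.to_csv('final.csv', index=False)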
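For comparison with the groupby attempt in the question, here is a small sketch on hypothetical miniature data (the ids, names and coordinates below are made up for illustration). It shows why count() returns only counts, and how groupby(...).last() can return actual row values that, on this data, match drop_duplicates with keep='last'.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated files: "a1" appears twice
# with a slightly different lng, as Area51 does in the question.
full_df = pd.DataFrame({
    "id": ["a1", "a1", "b2"],
    "name": ["Area51", "Area51", "Duomo"],
    "category_id": ["c1", "c1", "c2"],
    "lat": [45.448, 45.448, 45.463],
    "lng": [9.1441, 9.1442, 9.1920],
})

# count() replaces each non-key column with the number of non-null values
# per group, and the group keys move into the index.
counts = full_df.groupby(["id", "category_id"]).count()

# last() returns the last non-null value of each column per group;
# as_index=False keeps the keys as ordinary columns.
latest = full_df.groupby(["id", "category_id"], as_index=False).last()

# The same rows obtained without groupby.
deduped = (
    full_df.drop_duplicates(subset=["id", "category_id"], keep="last")
    .reset_index(drop=True)
)

print(counts)
print(latest)
```

Note that groupby sorts the result by the group keys by default, whereas drop_duplicates preserves the original row order, so the two results may be ordered differently on other data.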

answered Jul 3, 2016 at 21:28

l4nd0
