I would like to concatenate 2 csv files. Each CSV file has the following structure:

File 1

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735

File 2

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034

I got a final CSV that looks like this:

Final file

id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034

So I have done this:

import pandas as pd

df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")

full_df = pd.concat(df1,df2)

full_df = full_df.groupby(['id','category_id','lat','lng']).count()

full_df2 = full_df[['id','category_id']].groupby('id').agg('count')

full_df2.to_csv("final.csv",index=False)

I tried to group by id, category_id, lat and lng, since the name could change. After the first groupby I want to group again, this time by id and category_id only, because, as shown in my example, the lng of the first row changed, probably because file2 is an update of file1.

I don't understand groupby, because when I tried to print the result I got just the count values.

1
  • I edited the file @shivsn Commented Jul 3, 2016 at 19:36

2 Answers

5

One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.

full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')

Check the documentation for drop_duplicates if you need further help.
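For illustration, here is a minimal sketch of that approach using the sample rows from the question, read from in-memory strings rather than file1.csv and file2.csv. The names FILE1_TEXT, FILE2_TEXT and COLUMN_TYPES are made up for this example, and the header is written as category_id to match the question's code (the sample data shows category-id, so that is an assumption). Reading id and category_id as strings is also my own precaution, not something the question does.

```python
import io

import pandas as pd

# Sample rows copied from the question; the header uses category_id (assumption).
FILE1_TEXT = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
"""

FILE2_TEXT = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
"""

# Keep the identifier columns as text so an all-digit id is not read as a number.
COLUMN_TYPES = {"id": str, "category_id": str}

df1 = pd.read_csv(io.StringIO(FILE1_TEXT), dtype=COLUMN_TYPES)
df2 = pd.read_csv(io.StringIO(FILE2_TEXT), dtype=COLUMN_TYPES)

# pd.concat takes a list of DataFrames, not separate positional arguments.
full_df = pd.concat([df1, df2], ignore_index=True)

# Without subset=..., only rows identical in every column count as duplicates.
unique_df = full_df.drop_duplicates(keep="last")

print(unique_df)
```

With this data, both Area51 rows survive because their lng values differ, so a full-row drop_duplicates treats them as different rows.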

1 Comment

The keep parameter is useful, although I don't think it's relevant to this question
1

I was able to solve this problem with the following code:

import pandas as pd

df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")

df_final=pd.concat([df1,df2]).drop_duplicates(subset=['id','category_id','lat','lng']).reset_index(drop=True)
print(df_final.shape)

df_final2=df_final.drop_duplicates(subset=['id','category_id']).reset_index(drop=True)

df_final2.to_csv('final.csv', index=False)
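One thing to be aware of: drop_duplicates keeps the first occurrence by default, so for an id that appears in both files the row from file1 is the one that survives. If file2 is meant to be the update, passing keep='last' may be closer to what is wanted. Below is a rough sketch of that variant on the question's sample rows. OLD_TEXT, NEW_TEXT and the other variable names are made up for the example, the data is read from strings instead of the real files, and the header is assumed to be category_id as in the code above.

```python
import io

import pandas as pd

# Sample rows from the question; the category_id header spelling is an assumption.
OLD_TEXT = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
"""

NEW_TEXT = """id,name,category_id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
"""

key_types = {"id": str, "category_id": str}
old_df = pd.read_csv(io.StringIO(OLD_TEXT), dtype=key_types)
new_df = pd.read_csv(io.StringIO(NEW_TEXT), dtype=key_types)

# Order matters: the newer file goes last so that keep="last" prefers its rows.
combined = pd.concat([old_df, new_df], ignore_index=True)

# One row per (id, category_id), taking the most recent version of each.
latest = (
    combined
    .drop_duplicates(subset=["id", "category_id"], keep="last")
    .reset_index(drop=True)
)

print(latest)
```

On this sample the result has four rows, and the Area51 row carries the lng from file2. As far as I can tell, the first drop_duplicates over id, category_id, lat and lng is not needed when the second call uses a subset of those columns with the same keep setting, since the second call would pick the same rows anyway.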
