Combining two csv in python - Update if exists in newer csv

Question

I am trying to combine two CSV files: I want to update the file containing older data (old.csv) wherever a matching row exists in the file of newer data (new.csv). Both files have the same columns (headings), and each row can be identified by a unique id.

old.csv

id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1

new.csv

id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1

Here is what I have tried so far in Python & Pandas:

import pandas as pd
f1 = pd.read_csv('old.csv', delimiter=',')
f2 = pd.read_csv('new.csv', delimiter=',')
with open('final.csv', 'w', encoding='utf-8', newline='') as out:
    pd.merge(f1, f2, on='id', how='left').to_csv(out, sep=',', index=False)

Current output:

id,description_x,listing_x,url_x,default_x,description_y,listing_y,url_y,default_y
2471582,spacex,536,www.spacex.com,0,spacex,888.0,www.spacex.com,0.0
3257236,alibaba,875,www.alibaba.com,0,,,,
3539697,ethihad,344,www.etihad.com,0,ethihad,348.0,www.etihad.com,0.0
2324566,pretzel,188,www.example.com,1,pretzel,396.0,www.pretzelshopexample12345.com,1.0

What I am trying to achieve:

id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1

So I was wondering: how can I use pandas to merge the two CSV files on id, replacing the whole row when newer data exists in new.csv while keeping the remaining rows from old.csv? Thank you in advance for any help on this.

What is the issue, exactly? What part are you struggling with? I see no evidence of any attempt or research.
– AMC, Commented Jan 29, 2020 at 17:53

Gary · Accepted Answer · 2020-01-29 21:18:58Z

1

This should work:

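# f1 (old.csv) and f2 (new.csv) are the DataFrames read in the question.
f1 = f1.set_index('id')
f2 = f2.set_index('id')
# update() works in place: rows are matched on the index (id) and
# every non-NaN value in f2 overwrites the corresponding value in f1.
f1.update(f2)
f1.reset_index(inplace=True)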

Output:

f1:

    id      description listing url                             default
0   2471582 spacex      888.0   www.spacex.com                  0.0
1   3257236 alibaba     875.0   www.alibaba.com                 0.0
2   3539697 ethihad     348.0   www.etihad.com                  0.0
3   2324566 pretzel     396.0   www.pretzelshopexample12345.com 1.0
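
The output above shows listing and default as floats (888.0, 0.0). As a minimal sketch, assuming the sample data from the question and that integer columns should stay integers, the original dtypes can be remembered before calling update() and restored afterwards. The names OLD_CSV, NEW_CSV, original_dtypes and result are illustrative, and io.StringIO stands in for the real files:

```python
import io

import pandas as pd

# Sample data copied from the question; io.StringIO stands in for the files.
OLD_CSV = """id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1
"""
NEW_CSV = """id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1
"""

old = pd.read_csv(io.StringIO(OLD_CSV)).set_index("id")
new = pd.read_csv(io.StringIO(NEW_CSV)).set_index("id")

# Remember the column dtypes of old.csv (e.g. int64 for listing).
original_dtypes = old.dtypes.to_dict()

# In place: only non-NaN values from `new` are copied, matched on id.
old.update(new)

# Cast back in case update() turned integer columns into floats.
result = old.astype(original_dtypes).reset_index()

# Equivalent to writing final.csv, but kept in memory here.
csv_text = result.to_csv(index=False)
print(csv_text)
```

The cast back only works when no NaN remains in the integer columns, which holds here because old.csv has no blanks.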

edited Jan 29, 2020 at 21:18

answered Jan 29, 2020 at 17:43

5 Comments

Gary Over a year ago

@Jay T: Do let me know if this worked for you. Also accept the answer if it did!

Jay T. Over a year ago

Hi, thanks for the help. But it doesn't seem to update when the row in new.csv is something like 2324566,pretzel,396,,1. When a field becomes blank in new.csv, the result does not reflect that. Any idea why?

Jay T. Over a year ago

sorry forgot to tag you in the reply comment, can't seem to edit it now. @Gary

Gary Over a year ago

@Jay T: If the value of url is NaN in f2, then the code above keeps f1's value for that url cell only. If you want exactly what is in f2 to be copied, replace NaN with 0 first (use f2.fillna(0, inplace=True)) and then run the code above. Later, you can change the 0s back to NaN if desired. Note that the default column already contains real 0s, so pick a placeholder that does not occur in your data if you plan to convert it back.

Gary Over a year ago

This code is more efficient as it does not require loops or if statements, and it is always recommended to avoid those wherever possible for a faster result.
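
For the blank-field case raised in the comments above, one possible alternative to the placeholder trick is to replace whole rows instead of individual values. This is only a sketch, assuming ids are unique within each file and that rows present only in new.csv should be ignored. It stacks the new rows on top of the old ones, keeps the first occurrence of each id, and restores the order of old.csv. The names OLD_CSV, NEW_CSV, combined and result are illustrative:

```python
import io

import pandas as pd

OLD_CSV = """id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1
"""
# The last row has a blank url, as in the comment above.
NEW_CSV = """id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,,1
"""

old = pd.read_csv(io.StringIO(OLD_CSV), index_col="id")
new = pd.read_csv(io.StringIO(NEW_CSV), index_col="id")

# New rows come first, so they win when an id appears in both files.
combined = pd.concat([new, old])
combined = combined[~combined.index.duplicated(keep="first")]

# Back to the row order of old.csv; ids found only in new.csv are dropped.
result = combined.reindex(old.index)

print(result.reset_index().to_csv(index=False))
```

Because entire rows are taken from new.csv, a blank field there stays blank in the result rather than falling back to the old value.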

xArbisRox · Answer · 2020-01-29 18:16:18Z

0

This is my first Stack Overflow response, so there will probably be prettier solutions coming up ;) ... but until then, my approach works:

import pandas as pd

old_csv = pd.read_csv(r"YourPath\old.csv", index_col="id")
new_csv = pd.read_csv(r"YourPath\new.csv", index_col="id")

# Empty frame with the same columns; rows are added one id at a time.
updated_csv = pd.DataFrame(columns=new_csv.columns)

old_ids = list(old_csv.index)
new_ids = list(new_csv.index)

# One pass over the old ids is enough; an outer loop over new_ids
# would only repeat the same assignments.
for old_id in old_ids:
    if old_id in new_ids:
        updated_csv.loc[old_id, :] = new_csv.loc[old_id, :]
    else:
        updated_csv.loc[old_id, :] = old_csv.loc[old_id, :]

# Use the following if you want to have the id as a column again:
updated_csv.reset_index(drop=False, inplace=True)
updated_csv.rename(columns={"index": "id"}, inplace=True)

So, putting this in words: I am using a loop to iterate over the individual ids. I create two lists holding the ids of each file, and a new DataFrame that has the same columns as before and is filled inside the loop. If the old_id is in the list of new_ids, the script takes the data for that id from new_csv; if it is not, it takes the data from old_csv.

Hope that helps, looking forward to Feedback.

answered Jan 29, 2020 at 18:16

3 Comments

Jay T. Over a year ago

Hi, thanks for the help. Can I ask what [old_id, :] means? Specifically, the : after the old_id?

Jay T. Over a year ago

sorry forgot to tag you in the reply comment, can't seem to edit it now.

xArbisRox Over a year ago

Sorry for the late response. This is basic pandas indexing to extract data from the DataFrame. The normal notation is e.g. pd.DataFrame.loc[indexname, columnname], and a : in either position means "all values" along that axis. So [old_id, :] means that I extract all column values for the old_id I am currently iterating over. Read more about slicing and indexing here: pandas.pydata.org/pandas-docs/stable/user_guide/… However, stick to Gary's answer. I was not aware of the built-in method pd.DataFrame.update(); it is much slimmer and more efficient.
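
To make the .loc[row, column] notation concrete, here is a small sketch on a two-row frame built from the question's data. The variable names df, row, col and cell are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"description": ["spacex", "alibaba"], "listing": [536, 875]},
    index=pd.Index([2471582, 3257236], name="id"),
)

# One row label and ":" for the columns: every column of that row.
row = df.loc[2471582, :]

# ":" for the rows and one column label: every row of that column.
col = df.loc[:, "listing"]

# One row label and one column label: a single value.
cell = df.loc[2471582, "listing"]

print(row)
print(col)
print(cell)
```

Here row and col are pandas Series, while cell is a single scalar value.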
