
I am trying to combine two CSV files: I want to update the rows of the older data (old.csv) wherever a newer row exists in the newer data (new.csv). Both files have the same columns (headings), and each row can be identified by a unique id.

old.csv

id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1

new.csv

id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1

Here is what I have tried so far in Python & Pandas:

import pandas as pd
f1 = pd.read_csv('old.csv', delimiter=',')
f2 = pd.read_csv('new.csv', delimiter=',')
with open('final.csv', 'w', encoding='utf-8', newline='') as out:
    pd.merge(f1, f2, on='id', how='left').to_csv(out, sep=',', index=False)

Current output:

id,description_x,listing_x,url_x,default_x,description_y,listing_y,url_y,default_y
2471582,spacex,536,www.spacex.com,0,spacex,888.0,www.spacex.com,0.0
3257236,alibaba,875,www.alibaba.com,0,,,,
3539697,ethihad,344,www.etihad.com,0,ethihad,348.0,www.etihad.com,0.0
2324566,pretzel,188,www.example.com,1,pretzel,396.0,www.pretzelshopexample12345.com,1.0

What I am trying to achieve:

id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1

So I was wondering: how can I use pandas to merge the two CSV files on id, replacing the whole row when newer data exists in new.csv while keeping the remaining rows from old.csv? Thank you in advance for any help on this.

  • What is the issue, exactly? What part are you struggling with? I see no evidence of any attempt or research. Commented Jan 29, 2020 at 17:53

2 Answers


This should work:

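# Index both frames by id so that update() aligns rows on it
f1 = f1.set_index('id')
f2 = f2.set_index('id')
# Overwrite f1 in place with the non-NaN values from f2
f1.update(f2)
f1.reset_index(inplace=True)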

Output:

f1:

    id      description listing url                             default
0   2471582 spacex      888.0   www.spacex.com                  0.0
1   3257236 alibaba     875.0   www.alibaba.com                 0.0
2   3539697 ethihad     348.0   www.etihad.com                  0.0
3   2324566 pretzel     396.0   www.pretzelshopexample12345.com 1.0
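
For anyone who wants to run this end to end, here is a minimal sketch of the same update() approach on the sample data from the question. The inline strings OLD and NEW and the names original_dtypes, result and csv_text are my own stand-ins, not part of the original code. The astype step is an assumption about what you probably want: depending on the pandas version, update() may turn the integer columns into floats (as in the output above), so the sketch casts them back.

```python
import io
import pandas as pd

# Inline copies of the question's sample data (stand-ins for old.csv / new.csv)
OLD = """id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1
"""
NEW = """id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1
"""

f1 = pd.read_csv(io.StringIO(OLD)).set_index("id")
f2 = pd.read_csv(io.StringIO(NEW)).set_index("id")

# Remember the column types before updating
original_dtypes = f1.dtypes.to_dict()

# In-place update: values from f2 replace those in f1 for matching ids;
# NaN values in f2 never overwrite anything in f1
f1.update(f2)

# Undo a possible int -> float upcast (safe here because no NaN was introduced)
f1 = f1.astype(original_dtypes)

result = f1.reset_index()

# With no path, to_csv returns the CSV text; pass "final.csv" to write a file
csv_text = result.to_csv(index=False)
```

Rows whose id only exists in new.csv are not added by update(); only ids already present in old.csv are touched.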

5 Comments

@Jay T: Do let me know if this worked for you. Also accept the answer if it did!
Hi, thanks for the help. But it doesn't seem to update when, say, a row in new.csv looks like 2324566,pretzel,396,,1. When a field becomes blank in new.csv, the result does not reflect that. Any idea why that is?
sorry forgot to tag you in the reply comment, can't seem to edit it now. @Gary
@Jay T: If the value of url is NaN in f2, then the code above keeps the value from f1 for the url column alone. If you want exactly what is in f2 to be copied, then replace NaN with 0 first (use f2.fillna(0,inplace=True)) and then run the code above. Later, you can change the 0s back to NaN if desired (be careful with columns that contain real zeros, such as default here, since those would be converted too).
This code is more efficient as it does not require loops or if statements, and it is always recommended to avoid those wherever possible for a faster result.
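
If blanks in new.csv should be copied over as well, one alternative to the fillna(0) placeholder is to assign the matching rows wholesale with .loc. This is only a sketch under my own assumptions: the inline data, the blank-url row taken from the comment above, and the names NEW_WITH_BLANK, common and result are all illustrative, not from the original answer.

```python
import io
import pandas as pd

OLD = """id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1
"""
# Hypothetical newer file in which the url of one row has become blank
NEW_WITH_BLANK = """id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
2324566,pretzel,396,,1
"""

old = pd.read_csv(io.StringIO(OLD), index_col="id")
new = pd.read_csv(io.StringIO(NEW_WITH_BLANK), index_col="id")

# ids present in both files, in the order they appear in old
common = old.index.intersection(new.index)

# Replace the whole row for every common id, blanks (NaN) included
result = old.copy()
result.loc[common] = new.loc[common]
```

Unlike update(), this copies NaN values too, so the blank url ends up blank in the result. Ids that only exist in new.csv are still ignored.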

This will be my first StackOverflow response, so there will probably be prettier solutions coming up ;) ... but until then, my approach works:

import pandas as pd

old_csv = pd.read_csv(r"YourPath\old.csv", index_col="id")
new_csv = pd.read_csv(r"YourPath\new.csv", index_col="id")

updated_csv = pd.DataFrame(columns = new_csv.columns)

old_ids = [x for x in old_csv.index]
new_ids = [x for x in new_csv.index]

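# A single pass over the old IDs is enough; an outer loop over new_ids would only repeat it
for old_id in old_ids:
    if old_id in new_ids:
        # A newer row exists: take it from new_csv
        updated_csv.loc[old_id, :] = new_csv.loc[old_id, :]
    else:
        # No newer row: keep the one from old_csv
        updated_csv.loc[old_id, :] = old_csv.loc[old_id, :]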

# Use the following if you want to have the ID as column again:
updated_csv.reset_index(drop=False, inplace=True)
updated_csv.rename(columns={"index": "id"}, inplace=True)

So, putting this in words: I am basically looping over the individual IDs. I create two lists holding the IDs of each file, plus a new DataFrame that has the same columns as before and gets filled inside the loop. If an old_id is in the list of new_ids, the script takes the data for that ID from new_csv; if it is not, it takes the data from old_csv.
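
The rule described above (take the row from new_csv when the ID exists there, otherwise keep the old_csv row) can also be written without an explicit loop. The following is a sketch rather than a drop-in replacement: the inline sample data and the names OLD, NEW, combined and keep_last are assumptions made for illustration.

```python
import io
import pandas as pd

OLD = """id,description,listing,url,default
2471582,spacex,536,www.spacex.com,0
3257236,alibaba,875,www.alibaba.com,0
3539697,ethihad,344,www.etihad.com,0
2324566,pretzel,188,www.example.com,1
"""
NEW = """id,description,listing,url,default
2471582,spacex,888,www.spacex.com,0
3539697,ethihad,348,www.etihad.com,0
2324566,pretzel,396,www.pretzelshopexample12345.com,1
"""

old = pd.read_csv(io.StringIO(OLD), index_col="id")
new = pd.read_csv(io.StringIO(NEW), index_col="id")

# Stack old rows first and new rows second, so for a repeated id
# the last occurrence is the newer row
combined = pd.concat([old, new])

# Keep only the last occurrence of every id
keep_last = ~combined.index.duplicated(keep="last")

# Restore the row order of old and drop ids that only exist in new
result = combined[keep_last].reindex(old.index)
```

Calling result.reset_index() afterwards gives the id back as a regular column.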

Hope that helps, looking forward to Feedback.

3 Comments

Hi, thanks for the help. Can I ask what [old_id, :] means? Specifically the : after the old_id?
sorry forgot to tag you in the reply comment, can't seem to edit it now.
Sorry for the late response. This is basic pandas indexing, used to extract data from the DataFrame. The normal notation is e.g. df.loc[indexname, columnname] on a DataFrame df, and a : in either position means "everything along that axis". So [old_id, :] means that I extract all column values for the old_id I am currently iterating over. Read more about slicing and indexing here: pandas.pydata.org/pandas-docs/stable/user_guide/… However, stick to Gary's answer. I was not aware of the built-in method pd.DataFrame.update(); it is much slimmer and more efficient.
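
To make the .loc notation from the comment above concrete, here is a tiny sketch. The two-row DataFrame and the names df, row, col and cell are made up for illustration.

```python
import pandas as pd

# Small illustrative frame indexed by id
df = pd.DataFrame(
    {"description": ["spacex", "alibaba"], "listing": [536, 875]},
    index=pd.Index([2471582, 3257236], name="id"),
)

# One row label, all columns: a Series indexed by column name
row = df.loc[2471582, :]

# All rows, one column label: a Series indexed by id
col = df.loc[:, "listing"]

# One row label and one column label: a single value
cell = df.loc[3257236, "listing"]
```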
