I have two CSV files representing data from two different years. I know how to do a basic merge using csvwriter and dictkeys, but here is the problem: although the CSVs share most of their column headers, each may have unique columns. If a species was caught in one year but not the other, its column is present only in that year's file. How can I merge the new data into the old data, creating new columns and padding the rows that lack those columns with zero?

File 1: "Date","Time","Species A","Species B", "Species X"

File 2: "Date","Time", "Species A", "Species B", "Species C"

I need the end result to be one csv with this header: "Date","Time","Species A","Species B", "Species C", "Species X"

3 Answers

Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:

import pandas as pd

df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")

df = pd.concat([df1, df2]).fillna(0)
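# Date and Time first, then the species columns in alphabetical order.
df = df[["Date", "Time"] + sorted(c for c in df.columns if c not in ("Date", "Time"))]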
df.to_csv("merged_fish.csv", index=False)

Explanation:

First, we read in the two files:

>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
   Date  Time  Species A  Species B  Species X
0     1     2          3          4          5
1     6     7          8          9         10
2    11    12         13         14         15
>>> df2
   Date  Time  Species A  Species B  Species C
0    16    17         18         19         20
1    21    22         23         24         25
2    26    27         28         29         30

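Then we simply concatenate them, which automatically fills the missing data with NaN (the columns below are sorted alphabetically; recent pandas versions keep them in order of first appearance unless you pass sort=True to concat):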

>>> df = pd.concat([df1, df2])
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4        NaN          5     2
1     6          8          9        NaN         10     7
2    11         13         14        NaN         15    12
0    16         18         19         20        NaN    17
1    21         23         24         25        NaN    22
2    26         28         29         30        NaN    27

You want them filled with 0 instead, so:

>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4          0          5     2
1     6          8          9          0         10     7
2    11         13         14          0         15    12
0    16         18         19         20          0    17
1    21         23         24         25          0    22
2    26         28         29         30          0    27

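This order isn't quite the one you asked for, though. You wanted Date and Time first, followed by the species columns. Selecting the columns by name rather than by position works whatever order concat returned them in, so:

>>> df = df[["Date", "Time"] + sorted(c for c in df.columns if c not in ("Date", "Time"))]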
>>> df
   Date  Time  Species A  Species B  Species C  Species X
0     1     2          3          4          0          5
1     6     7          8          9          0         10
2    11    12         13         14          0         15
0    16    17         18         19         20          0
1    21    22         23         24         25          0
2    26    27         28         29         30          0

And then we save it as a CSV file:

>>> df.to_csv("merged_fish.csv", index=False)

producing

Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
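
Note that Species C and Species X are written as floats (0.0, 5.0) above, because a column containing NaN becomes a float column before fillna(0) runs. If you'd rather have whole-number counts in the output, one option is to cast the species columns back to integers after filling. This is only a sketch, assuming every species column really holds whole numbers; the inline sample data and the species_cols name are just for illustration:

```python
import io
import pandas as pd

# Illustrative stand-ins for fish1.csv and fish2.csv.
csv1 = io.StringIO("Date,Time,Species A,Species B,Species X\n1,2,3,4,5\n6,7,8,9,10\n")
csv2 = io.StringIO("Date,Time,Species A,Species B,Species C\n16,17,18,19,20\n21,22,23,24,25\n")

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# ignore_index=True renumbers the rows instead of repeating 0, 1, ... per file.
df = pd.concat([df1, df2], ignore_index=True).fillna(0)

# Pick the species columns by name and cast them back to integers.
species_cols = sorted(c for c in df.columns if c not in ("Date", "Time"))
df = df.astype({c: int for c in species_cols})

# Date and Time first, then the species columns alphabetically.
df = df[["Date", "Time"] + species_cols]
merged_csv = df.to_csv(index=False)
print(merged_csv)
```

If a species column could contain fractional values, skip the cast and keep the floats.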

1 Comment

This worked brilliantly, thank you! Pandas seems like it'll be very useful for other things I need as well.

Here's a csv module solution in Python 3:

import csv

# Generate some data...

csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''

csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''

with open('2012.csv','w') as f:
    f.write(csv1)
with open('2013.csv','w') as f:
    f.write(csv2)

# The actual program

years = ['2012.csv','2013.csv']

lines = []
headers = set()
for year in years:
    with open(year,'r',newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                 # Merge lines from all files.
        headers = headers.union(r.fieldnames) # Collect unique column names.

# Sort the unique headers keeping Date,Time columns first.
new_headers = ['Date','Time'] + sorted(headers - set(['Date','Time']))

with open('result.csv','w',newline='') as f:
    # restval is the value written for any column missing from a row.
    w = csv.DictWriter(f, new_headers, restval=0)
    w.writeheader()
    w.writerows(lines)

# View the result

with open('result.csv') as f:
    print(f.read())

Output:

Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3


According to the docs, you should be able to read both files with csv.DictReader, merge the keys (field names) from the two sets of extracted dictionaries, then use the fieldnames and restval parameters on csv.DictWriter to get your 0 defaults.
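
Roughly, that idea could look like the sketch below. merge_csv and its parameters (key_cols, fill) are names made up for this illustration, not part of the csv module. It assumes each source is a seekable text stream with a header row, and it reads the sources twice so the rows never have to be held in memory all at once:

```python
import csv
import io

def merge_csv(sources, out, key_cols=("Date", "Time"), fill=0):
    """Merge CSV text streams into `out`, padding missing columns with `fill`."""
    # First pass: collect the union of column names across all sources.
    extra = set()
    for src in sources:
        extra.update(csv.DictReader(src).fieldnames or [])
        src.seek(0)  # rewind so the second pass starts at the header again
    fieldnames = list(key_cols) + sorted(extra - set(key_cols))

    # Second pass: stream the rows; restval fills any column a row lacks.
    writer = csv.DictWriter(out, fieldnames, restval=fill)
    writer.writeheader()
    for src in sources:
        writer.writerows(csv.DictReader(src))

# In-memory stand-ins for the two yearly files (illustrative data).
old = io.StringIO("Date,Time,Species A,Species B,Species X\n04/01/2012,13:00,1,2,3\n")
new = io.StringIO("Date,Time,Species A,Species B,Species C\n04/01/2013,13:00,4,5,6\n")
result = io.StringIO()
merge_csv([old, new], result)
print(result.getvalue())
```

With real files, open them with newline='' as the csv docs recommend and pass the file objects in place of the StringIO streams.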
