Merge one column from variable number of csv files into one csv file

Question

Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.

I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:

UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010

I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:

UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339

I know the Pandas package would probably be an easier way to do this, but I'd rather not require third party packages (requiring the user to use easy_install, PIP, etc.) in this tool. I also realize I this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.

I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write values from the fourth column only in addition to the rows from the first csv file. All csv files have UID columns, which should match.

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimeter = ',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimeter = ',')
            # loop through the list of remaining csv files
            for acq_date in acq_date_list[1:]: # skip the first csv file
                # open and read other csv files
                with open(acq_date[1], 'rb') as other_csv:
                    rX = csv.reader(other_csv, delimeter = ',')
                    headerX = next(rX)
                    header_row = '%s,%s' % (header1, headerX)

                    # write header and subsequent merged rows to new csv file?

Maybe after:

headerX = next(rX)

I can split the header row into a list, and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files. Or is this just generally the wrong approach?

UPDATE 2/26/2016 I actually only got the solution by Gijs to partially work. The header columns are iteratively added, but not the rest of the values from the row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.

Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,

How big is each file? Can you fit them all in memory? (thinking that loading them into a dict in memory might help somehow) — Will
– Will, Commented Feb 24, 2016 at 20:14

Gijs · Accepted Answer · 2016-02-24 20:20:58Z

1

Loop through the files, keep track of which keys exist and write all records with csv.DictWriter and csv.DictReader.

import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)

This will write an empty 'cell' for records that didn't have the column.

With your file as both the first and the second .csv, with in the second case the last column renamed to 002 instead of 001, you would get this:

UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33

If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not in all_keys.

all_keys = list()

... 
         all_keys += list(set(reader.fieldnames).difference(set(all_keys)))

edited Feb 24, 2016 at 20:20

answered Feb 24, 2016 at 20:14

Gijs

11k5 gold badges29 silver badges42 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

jesselangdon Over a year ago

Great solution! Thanks for your help!

MaxU - stand with Ukraine · Accepted Answer · 2016-02-24 20:25:29Z

0

try pandas approach:

import pandas as pd

file_list = ['1.csv','2.csv','3.csv']

df = pd.read_csv(file_list[0])

for f in file_list[1:]:
    # use only 1-st and 4-th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')

df.to_csv('output.csv', index=False)

print(df)

Output:

   UID  Latitude  Longitude    001    007  015
0    1   -151.01      45.20  13121  11111   11
1    2   -151.13      45.16  15009  22222   12
2    3   -151.02      45.09  10067  33333   13
3    4   -151.33      45.03  14010  44444   14

output.csv

UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14

edited Feb 24, 2016 at 20:25

answered Feb 24, 2016 at 20:13

MaxU - stand with Ukraine

212k37 gold badges402 silver badges437 bronze badges

1 Comment

jesselangdon Over a year ago

This is definitely a nice approach, but I'm specifically trying to stay away from third party packages for this project. Thank you!

Collectives™ on Stack Overflow

Merge one column from variable number of csv files into one csv file

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related