1

Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.

I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:

UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010

I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:

UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339

I know the Pandas package would probably be an easier way to do this, but I'd rather not require third party packages (requiring the user to use easy_install, PIP, etc.) in this tool. I also realize I this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.

I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write values from the fourth column only in addition to the rows from the first csv file. All csv files have UID columns, which should match.

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimeter = ',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimeter = ',')
            # loop through the list of remaining csv files
            for acq_date in acq_date_list[1:]: # skip the first csv file
                # open and read other csv files
                with open(acq_date[1], 'rb') as other_csv:
                    rX = csv.reader(other_csv, delimeter = ',')
                    headerX = next(rX)
                    header_row = '%s,%s' % (header1, headerX)

                    # write header and subsequent merged rows to new csv file?

Maybe after:

headerX = next(rX)

I can split the header row into a list, and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files. Or is this just generally the wrong approach?

UPDATE 2/26/2016 I actually only got the solution by Gijs to partially work. The header columns are iteratively added, but not the rest of the values from the row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.

Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,
1
  • How big is each file? Can you fit them all in memory? (thinking that loading them into a dict in memory might help somehow) Commented Feb 24, 2016 at 20:14

2 Answers 2

1

Loop through the files, keep track of which keys exist and write all records with csv.DictWriter and csv.DictReader.

import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)

This will write an empty 'cell' for records that didn't have the column.

With your file as both the first and the second .csv, with in the second case the last column renamed to 002 instead of 001, you would get this:

UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33

If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not in all_keys.

all_keys = list()

... 
         all_keys += list(set(reader.fieldnames).difference(set(all_keys)))
Sign up to request clarification or add additional context in comments.

1 Comment

Great solution! Thanks for your help!
0

try pandas approach:

import pandas as pd

file_list = ['1.csv','2.csv','3.csv']

df = pd.read_csv(file_list[0])

for f in file_list[1:]:
    # use only 1-st and 4-th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')

df.to_csv('output.csv', index=False)

print(df)

Output:

   UID  Latitude  Longitude    001    007  015
0    1   -151.01      45.20  13121  11111   11
1    2   -151.13      45.16  15009  22222   12
2    3   -151.02      45.09  10067  33333   13
3    4   -151.33      45.03  14010  44444   14

output.csv

UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14

1 Comment

This is definitely a nice approach, but I'm specifically trying to stay away from third party packages for this project. Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.