Python: How to compare two csv files and add classifier in first file if value exists in second file

Question

I am trying to compare two CSV files. The first file (movements.csv) has 14 columns; the second (LCC.csv) has a single column. I want to check whether the entries (strings) in column 8 of movements.csv appear anywhere in column 1 of LCC.csv. If so, 'Yes' should be written to column 14; if not, 'No'. Below are the code I have tried so far and the error message I receive:

import csv

f1 = file('LCC.csv', 'rb') 
f2 = file('movements.csv', 'rb')
f3 = ('output.csv', 'wb') 

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

movements = list(c2)

for LCC_row in c1:
    row = 0
    found = False
    for movements_row in movements:
        output_row = movements_row
        if movements_row[7] == LCC_row[0]
            output_row.append('Yes')
            found = True
            break
        row += 1
    if not found:
        output_row.append('No')
    c3.writerow(output_row)

f1.close()
f2.close()
f3.close()

I'm a complete beginner with python, so any advice is appreciated! Optimally the check between the two columns would also disregard whether the strings are written in capital letters or not.

The error message comes after

c3.writerow(output_row)

as

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
>>>

LCC.csv (no header):

Air Ab  
Jamb  
Sw  
AIRF  
EURO

movements.csv (has a header):

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC  
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu,   
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A,   
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu,   
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

As already mentioned, the last column (LCC) is completely empty at the moment.

I receive an error message after if movements_row[7] == LCC_row[0], namely: File "<stdin>", line 6: if movements_row[7] == LCC_row[0] ^ SyntaxError: invalid syntax
– Anna Stünzi, Commented Dec 19, 2016 at 13:54
Please edit your question with the error message. And clearly mark which line causes it.
– Code-Apprentice, Commented Dec 19, 2016 at 13:55
@AnnaStünzi: Would it be OK to use pandas to solve this problem?
– Vikash Singh, Commented Dec 19, 2016 at 14:03

Moinuddin Quadri · Accepted Answer · 2016-12-19 14:01:44Z

1

The code has several issues. A few that I found at a glance:

You have an invalid opening quote in this line (a comma where the ' should be):

f2 = file('movements.csv', ,rb')
#                          ^

It should be:

f2 = file('movements.csv', 'rb')

In the code you shared, you have back quotes (`) in several places instead of single quotes ('). For example, these lines should be:
```
f1 = file('LCC.csv', 'rb') 
f3 = file('output.csv', 'wb')    
#     ^ also missing file here
```

The colon (:) after the if condition is missing. It should be:

if movements_row[7] == LCC_row[0]:
#                           Here ^

Also, you do not need parentheses to assign the string. Just assign it like this:

output_row[13] = 'Yes'
#                ^ As simple string
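
To make the difference between assigning to index 13 and appending concrete, here is a minimal sketch in Python 3 syntax. The sample row is taken from the question; the variable names `filled` and `appended` are only illustrative.

```python
# Sample data row from movements.csv; the trailing comma yields an empty 14th field.
row = ['Zue', 'LSZH', '2005', '200501', '25', '1/1/2005', 'Dep',
       'EURO', 'EUJ', 'Mans C', 'EG', 'Gb', 'Eu', '']

filled = list(row)        # copy, so the original row stays untouched
filled[13] = 'Yes'        # overwrite the empty LCC field (index 13 = column 14)

appended = list(row)
appended.append('Yes')    # adds a new field after the empty one

print(len(filled), filled[-1])        # 14 Yes
print(len(appended), appended[-2:])   # 15 ['', 'Yes']
```

Index assignment keeps the row at 14 fields, which matches the goal of filling the existing LCC column rather than adding a new one.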

edited Dec 19, 2016 at 14:01

answered Dec 19, 2016 at 13:54

Moinuddin Quadri


8 Comments

Code-Apprentice Over a year ago

` is called a back quote or back tick.

Anna Stünzi Over a year ago

Yes sorry, this is a copy paste error here in the forum, in the code I use I have the ' everywhere, I just checked again

iFlo Over a year ago

There are still some mistakes.

Moinuddin Quadri Over a year ago

@iFlo: Yes, that's what I am finding too. Every time I look at the code, I find a few more. And I am not checking for logical errors.

Code-Apprentice Over a year ago

@MoinuddinQuadri SyntaxError almost always means missing or incorrect punctuation. I commonly get this when I forget colons and closing parentheses. To track down the problem, start at the line indicated in the error message and work backwards.
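
As a hedged illustration of that advice: Python's built-in `compile()` can parse a piece of source text without running it, and the `SyntaxError` it raises carries the line number to start from. The sketch below (Python 3) feeds it the line from the question with the colon missing; `snippet` and `error_line` are names chosen for this example.

```python
# The line from the question, with the colon after the condition missing.
snippet = (
    "if movements_row[7] == LCC_row[0]\n"
    "    found = True\n"
)

try:
    # compile() only parses the text; nothing in the snippet is executed.
    compile(snippet, "<snippet>", "exec")
    error_line = None
except SyntaxError as err:
    error_line = err.lineno
    # The wording of err.msg differs between Python versions.
    print("line", err.lineno, "-", err.msg)
```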

Community · Accepted Answer · 2017-05-23 11:45:43Z

0

There are quite a few bugs in your code. They have been pointed out here: https://stackoverflow.com/a/41224147/3027854

One problem with movements.csv:

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC 
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu, 
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A, 
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu, 
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

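Apart from the header, every line ends with a trailing comma, so the reader returns one extra, empty field at the end of each row (the still-empty LCC column). My code handles this by dropping that last field before appending 'Yes' or 'No':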

import csv

f1 = open('LCC.csv', 'rU') 
f2 = open('movements.csv', 'rU')
f3 = open('output.csv', 'w') 

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

# first we will read all LCC values into a set.
LCC_row_values = set()
for LCC_row in c1:
    LCC_row_values.add(LCC_row[0].strip())

row = 0
for movements_row in c2:
    row += 1
    if row == 1:
        # movements_row.append('is_present')
        # c3.writerow(movements_row)
        # skip header of movements.csv file
        continue
    # Remove last extra column from output row
    output_row = movements_row[:-1]
    if movements_row[7] in LCC_row_values:
        output_row.append('Yes')
    else:
        output_row.append('No')
    c3.writerow(output_row)

f1.close()
f2.close()
f3.close()

Example files:

LCC.csv

Air Ab 
Jamb 
Sw 
AIRF 
EURO

movements.csv

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC 
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu, 
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A, 
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu, 
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

output.csv

Zue,LSZH,2005,200501,25,1/1/2005,Dep,EURO,EUJ,Mans C,EG,Gb,Eu,Yes
Zue,LSZH,2005,200501,204,1/1/2005,Arr,Sw,SWR,Dar,HA,Tans,A,Yes
Ba,LSZM,2005,200501,191,1/1/2005,Arr,AIRF,AFR,PG,LG,Fr,Eu,Yes
Zue,LSZH,2005,200501,228,1/1/2005,Dep,THA,THA,Bang,VD,Th,As,No
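
For readers on Python 3, where `file()` no longer exists and the csv documentation recommends opening files with `newline=''`, the same set-based idea might look like the sketch below. It is not the code from this answer: it fills column 14 in place, compares case-insensitively as the question asks, and keeps the header row (an assumption; the header can be dropped by removing one line). The names `load_lcc`, `classify`, `key_col` and `flag_col` are made up for this example.

```python
import csv


def load_lcc(path):
    """Return the names in the one-column file as a set of lower-cased strings."""
    with open(path, newline='') as f:
        # `if row` skips blank lines, for which csv.reader yields an empty list.
        return {row[0].strip().lower() for row in csv.reader(f) if row}


def classify(movements_path, lcc_path, output_path, key_col=7, flag_col=13):
    """Copy movements to output, writing 'Yes' or 'No' into column flag_col."""
    lcc = load_lcc(lcc_path)
    width = max(key_col, flag_col) + 1
    with open(movements_path, newline='') as src, \
            open(output_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)  # keep the header row unchanged
        for row in reader:
            if not row:
                continue  # skip blank lines
            row += [''] * (width - len(row))  # pad short rows so both columns exist
            key = row[key_col].strip().lower()
            row[flag_col] = 'Yes' if key in lcc else 'No'
            writer.writerow(row)
```

Calling `classify('movements.csv', 'LCC.csv', 'output.csv')` would then write the classified copy; the set lookup keeps the cost per row independent of the size of LCC.csv.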

edited May 23, 2017 at 11:45

CommunityBot

answered Dec 19, 2016 at 14:07

Vikash Singh

11 Comments

Anna Stünzi Over a year ago

Hi, thanks a lot! I adapted your code so that column 8 is compared to column 1. However, I still receive an error message: Traceback (most recent call last): File "<stdin>", line 1, in <module> _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? >>> What should I do? Also, I do not want to add a new column but fill the last one (column 14) which is currently empty. Thanks!

Vikash Singh Over a year ago

Please share top 5 lines of both CSV files. Put it in the original question. It will help me solve your problem.

Vikash Singh Over a year ago

please check now. Do you need header in output file?

Anna Stünzi Over a year ago

I still receive quite the same error message: Traceback (most recent call last): File "<stdin>", line 7, in <module> _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? I don't need a header in the output file.

Vikash Singh Over a year ago

@AnnaStünzi line 7 has some issue in the csv file. Can you add that line in the sample you have shared ..
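
In Python 2, this `_csv.Error` commonly appears when a file opened in binary mode uses bare carriage returns (`\r`) as line endings, which some spreadsheet exports produce. One way to check is to count the line-ending styles in the raw bytes. The sketch below is a minimal version of that check in Python 3 syntax; `count_line_endings` is a hypothetical helper name, not part of the csv module.

```python
def count_line_endings(path):
    """Count CRLF, bare LF, and bare CR line endings in a file."""
    with open(path, 'rb') as f:
        data = f.read()
    crlf = data.count(b'\r\n')
    return {
        'CRLF': crlf,
        # Every CRLF pair contains one \n and one \r, so subtract the pairs.
        'LF': data.count(b'\n') - crlf,
        'CR': data.count(b'\r') - crlf,
    }
```

A non-zero `'CR'` count for movements.csv or LCC.csv would suggest opening that file in universal-newline mode (`'rU'` in Python 2, as in the code above) or, on Python 3, with `newline=''`.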

Patrick Haugh · Accepted Answer · 2016-12-19 14:05:23Z

0

You're trying to do too much at the same time. Split this into different tasks. First we'll read the contents of LCC.csv into a set (we could use a list, but sets are better for determining membership). Then we will go through movements.csv to rewrite it.

import csv

with open('LCC.csv', 'rb') as lcc:
    lcc_set = set()
    lcc_r = csv.reader(lcc)
    for l in lcc_r:
        for i in l:
            lcc_set.add(i)

with open('movements.csv', 'rb') as movements:
    mov_r = csv.reader(movements)
    with open('output.csv', 'wb') as output:
        out_w = csv.writer(output)
        for l in mov_r:
            #l.pop()
            if l[7] in lcc_set:
                l.append('Yes')
            else:
                l.append('No')
            out_w.writerow(l)

I'm not sure whether you want to add a column or replace the last one. I've commented out the line (l.pop()) that would cause the last column to be replaced by Yes or No.

answered Dec 19, 2016 at 14:05

Patrick Haugh

4 Comments

Anna Stünzi Over a year ago

Hi Patrick, thanks for your help too. Using your code I get following error message: Traceback (most recent call last): File "<stdin>", line 11, in <module> AttributeError: '_csv.writer' object has no attribute 'writerowl' >>> do you see a reason why? Thank you so much!

Patrick Haugh Over a year ago

@AnnaStünzi It looks like you're missing the parentheses: writerowl -> writerow(l)

Anna Stünzi Over a year ago

sorry, yes - the code is not giving any error anymore, but there is no classification in the output.csv, all rows in column 14 have now entry 'No'..

Patrick Haugh Over a year ago

It looks like that's because it's Sw in one file and "Sw" in the other. Do if l[7].strip('"') in lcc_set instead.
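
When every row comes out as 'No', printing the keys with `repr()` usually makes the mismatch visible. Besides quotes, stray whitespace is a plausible cause here, assuming LCC.csv really contains the trailing spaces shown in the question's sample. A small sketch in Python 3 syntax; `normalize` is a hypothetical helper, not part of the answer's code:

```python
def normalize(value):
    """Strip surrounding whitespace and double quotes, then lower-case."""
    return value.strip().strip('"').strip().lower()


# LCC values as in the question's sample, with trailing spaces kept.
lcc_raw = ['Air Ab  ', 'Jamb  ', 'Sw  ', 'AIRF  ', 'EURO']
print([repr(v) for v in lcc_raw])   # repr() makes the spaces visible

lcc_set = {normalize(v) for v in lcc_raw}
for key in ['Sw', '"Sw"', 'euro', 'THA']:
    # Compare normalized keys so quotes, spaces and case no longer matter.
    print(repr(key), normalize(key) in lcc_set)
```

Applying the same normalization to both files before the membership test also covers the case-insensitive comparison requested in the question.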
