Strange behavior while parsing a tab-separated file in Python

Question

I am parsing a tab-separated file where the first element is a Twitter hashtag and the second element is the tweet contents.

My input file looks like:

#trumpisanabuser    of young black men . calling for the execution of the innocent !url "
#centralparkfiv of young black men . calling for the execution of the innocent !url "
#trumppence16   "
#trumppence16   "
#america2that   @user "

My code filters out duplicate content, such as retweets, by checking whether the second tab-separated element has already been seen.

import csv
import sys

tweetfile = sys.argv[1]
tweetset = set()  # tweet texts already seen
taglines = []     # kept "hashtag<TAB>tweet" lines
with open(tweetfile, "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print("hashtag: " + str(row[0]) + "\t" + "tweet: " + str(row[1]))
        row[1] = row[1].replace("\\ n", "").rstrip()
        # Skip tweets whose text has already been seen (e.g. retweets).
        if row[1] in tweetset:
            continue
        # Strip placeholders and non-alphanumerics to check for real content.
        temp = row[1].replace("!url", "")
        temp = temp.replace("@user", "")
        temp = "".join(c for c in temp if c.isalnum())
        if temp:
            taglines.append(row[0] + "\t" + row[1])
        tweetset.add(row[1])

However, the parsing behaves strangely. When I print each parsed item, the output is as follows. Can anyone explain why the parsing breaks and produces this output (hashtag: #trumppence16 tweet:, then a newline, then #trumppence16)?

hashtag: #centralparkfive   tweet: of young black men . calling for the execution of the innocent !url "
hashtag: #trumppence16  tweet: 
#trumppence16   
hashtag: #america2that  tweet: @user "

You have unterminated quotes in the file.

– e4c5, Commented Jan 3, 2017 at 7:53

Martijn Pieters · Accepted Answer · 2017-01-03 07:59:31Z

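Some of your lines have a lone " as the tweet. The CSV format lets a column be quoted by wrapping its value in " characters, and a quoted value may contain delimiters and newlines. Everything from the opening " to the next closing " is read as a single column value, which is why the second #trumppence16 line ends up inside the tweet column of the first one.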
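
To see this in isolation, here is a minimal sketch that feeds a few lines shaped like the question's input to csv.reader with its default quoting. The sample string and the variable names are assumptions made for illustration, not the asker's actual file.

```python
import csv
import io

# Assumed sample: two rows whose tweet column is a lone double quote,
# followed by a row where the quote appears mid-field.
sample = '#trumppence16\t"\n#trumppence16\t"\n#america2that\t@user "\n'

# Default dialect: a " at the start of a field opens a quoted value.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    print(row)
```

The quote that opens the second column on the first line is only closed by the quote on the second line, so those two input lines come back as a single row. The quote in the last line is left alone because it does not start the field.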

You can disable quote handling by setting the quoting option to csv.QUOTE_NONE:

reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
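
As a rough illustration of the effect, the sketch below parses a small in-memory sample with quoting=csv.QUOTE_NONE and then applies a simplified version of the duplicate check from the question. The sample data and the names sample, seen and unique_rows are assumptions for this example.

```python
import csv
import io

# Assumed sample shaped like the question's input file.
sample = '#trumppence16\t"\n#trumppence16\t"\n#america2that\t@user "\n'

# QUOTE_NONE makes the reader treat " as an ordinary character,
# so every input line yields exactly one row.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Simplified duplicate filter keyed on the tweet column.
seen = set()
unique_rows = []
for hashtag, tweet in rows:
    if tweet in seen:
        continue
    seen.add(tweet)
    unique_rows.append((hashtag, tweet))

for hashtag, tweet in unique_rows:
    print(hashtag, repr(tweet))
```

With quoting disabled, each line stays a separate row and the second #trumppence16 line is dropped as a duplicate of the first rather than being absorbed into it.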
