I am parsing a tab-separated file in which the first field is a Twitter hashtag and the second field is the tweet's contents.
My input file looks like this:
#trumpisanabuser of young black men . calling for the execution of the innocent !url "
#centralparkfiv of young black men . calling for the execution of the innocent !url "
#trumppence16 "
#trumppence16 "
#america2that @user "
My code filters out duplicate content, such as retweets, by checking whether the second tab-separated field has already been seen:
import sys
import csv

tweetfile = sys.argv[1]
tweetset = set()   # tweet texts already seen
taglines = []      # unique "hashtag<TAB>tweet" lines

with open(tweetfile, "rt") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print("hashtag: " + str(row[0]) + "\t" + "tweet: " + str(row[1]))
        row[1] = row[1].replace("\\ n", "").rstrip()
        # Skip tweets whose text has already been seen (e.g. retweets).
        if row[1] in tweetset:
            continue
        # Drop placeholders and non-alphanumerics to check for real content.
        temp = row[1].replace("!url", "")
        temp = temp.replace("@user", "")
        temp = "".join([c if c.isalnum() else "" for c in temp])
        if temp:
            taglines.append(row[0] + "\t" + row[1])
            tweetset.add(row[1])
However, the parsing behaves strangely. When I print each parsed row, I get the output below. Can anyone explain why the parsing breaks and produces the row that prints as "hashtag: #trumppence16 tweet:", followed by a newline and then "#trumppence16"?
hashtag: #centralparkfive tweet: of young black men . calling for the execution of the innocent !url "
hashtag: #trumppence16 tweet:
#trumppence16
hashtag: #america2that tweet: @user "
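A minimal reproduction of what I suspect is happening, assuming the tweet field in the two #trumppence16 rows is a lone double quote: by default csv.reader treats a `"` at the start of a field as the opening of a quoted field, so it keeps reading across the newline until the next `"`. The `sample` string and the `parse` helper below are hypothetical stand-ins for the real file, and `quoting=csv.QUOTE_NONE` is shown only as one possible way to make the reader treat quotes as ordinary characters.

```python
import csv
import io

# Hypothetical sample: two rows whose tweet field is a lone double quote,
# followed by a row where the quote appears at the end of the field.
sample = '#trumppence16\t"\n#trumppence16\t"\n#america2that\t@user "\n'


def parse(text, **kwargs):
    """Parse tab-separated text with csv.reader and return all rows."""
    # newline="" leaves line endings untouched so the csv module sees them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t", **kwargs))


# Default dialect: the leading '"' opens a quoted field that swallows the
# newline and the next line up to its closing '"', merging two rows into one.
default_rows = parse(sample)

# QUOTE_NONE: quote characters get no special treatment, one row per line.
literal_rows = parse(sample, quoting=csv.QUOTE_NONE)

for row in default_rows:
    print("default:   ", row)
for row in literal_rows:
    print("QUOTE_NONE:", row)
```

With the default settings the first two lines collapse into a single row whose second field starts with a newline, which would match the odd output above. A quote in the middle or at the end of a field, as in the `@user "` row, is left alone in both cases.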