3

I'm working with Twitter hashtags and have already counted how many times each one appears. My CSV file looks like this:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

Now I would like to merge terms that are close to each other, such as "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity score between two terms is greater than 80, their counts should be added together under one of the two terms and the other term deleted. This would give:

GilletsJaunes, 120
Macron, 50
tax, 10

To use fuzzywuzzy:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output
    What have you tried so far? Please show your attempt so that we can help you correct it. Commented Mar 7, 2019 at 22:28

2 Answers

2

First, copy these two functions to be able to compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))

Second, load the content of your CSV into a Python dictionary and proceed as follows:

from fuzzywuzzy import fuzz

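# hashtag -> count, as loaded from the CSV
# (named counts rather than input to avoid shadowing the built-in input())
counts = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

# tags scoring above this value are merged (the question asks for 80)
threshold = 50

output = dict()
for query in counts:
    references = list(output.keys()) # important: this is output.keys(), not counts.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # add the count to the closest tag that was already kept
        best_reference = references[argmax_index(scores)]
        output[best_reference] += counts[query]
    else:
        # no close match yet: keep this tag as a new reference
        output[query] = counts[query]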

print(output)

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
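
The dictionary above is typed in by hand. A minimal sketch of loading it from the CSV instead, assuming each row has the form `tag, count` and there is no header row. The names `load_counts` and `SAMPLE_CSV` are hypothetical, and the in-memory string only stands in for the real file:

```python
import csv
import io

# Stand-in for the real file (assumption); with a file on disk,
# pass open("file.csv", newline="") to load_counts instead.
SAMPLE_CSV = "GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n"


def load_counts(handle):
    """Read 'tag, count' rows into a dict mapping tag -> int count."""
    counts = {}
    # skipinitialspace drops the space that follows each comma
    reader = csv.reader(handle, skipinitialspace=True)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        tag, count = row[0].strip(), int(row[1])
        # if the same tag appears twice, sum its counts
        counts[tag] = counts.get(tag, 0) + count
    return counts


counts = load_counts(io.StringIO(SAMPLE_CSV))
print(counts)
```

The resulting dictionary can be used in place of the hand-written one in the loop above.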

0

This should solve your problem. You can also reduce your input sample by first converting your tags to lowercase: fuzz.ratio compares strings case-sensitively, so "HeLlO", "hello" and "HELLO" are scored as different strings even though they represent the same word.

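import csv
from fuzzywuzzy import fuzz

data = dict()
output = dict()

with open('file.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        # csv yields strings, so convert the count before summing
        data[row[0]] = int(row[1])

merged = set()  # tags already folded into an earlier tag
for tag in data:
    if tag in merged:
        continue
    output[tag] = 0
    for key in data:
        # a tag always matches itself (score 100), so its own count is included
        if key not in merged and fuzz.ratio(tag, key) > 80:
            output[tag] += data[key]
            merged.add(key)

print(output)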