Group strings with values in Python

Question

I'm working on Twitter hashtags and I've already counted the number of times each one appears in my CSV file. My CSV file looks like:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

Now I would like to group together two terms that are close, such as "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity score between the two terms is greater than 80, their values should be added together under one of the two terms and the other term deleted. This would give:

GilletsJaunes, 120
Macron, 50
tax, 10

To use fuzzywuzzy:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output

What have you tried so far? Please show your attempt so that we can help you correct it.
– Soviut, commented Mar 7, 2019 at 22:28

Wok · Accepted Answer · 2019-04-27 10:58:08Z

First, copy these two functions to be able to compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
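
As a quick check of what the two helpers return, here is a short usage sketch. The helpers are restated so the snippet runs on its own, and the sample scores are made-up values for illustration, not output from fuzzywuzzy.

```python
# Restated from the answer so this snippet runs on its own
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


def argmax_index(values):
    return argmax(enumerate(values))


# Illustrative scores for three candidate tags (made-up values)
scores = [35, 80, 12]
best = argmax_index(scores)
print(best)  # index of the highest score

by_tag = argmax([("Macron", 35), ("GilletsJaunes", 80), ("tax", 12)])
print(by_tag)  # key paired with the highest value
```

When several values tie for the maximum, max() returns the first one it encounters, so the earliest index wins.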

Second, load the content of your CSV into a Python dictionary and proceed as follows:

from fuzzywuzzy import fuzz

# tag -> count, as loaded from the CSV
# (named `counts` to avoid shadowing the built-in input())
counts = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50  # note: lower than the 80 mentioned in the question

output = dict()
for query in counts:
    references = list(output.keys()) # important: this is output.keys(), not counts.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # merge into the most similar tag that is already in the output
        best_reference = references[argmax_index(scores)]
        output[best_reference] += counts[query]
    else:
        output[query] = counts[query]

print(output)

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
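
The dictionary above is hard-coded. A minimal sketch of building it from the CSV could look like the following, assuming two-column rows shaped like "tag, count" with no header line. The function name load_counts and the in-memory sample are hypothetical, chosen only for illustration.

```python
import csv
import io


def load_counts(lines):
    """Build a {tag: count} dict from rows shaped like 'tag, count'."""
    counts = {}
    # skipinitialspace drops the blank that follows each comma
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        tag, value = row[0].strip(), int(row[1])
        # if the same tag appears twice, add the counts together
        counts[tag] = counts.get(tag, 0) + value
    return counts


# In-memory stand-in for the CSV file from the question
sample = io.StringIO("GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n")
counts = load_counts(sample)
print(counts)
```

With a real file, the same function can be called on an open file object, for example inside `with open('file.csv', newline='') as f:`.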

ap288 · Accepted Answer · 2019-03-07 22:51:20Z

This solves your problem. You can also reduce your input sample by first converting your tags to lowercase: "HeLlO", "hello", and "HELLO" represent the same word, and fuzz.ratio compares strings case-sensitively, so differently cased variants are not guaranteed to score above 80 unless you lowercase them first.

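import csv
from fuzzywuzzy import fuzz

data = dict()
output = dict()

# Read "tag, count" rows; counts must be converted from str to int
with open('file.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile, skipinitialspace=True)
    for row in csvReader:
        data[row[0]] = int(row[1])

merged = set()  # tags already folded into an earlier tag
for tag in data:
    if tag in merged:
        continue
    output[tag] = 0
    for key in data:
        # compare in lowercase so case differences do not lower the score
        if key not in merged and fuzz.ratio(tag.lower(), key.lower()) > 80:
            output[tag] += data[key]
            merged.add(key)

print(output)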
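
To see how much lowercasing matters for the tags in the question, here is a small sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, so the snippet runs without fuzzywuzzy installed; the helper name similarity is hypothetical, and the scores should be treated as an approximation of what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Same pair of tags, compared as-is and then in lowercase
raw = similarity("GilletsJaunes", "gilletsjaune")
lowered = similarity("GilletsJaunes".lower(), "gilletsjaune".lower())
print(raw, lowered)
```

Under this stand-in, the case-sensitive comparison sits right at 80 and so fails a strict `> 80` test, while the lowercased comparison passes it comfortably.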

