
I have a very large dictionary that stores English sentences and their Spanish translations. Given a random English sentence, I intend to use Python's fuzzywuzzy library to find its closest match in the dictionary. My code:

from fuzzywuzzy import process

sentencePairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}
query = 'How old are you?'
match = process.extractOne(query, sentencePairs.keys())[0]
print(match, sentencePairs[match], sep='\n')

In a real-life scenario, the sentencePairs dictionary would be very large, holding at least a million items, so getting a result with fuzzywuzzy takes a long time even with python-Levenshtein installed for speedup. Is there a better way to achieve good performance? My goal is to get the result in under a few seconds, or even in real time.

1 Comment

Note, you aren't really using a dict here as it is intended; you might as well use a list of tuples, or two separate lists. Commented Sep 14, 2020 at 15:36

2 Answers


Ways to improve the performance

Fuzzy matching using the Levenshtein distance will never be super fast, but there are a couple of things in your code you can optimise:

  1. When you pass a string and a list to process.extractOne, it preprocesses those strings by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're reusing the same English:Spanish mapping each time, you should do this preprocessing once ahead of time (see the sketch after this list).

  2. Even when using python-Levenshtein, FuzzyWuzzy is not really optimised in a lot of places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface, but is mostly implemented in C++ and comes with some additional algorithmic improvements that make it a lot faster.

  3. Internally, process.extractOne uses fuzz.WRatio to compare the strings by default, which is a combination of multiple string-matching algorithms. Selecting a faster algorithm by passing e.g. scorer=fuzz.ratio to process.extractOne therefore improves performance. Keep in mind that this changes the way your strings are compared, so depending on your data you might not want to do this.
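
For point 1, here is a minimal sketch of the one-time preprocessing; the preprocessedPairs name is purely illustrative:

from rapidfuzz import utils

sentencePairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}

# Normalise every English key once, up front, instead of on every query.
# default_process lowercases, removes non-alphanumeric characters and
# trims whitespace, the same preprocessing extractOne would otherwise
# repeat for each call.
preprocessedPairs = {
    utils.default_process(english): spanish
    for english, spanish in sentencePairs.items()
}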

Implementation making use of 1 and 2

from rapidfuzz import process, utils

# the English sentences are already lowercased
# and stripped of special characters like question marks
sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
query = 'How old are you?'
# preprocess the query ourselves and disable extractOne's own
# preprocessing, since the keys are already normalised
match = process.extractOne(
    utils.default_process(query),
    sentencePairs.keys(),
    processor=None)[0]
print(match, sentencePairs[match], sep='\n')

Implementation making use of 1, 2 and 3

from rapidfuzz import process, utils, fuzz

# the English sentences are already lowercased
# and stripped of special characters like question marks
sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
query = 'How old are you?'
# fuzz.ratio is a single, faster scorer than the default fuzz.WRatio
match = process.extractOne(
    utils.default_process(query),
    sentencePairs.keys(),
    processor=None,
    scorer=fuzz.ratio)[0]
print(match, sentencePairs[match], sep='\n')

Benchmarks

To provide some time comparisons, I generated a million sentences:

import string
import random

random.seed(18)
# a million random 15-character "sentences" as keys
sentencePairs = {
    ''.join(
        random.choice(string.ascii_lowercase + string.digits)
        for _ in range(15)
    ): "spanish text"
    for _ in range(1000000)
}
query = 'How old are you?'

The following table shows how long the different solutions take on my computer:

| Implementation                           | Runtime        |
|------------------------------------------|----------------|
| Your current implementation              | 18.98 seconds  |
| Implementation making use of 1 and 2     | 1.4 seconds    |
| Implementation making use of 1, 2 and 3  | 0.4 seconds    |

4 Comments

I don't find fuzz.ratio or fuzz.WRatio in your code. Do you mean using them to loop over all sentencePairs keys instead of process.extractOne(), and that it takes 0.4 seconds when using fuzz.ratio?
process.extractOne supports a scorer parameter to change the string-matching algorithm it uses. I updated the answer to make this clearer.
Regarding utils.default_process and the processor parameter: can I just use match, _ = process.extractOne(query, sentencePairs.keys(), scorer=fuzz.ratio) instead of match, _ = process.extractOne(utils.default_process(query), sentencePairs.keys(), processor=None, scorer=fuzz.ratio)? There seems to be no difference between these two lines of code, and the execution times are basically the same in my testing. I also checked github.com/maxbachmann/rapidfuzz/blob/master/docs/usage/…, and still can't find any difference.
utils.default_process preprocesses your string (lowercases it, removes non-alphanumeric characters such as question marks, and trims whitespace). So when you do not need/want this you can just leave it out (e.g. the string " How old are you?" is transformed to "how old are you").

There could be a better solution, but off the top of my head I can think of partitioning.

You can create 26 different dictionaries, one for each letter of the English alphabet, and load each one with the keys that start with the corresponding letter: adict, bdict, ..., zdict. So hdict would contain the key-value pairs whose keys start with h, such as the key "how are you?".

In this way, you only need to query the dictionary that matches the query's first letter.
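
A rough sketch of this idea; the buckets mapping and the lookup helper are illustrative names, and the per-bucket search borrows RapidFuzz from the other answer:

from collections import defaultdict
from rapidfuzz import process, fuzz

sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}

# Partition the pairs by the first character of the English key.
buckets = defaultdict(dict)
for english, spanish in sentencePairs.items():
    buckets[english[0]][english] = spanish

def lookup(query):
    q = query.lower()
    # Only search the bucket whose first letter matches the query's.
    bucket = buckets.get(q[0])
    if not bucket:
        return None
    match = process.extractOne(q, bucket.keys(), scorer=fuzz.ratio)
    return (match[0], bucket[match[0]]) if match else None

print(lookup('How old are you?'))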

1 Comment

With this solution, you lose the "fuzzy" search on the first character. It's still a valid option, though.
