"Spell check" and return the corrected term in Python

Question

I recently extracted text data from a directory of PDF files. When reading PDFs, the text returned is sometimes a little messy.

For example, I might get a string like:

"T he administrati on is doing bad things, and not fulfilling what it prom ised"

I want the result to be:

"The administration is doing bad things, and not fulfilling what it promised"

I tested code (using pyenchant and wx) that I found on Stack Overflow here, and it did not return what I wanted. My modified version is as follows:

import enchant.checker

a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
    # Replace each flagged word with the checker's first suggestion.
    sug = err.suggest()[0]
    err.replace(sug)

c = chkr.get_text()  # returns corrected text
print(c)

This code returns:

"T he administrate on is doing bad things, and not fulfilling what it prom side"

I'm using Python 3.5.x on Windows 7 Enterprise, 64-bit. I would appreciate any suggestions!

What is the question?

– wwii, Commented Dec 9, 2017 at 16:22

SKN · Accepted Answer · 2017-12-19 03:17:13Z

2

I took Generic Human’s answer and modified it slightly to solve your problem.

You need to copy these 125k words, sorted by frequency, into a text file named words-by-frequency.txt.

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = [line.strip() for line in f]
# The word at 1-based rank i+1 gets cost log((i+1) * log(N)), where N = len(words).
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
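The cost dictionary above assumes that the word at rank i (counting from 1) in a list of N words has probability of roughly 1 / (i * log N), so its cost is log(i * log N). As a minimal sketch, with a made-up five-word list that is not taken from the 125k-word file, this shows how the cost grows with rank:

```python
from math import log, isclose

# Hypothetical mini word list, most frequent first.
words = ["the", "of", "and", "to", "a"]
n = len(words)

# Same formula as in the answer: cost = log(rank * log(N)).
wordcost = {w: log((i + 1) * log(n)) for i, w in enumerate(words)}

for w in words:
    print(f"{w:>4}  cost={wordcost[w]:.3f}")

# A segmentation's total cost is the sum of its word costs,
# which is why splits into frequent words are preferred.
total = wordcost["the"] + wordcost["of"]
```

Lower-ranked (rarer) words cost more, so the dynamic program favours splits made of common words.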

Running the function with the input:

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())


Output:

The administration is doing bad things and not fulfilling what it promised
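Note that the call above strips the comma before segmenting, so the punctuation is lost. One possible way to keep it, sketched under assumptions: split the text at punctuation marks and re-segment each run separately. The `fix_spacing` helper and the short `WORDS` list are hypothetical stand-ins (the list replaces words-by-frequency.txt so that the example is self-contained).

```python
import re
from math import log

# Hypothetical stand-in for words-by-frequency.txt, most frequent first.
WORDS = ["the", "and", "is", "it", "not", "what", "bad", "things",
         "doing", "administration", "promised", "fulfilling"]
WORDCOST = {w: log((i + 1) * log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXWORD = max(len(w) for w in WORDS)

def infer_spaces(s):
    """Same dynamic program as in the answer, in compact form."""
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - MAXWORD):i]))
        return min((c + WORDCOST.get(s[i - k - 1:i], float("inf")), k + 1)
                   for k, c in candidates)
    cost = [0]
    for i in range(1, len(s) + 1):
        cost.append(best_match(i)[0])
    out, i = [], len(s)
    while i > 0:
        _, k = best_match(i)
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))

def fix_spacing(text):
    """Re-segment each run between punctuation marks, keeping the marks."""
    parts = re.split(r"([,.;:!?])", text)
    fixed = []
    for part in parts:
        if re.fullmatch(r"[,.;:!?]", part):
            fixed.append(part)  # punctuation is passed through untouched
        elif part.strip():
            fixed.append(" " + infer_spaces(part.lower().replace(" ", "")))
    result = "".join(fixed).strip()
    return result[:1].upper() + result[1:]

messy = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(fix_spacing(messy))
```

With this word list the comma survives and the rest of the sentence is segmented as before. Digits, quotes, and other characters outside the word list would still need their own handling.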

Edit: The code below doesn't require the text file. It hard-codes a word list that covers only your input, i.e., "T he administrati on is doing bad things, and not fulfilling what it prom ised"

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())

Output:

The administration is doing bad things and not fulfilling what it promised

I have just tried the above edit at repl.it and it printed the output as shown.

edited Dec 19, 2017 at 3:17

answered Dec 9, 2017 at 17:50


8 Comments

Alison LT Over a year ago

I copied and pasted your code and my output doesn't match yours: T h e a d m i n i s t r a t i o n i s d o i n g b a d t h i n g s , a n d n o t f u l f i l l i n g w h a t i t p r o m i s e d

SKN Over a year ago

Have you copied the text file?

SKN Over a year ago

If you have not yet, copy these 125k words, sorted by frequency into a text file and put the file in the same folder as your Python file.

SKN Over a year ago

You can also download the text file from My Drive. My output is still the same: The administration is doing bad things and not fulfilling what it promised

SKN Over a year ago

Please see the edit, which hopefully shows that the original code works for inputs similar to yours.
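A possible explanation for the letter-by-letter output, offered as a guess because the comment does not say how the function was called: unknown substrings get an infinite cost (9e999 overflows to inf), so once the input contains a character that no dictionary word covers, such as an uppercase "T" or a comma with an all-lowercase word list, every prefix that includes it costs infinity. min() then breaks the tie on match length, and everything from that character onward is split into single characters. The `uncovered_chars` helper below is hypothetical and only illustrates a pre-check:

```python
INF = float("inf")

# With every candidate at infinite cost, min() falls back to comparing
# the second tuple element, so the shortest match (length 1) wins.
print(min((INF, 3), (INF, 1), (INF, 2)))

def uncovered_chars(s, words):
    """Return the characters of s that appear in no dictionary word."""
    known = set("".join(words))
    return sorted(set(s) - known)

words = ["the", "administration", "is", "promised"]  # hypothetical mini list
raw = "The administration, is promised"
print(uncovered_chars(raw, words))

# Lowercasing and stripping spaces and commas, as the answer's call does,
# leaves nothing uncovered.
cleaned = raw.lower().replace(" ", "").replace(",", "")
print(uncovered_chars(cleaned, words))
```

An empty result does not guarantee a good segmentation, but a non-empty one means the input still needs cleaning before it is passed to infer_spaces.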


Phil · Accepted Answer · 2017-12-09 16:21:28Z

1

It looks like the enchant library you're using isn't suited to this problem. It doesn't look for mistakes that span several words; it checks each word on its own, so "administrati" and "on" are corrected separately and never rejoined. I guess this makes sense, since the class itself is called 'SpellChecker'.

The only thing I can think of is to look for a better autocorrect library. Maybe this one will help: https://github.com/phatpiglet/autocorrect

No guarantees though.
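One way to look across word boundaries, sketched with the standard library only: join adjacent tokens when the concatenation is a known word and at least one of the pieces is not. The `KNOWN` set is a hypothetical stand-in for a real dictionary, and the greedy rule is an assumption of this sketch, not something enchant or autocorrect provides.

```python
# Hypothetical stand-in for a real dictionary.
KNOWN = {"the", "administration", "is", "doing", "bad", "things", "and",
         "not", "fulfilling", "what", "it", "promised", "on", "he"}

def norm(token):
    """Lowercase a token and drop surrounding punctuation for lookup."""
    return token.strip(",.;:!?").lower()

def merge_fragments(text, known=KNOWN, max_parts=3):
    """Greedily join up to max_parts adjacent tokens that form a known word."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        merged = False
        # Try the longest join first.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            pieces = tokens[i:i + n]
            candidate = "".join(pieces)
            # Only merge if some piece is not already a valid word,
            # which avoids gluing two good words together.
            if norm(candidate) in known and any(norm(p) not in known for p in pieces):
                out.append(candidate)
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

messy = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(merge_fragments(messy))
```

Unlike the approach that removes all spaces, this keeps the original punctuation and capitalisation, but it only repairs words that were split; it cannot separate words that were run together.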

answered Dec 9, 2017 at 16:21
