2

I recently extracted text data from a directory of pdf files. When reading pdfs, sometimes the text returned is a little messy.

For example, I can be looking at a string that says:

"T he administrati on is doing bad things, and not fulfilling what it prom ised"

I want the result to be:

"The administration is doing bad things, and not fulfilling what it promised"

I tested code (using pyenchant and wx) I found on stackoverflow here and it did not return what I wanted. My modifications were as follows:

a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
    sug = err.suggest()[0]
    err.replace(sug)

c = chkr.get_text()#returns corrected text
print(c)

This code returns:

"T he administrate on is doing bad things, and not fulfilling what it prom side"

I'm using Python 3.5.x on a Windows 7 Enterprise, 64-bit. I would appreciate any suggestions!

1
  • What is the question? Commented Dec 9, 2017 at 16:22

2 Answers 2

2

I have taken Generic Human’s answer, slightly modified it to solve your problem.

You need to copy these 125k words, sorted by frequency into a text file, name the file words-by-frequency.txt.

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = [line.strip() for line in f.readlines()]
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

Running the function with the input:

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())


The administration is doing bad things and not fulfilling what it promised
>>> 

Edit: The code below doesn't require the text file and works just for your input i.e., "T he administrati on is doing bad things, and not fulfilling what it prom ised"

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())

The administration is doing bad things and not fulfilling what it promised
>>> 

I have just tried the above edit at repl.it and it printed the output as shown.

Sign up to request clarification or add additional context in comments.

8 Comments

I copied and pasted your code and the output doesn't match what your output is: T h e a d m i n i s t r a t i o n i s d o i n g b a d t h i n g s , a n d n o t f u l f i l l i n g w h a t i t p r o m i s e d
Have you copied the text file?
If you have not yet, copy these 125k words, sorted by frequency into a text file and put the file in the same folder as your Python file.
You can also download the text file from My Drive. My output is still the same: The administration is doing bad things and not fulfilling what it promised
Please see the edit, which hopefully shows that the original code works for inputs similar to yours.
|
1

It looks like the enchant library you're using just isn't that good. It doesn't look for spelling mistakes across words, but instead just looks at words individually. I guess this makes sense since the function itself is called 'SpellChecker'.

The only thing I can think of is to look for better autocorrect libraries. Maybe this one might help? https://github.com/phatpiglet/autocorrect

No guarantees though.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.