Instead of mucking around with os.getcwd() and os.listdir(), I would recommend using the (Python 3) pathlib.Path object. It supports globbing (to get all files matching a pattern), joining paths with the / operator to build a new path, and even replacing the extension with a different one.
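For example (just a sketch, reusing the folder names from the script below):

from pathlib import Path

output_folder = Path("output_results")
for pdf_file in Path("folderForPdf").glob("*.pdf"):
    # Join paths with "/" and swap the ".pdf" extension for ".csv"
    print(output_folder / pdf_file.with_suffix(".csv").name)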
When reading the keywords, you can use a simple list comprehension. Or, even better, a set comprehension to get fast (constant-time) in lookups for free.
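The only difference is the kind of brackets; a sketch, assuming keywords.txt holds one keyword per line:

with open("keywords.txt") as f:
    keywords = [line.strip() for line in f]   # list comprehension
with open("keywords.txt") as f:
    keywords = {line.strip() for line in f}   # set comprehension: "word in keywords" is now O(1)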
line.strip() and line.strip("\n") are probably doing the same thing, unless you really want to preserve the spaces at the end of words.
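A quick illustration with a made-up line:

line = "  banana  \n"
line.strip("\n")   # '  banana  '  -> only the newline is removed
line.strip()       # 'banana'      -> all surrounding whitespace is removed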
The csv.writer has a writerows method that takes an iterable of rows, so you can avoid the explicit for loop.
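Since dict.items() already yields (keyword, count) pairs, you can pass it straight to writerows; a sketch with made-up data:

import csv

keyword_counts = {"banana": 3, "apple": 1}          # hypothetical result
with open("example.csv", "w", newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(['keyword', 'keyword_count'])   # header: a single row
    writer.writerows(keyword_counts.items())        # data: all rows at once

Putting all of this together, your script could look like this: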
from collections import Counter
import csv
from pathlib import Path
import re
import textract

def extract_text(file_name):
    return textract.process(file_name, method='tesseract', language='eng',
                            encoding='utf-8').decode('utf-8')

def extract_words(text):
    return re.findall(r'([a-zA-Z]+)', text)

def count_keywords(words, keywords):
    return Counter(word for word in words if word in keywords)

def read_keywords(file_name):
    with open(file_name) as f:
        return {line.strip() for line in f}

def save_keywords(file_name, keywords):
    with open(file_name, "w", newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['keyword', 'keyword_count'])
        writer.writerows(keywords.items())

def main():
    output_folder = Path("output_results")
    output_folder.mkdir(exist_ok=True)  # make sure the output folder exists
    keywords = read_keywords('keywords.txt')
    for f in Path("folderForPdf").glob("*.pdf"):
        words = extract_words(extract_text(f))
        keyword_counts = count_keywords(words, keywords)
        # f.name drops the "folderForPdf" prefix so the CSV lands directly in output_results
        save_keywords(output_folder / f.with_suffix(".csv").name, keyword_counts)

if __name__ == "__main__":
    main()