python pdfminer converts pdf file into one chunk of string with no spaces between words

Question

I was using the following code, taken mainly from DuckPuncher's answer to the post "Extracting text from a PDF file using PDFMiner in python?", to convert PDFs to text files:

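from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    # Named "text" rather than "str" to avoid shadowing the built-in type
    text = retstr.getvalue()
    retstr.close()
    return text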

The PDFs are downloaded and stored in my local directory using the following code, which worked fine.

import requests
url = 'link_to_the_pdf'
file_name = './name.pdf'
response = requests.get(url)
with open(file_name, 'wb') as f:
    f.write(response.content)

However, for some PDFs, convert_pdf_to_txt() returned the content as almost one chunk of string with no spaces between words. For example, after downloading the PDF at http://www.ece.rochester.edu/~gsharma/papers/LocalImageRegisterEI2005.pdf and applying convert_pdf_to_txt(), I got a text file in which the words are not separated by spaces. An excerpt of the text file is:

3Predominantmethodsinthelattergrouparefromcomputervisionarea,e.g.,plane+p arallax4methodfor3-Dscenestructurecomputation.Inthispaper,weproposeanewlocalimageregistrationtechnique,intheﬁrstclass,basedonadaptiveﬁlteringtechniques.Adaptiveﬁltershavebeenutilizedsuccessfullyforsystemidentiﬁcationpurposesin1-D.

Can someone help me fix this problem please? Is it the format of this particular pdf that's causing the problem or something else, because with some other pdfs, the convert_pdf_to_txt() function is working fine.

The link you give is broken (it's not blue colored completely) and leads to a non-PDF page. Can you give the link to the PDF example you are interested in?
– pyano, Commented Mar 27, 2018 at 4:28
@pyano Yes, sorry about the broken link. I have edited the link in the post. Now it should work. Thank you for helping!
– Yue Zhao, Commented Mar 27, 2018 at 18:14

ejames · Accepted Answer · 2019-04-09 13:11:57Z

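According to this thread, some PDFs mark the entire text as a figure, and by default PDFMiner doesn't perform layout analysis on figure text. Layout analysis is the step that infers the spaces between words from character positions, so text that skips it can come out run together. To override this behavior, the all_texts parameter of LAParams needs to be set to True.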

Here is an example that works for me based on this post.

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

# Perform layout analysis for all text, including text inside figures
laparams = LAParams(all_texts=True)

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=laparams)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    # Implicitly returns None when no text was extracted
    if text:
        return text


text = extract_text_from_pdf('test.pdf')
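
If only some PDFs are affected, one option is to extract with the default settings first and retry with all_texts=True only when the result looks run together. The following is a minimal sketch rather than part of the original answer: the helper names whitespace_ratio, looks_unspaced and extract_with_fallback, and the 5% whitespace threshold, are assumptions chosen for illustration. It also assumes the pdfminer.six fork, which provides extract_text in pdfminer.high_level.

```python
def whitespace_ratio(text):
    """Return the fraction of characters in text that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_unspaced(text, min_ratio=0.05):
    """Heuristic: non-empty text with almost no whitespace is probably run together.

    min_ratio is an assumed threshold, not a value taken from PDFMiner.
    """
    return bool(text) and whitespace_ratio(text) < min_ratio


def extract_with_fallback(pdf_path):
    """Extract text, retrying with all_texts=True if the words are run together."""
    # Imported lazily so the helpers above can be used without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    text = extract_text(pdf_path)
    if looks_unspaced(text):
        # Retry with layout analysis applied to figure text as well.
        text = extract_text(pdf_path, laparams=LAParams(all_texts=True))
    return text
```

With this in place, extract_with_fallback('test.pdf') would only pay for the second pass on PDFs whose first pass comes back without spaces. The threshold may need tuning for documents that are mostly tables or numbers.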
