Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

Question

My stuff: python 2.6 64 bit (with pyPdf-1.13.win32.exe installed). Wing IDE. Windows 7 64 bit.

I got the following error:

NotImplementedError: unsupported filter /LZWDecode

When I ran the following code:

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\\Users\\Homer\\Documents\\' # This is where I put my pdfs

filelist = os.listdir(path)

has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))

    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() #this is the line what done it
        does_it_have_text = re.findall(r'\w{2,}', content) 
        if does_it_have_text == []:
            does_not_have_text_list.append(pdf_name)
            print pdf_name
        else:
            has_text_list.append(pdf_name)

print does_not_have_text_list

Here's a little background. The path is full of pdfs. Some were saved from text documents using the Adobe pdf printer (at least I think that's how they did it). And some were scanned as images. I wanted to separate them and OCR the ones that are images (the non-image ones are perfect and ought not to be messed with).

I asked here a few days ago how to do that:

Batch OCR Program for PDFs

The only response I got was in VB, and I only speaky the python. So I figured I would try to write an answer to my own question. My strategy (reflected in the code above) is like this. If a page is just an image, then the regular expression will return an empty list. If it has text, the regular expression (which matches any run of 2 or more word characters) will return a list populated with stuff like u'word' (in Python 2, that's a unicode string).
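To make the regex check concrete, here is a minimal sketch of how `\w{2,}` behaves on different kinds of extracted text. The helper name `has_words` and the sample strings are made up for illustration and are not part of pyPdf; the sketch only assumes the extracted content arrives as a string.

```python
import re

# Hypothetical helper (not part of pyPdf): True if the text contains at
# least one run of two or more word characters (letters, digits, underscore).
WORD_RE = re.compile(r'\w{2,}')

def has_words(content):
    return bool(WORD_RE.findall(content))

# Made-up text such as a text-based page might yield.
sample_with_text = u'Claim 1. A method comprising'
# Whitespace-only text such as an image-only page might yield.
sample_without_text = u' \n '
```

Note that `\w` also matches digits and underscores, so a scanned page whose only extractable content is something like a stamped number would still count as having text.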

So the code should work, and we can take the first step to finish off that other thread using open source software (separating the OCR'd pdfs from the image-only ones), but I don't know how to deal with this filter error and googling wasn't helpful. So if anyone knows, that would be quite helpful.

I don't really know how to use this stuff. I'm not sure what filter means in pyPdf speak. I think it's saying that it can't really read the pdf or something, even though it's ocrd. Funnily, I put one of the non-ocrd and one of the ocrd pdfs in the same folder as a python file and this worked on just the one without the for loop, so I don't know why doing them with the for loop created the filter error. I'll post the single-file code below. THX.

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

# Open a single PDF that sits in the same folder as this script
pdf = pyPdf.PdfFileReader(open('my_ocrd_file.pdf', 'rb'))

has_text_list = []
does_not_have_text_list = []

for i in range(0, pdf.getNumPages()):
    content = pdf.getPage(i).extractText()
    # Any run of 2 or more word characters counts as text
    does_it_have_text = re.findall(r'\w{2,}', content)
    print does_it_have_text

and it prints stuff, so I don't know why I get a filter error on one and not the other. When I run this code against the other file in the directory (the one that's NOT ocrd), the output is an empty list on one line and an empty list on the next, like so:

[]
[]

So I don't guess it's a filter problem with the non-ocrd pdfs either. This is like over my head and I need some help here.

Edit:

Google search found this, but I don't know what to make of it:

http://vaitls.com/treas/pdf/pyPdf/filters.py

Renaud · Accepted Answer · 2011-07-04 13:52:49Z

2

Replace pyPdf's filters.py with http://vaitls.com/treas/pdf/pyPdf/filters.py in your pyPdf source folder. That worked for me.


1 Comment

PatentDeathSquad Over a year ago

Thx, I'll give that a shot. Actually I found that f = open('pdf', 'rb') and then searching for the word "Font" worked easier. (In case you found this while working on something similar).
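The raw-bytes shortcut from the comment above could look roughly like the sketch below. The helper names are made up for illustration. It assumes that a PDF with real text mentions "Font" somewhere in its uncompressed bytes, which is only a heuristic: a file that keeps its dictionaries inside compressed streams could slip through.

```python
def bytes_mention_font(data):
    """Heuristic only: True if the raw PDF bytes contain the word Font."""
    return b'Font' in data

def file_mentions_font(path):
    # Binary mode matters on Windows, where text mode would mangle the bytes.
    with open(path, 'rb') as f:
        return bytes_mention_font(f.read())
```

This avoids parsing the PDF at all, so it never reaches pyPdf's filter code and cannot raise the /LZWDecode error.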

lafras · Accepted Answer · 2011-05-20 09:17:54Z

1

LZW is a compression format used in GIFs and sometimes in PDFs. If you look at the filters available in pyPdf.filters you'll see that LZW is not there, hence the NotImplementedError. The link you posted is to code in a Subversion repository where someone has implemented an LZW filter.
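For a feel of what such a filter has to do, here is a rough sketch of an LZW decoder in the style PDF uses. This is not the code from the linked filters.py and not pyPdf's API. It assumes 9 to 12 bit codes packed most significant bit first, code 256 as "clear table", code 257 as "end of data", and the code width growing one code early (the PDF default, as far as I know). The function name `lzw_decode` and the `SAMPLE` constant are made up for illustration.

```python
def lzw_decode(data):
    """Decode PDF-style LZW data. A sketch under the assumptions above."""
    CLEAR, EOD = 256, 257
    data = bytearray(data)

    def fresh_table():
        # Codes 0-255 stand for the single bytes themselves.
        return dict((i, bytes(bytearray([i]))) for i in range(256))

    table = fresh_table()
    next_code = 258          # first free slot after CLEAR and EOD
    code_width = 9           # codes start out 9 bits wide
    prev = None              # previously emitted byte string
    out = bytearray()
    bit_buffer, bit_count, pos = 0, 0, 0

    while True:
        # Pull in whole bytes until a full code is available.
        while bit_count < code_width:
            if pos >= len(data):
                return bytes(out)      # ran out of input without EOD
            bit_buffer = (bit_buffer << 8) | data[pos]
            pos += 1
            bit_count += 8
        bit_count -= code_width
        code = bit_buffer >> bit_count
        bit_buffer &= (1 << bit_count) - 1

        if code == EOD:
            return bytes(out)
        if code == CLEAR:
            table, next_code, code_width, prev = fresh_table(), 258, 9, None
            continue

        if code in table:
            entry = table[code]
        elif prev is not None and code == next_code:
            # The code refers to the entry that is about to be created.
            entry = prev + prev[:1]
        else:
            raise ValueError('invalid LZW code %d' % code)

        if prev is not None:
            table[next_code] = prev + entry[:1]
            next_code += 1
            # Widen the codes one entry early, up to 12 bits.
            if next_code + 1 >= (1 << code_width) and code_width < 12:
                code_width += 1

        out.extend(entry)
        prev = entry

# Nine bytes holding the 9-bit codes 256 45 258 258 65 259 66 257.
SAMPLE = bytearray([0x80, 0x0B, 0x60, 0x50, 0x22, 0x0C, 0x0C, 0x85, 0x01])
```

Calling `lzw_decode(SAMPLE)` should give back the ten bytes `-----A---B`. A real filter also has to honor the stream's decode parameters (predictors, a non-default early-change setting), which this sketch ignores.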


3 Comments

PatentDeathSquad Over a year ago

Thx. Actually, I think that the image-only files will have no filters applied, so I can write a "try:" with an empty "except:" and append to the list of OCR'd files any file that raises any exception (I was getting another exception for unrecognized characters).
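A sketch of that try/except idea follows, with two deliberate changes that are my assumptions rather than anything from the thread: it catches only NotImplementedError instead of using a bare "except:", and it keeps failed files in their own list so the caller decides whether to treat them as OCR'd. `classify_pdfs` and `fake_extract` are hypothetical names; in real use, `extract_text` would be a function that opens the file with pyPdf and joins `extractText()` over all pages.

```python
import re

def classify_pdfs(names, extract_text):
    """Sort names into (has_text, no_text, unreadable) lists."""
    has_text, no_text, unreadable = [], [], []
    for name in names:
        try:
            content = extract_text(name)
        except NotImplementedError:
            # e.g. "unsupported filter /LZWDecode"
            unreadable.append(name)
            continue
        if re.search(r'\w{2,}', content):
            has_text.append(name)
        else:
            no_text.append(name)
    return has_text, no_text, unreadable

def fake_extract(name):
    # Stand-in for a pyPdf-based extractor, so the sketch runs without PDFs.
    if name == 'lzw.pdf':
        raise NotImplementedError('unsupported filter /LZWDecode')
    return {'text.pdf': u'hello world', 'scan.pdf': u''}[name]

result = classify_pdfs(['text.pdf', 'scan.pdf', 'lzw.pdf'], fake_extract)
```

Classifying once per file, rather than once per page, also keeps a multi-page PDF from being listed several times. The other exception mentioned in the comment would need its own except clause once its type is known.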

lafras Over a year ago

@AquaT33nFan: Sounds like a plan.

Sumit Kumar Saha Over a year ago

LZW is applied in most PDFs, not just a few.
