PDFBox text extraction - empty output

Question

I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me trouble.

I'm using PDFBox 1.8.8, with Java 7.

PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));

It just prints

File: /foo/bar/mypdf.pdf readable: true size: 1267743

Then it terminates. Usually I use the writeText method and funnel the text through a stream, but the code above is simplified for this question. I've also tried converting the PDF with pdftotext, and that works just as it does for the other files.

I get no exception, nothing at all. Any ideas?

EDIT: Additional info: created with Acrobat Distiller 9.0.0 (Windows), format PDF-1.6. The other PDFs are versions 1.4 and 1.5.

It doesn't seem to contain exotic characters. I can select and copy text in the Evince PDF viewer.

EDIT2:

Dang it. The file properties dialog (Nautilus) said "Security: No", but pdfinfo gives me:

Encrypted:      yes (print:yes copy:no change:no addNotes:no algorithm:AES)

Is there any way to work around that? After all, pdftotext could get the text out.
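As a rough sketch of how the pdfinfo line above could be checked programmatically before attempting extraction: the snippet below splits an "Encrypted:" line into its flags. The class and method names (PdfInfoEncryption, parse) are made up for illustration, and the line format is assumed from the single sample shown here, so other pdfinfo versions may print it differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or pdfinfo.
final class PdfInfoEncryption {

    /**
     * Parses a pdfinfo "Encrypted:" line into key/value pairs.
     * The yes/no value is stored under "encrypted"; each "name:value"
     * token inside the parentheses becomes its own entry.
     */
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        // Drop the "Encrypted:" label and surrounding whitespace.
        String value = line.substring(line.indexOf(':') + 1).trim();
        int open = value.indexOf('(');
        String flag = (open < 0 ? value : value.substring(0, open)).trim();
        result.put("encrypted", flag);
        if (open >= 0) {
            int close = value.lastIndexOf(')');
            String inner = value.substring(open + 1, close > open ? close : value.length());
            for (String token : inner.trim().split("\\s+")) {
                int sep = token.indexOf(':');
                if (sep > 0) {
                    result.put(token.substring(0, sep), token.substring(sep + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Encrypted:      yes (print:yes copy:no change:no addNotes:no algorithm:AES)";
        Map<String, String> info = parse(line);
        // prints "encrypted=yes copy=no"
        System.out.println("encrypted=" + info.get("encrypted") + " copy=" + info.get("copy"));
    }
}
```

A "copy" value of "no" on an encrypted file is the situation described here: the text is visible in a viewer, but the permission flags forbid extraction.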

Can you share the PDF to reproduce the issue? And regarding "Then it terminates": have you tried enclosing System.out.println(stripper.getText(document)); in try { ... } catch (Throwable t) { t.printStackTrace(); }?
– mkl, Commented Feb 10, 2015 at 13:29
PDFBox sometimes fails if the PDF contains non-Latin characters. Is that the case here?
– Surajeet Bharati, Commented Feb 10, 2015 at 13:41
@mkl I'm afraid I can't :( It's work related. It terminates because I put a System.exit(1) after the code above, but it should print something first. I did try the catch-all, but got nothing.
– Benjamin Maurer, Commented Feb 10, 2015 at 13:54
@SurajeetBharati It doesn't, as far as I've seen. But I just checked and saw that it's the only PDF in the 1.6 format; the others are 1.4 and 1.5. Does PDFBox support that? I can't find anything.
– Benjamin Maurer, Commented Feb 10, 2015 at 13:55
PDFBox 1.8.8 works with the PDF 1.6 Reference. There must be some other cause.
– Surajeet Bharati, Commented Feb 10, 2015 at 14:01

Community · Accepted Answer · 2017-05-23 10:24:20Z

The document was "encrypted" (write protected), but with no user password set. This Stack Overflow answer shows how you can remove the encryption and simply read the file: "remove encryption from pdf with pdfbox, like qpdf".

answered Feb 24, 2015 at 21:50 by Benjamin Maurer; edited May 23, 2017 at 10:24 by Community Bot
