I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me trouble.

I'm using PDFBox 1.8.8, with Java 7.

PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));

It just prints

File: /foo/bar/mypdf.pdf readable: true size: 1267743

Then it terminates. Usually I use the writeText method and write the text to a stream, but I used the code above to keep things simple. I've tried converting the PDF with pdftotext, and it works just as it does for the others.

I get no exception, no nothing. Any ideas?

EDIT: Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5

Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer

EDIT2:

Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:

Encrypted:      yes (print:yes copy:no change:no addNotes:no algorithm:AES)

Is there any way to circumvent that? After all, pdftotext could get the text out.

  • Can you share the PDF to reproduce the issue? Also, regarding "Then it terminates": have you tried enclosing System.out.println(stripper.getText(document)); in try { ... } catch (Throwable t) { t.printStackTrace(); }? Commented Feb 10, 2015 at 13:29
  • PDFBox sometimes fails if the PDF contains non-Latin characters. Is that the case? Commented Feb 10, 2015 at 13:41
  • @mkl I'm afraid I can't :( It's work-related. It terminates because I put a System.exit(1) after the code above, but it should print something first. I did try the catch-all, but got nothing. Commented Feb 10, 2015 at 13:54
  • @SurajeetBharati It doesn't, as far as I've seen. But I just checked and saw that it's the only PDF in the 1.6 format; the others are 1.4 and 1.5. Does PDFBox support that? I can't find anything. Commented Feb 10, 2015 at 13:55
  • PDFBox 1.8.8 works with PDF 1.6. There must be some other cause. Commented Feb 10, 2015 at 14:01

1 Answer

The document was "encrypted" (write-protected), but with no user password set. This Stack Overflow answer shows how to remove the encryption and simply read the file: "remove encryption from pdf with pdfbox, like qpdf".
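
Since neither PDFBox nor the Nautilus dialog gave any hint here, it may help to flag such files before processing a whole batch. Below is a rough sketch that uses only the JDK (Java 7 compatible) and does not touch PDFBox at all. It assumes the /Encrypt entry shows up as plain text in the trailer or cross-reference stream dictionary, which is the usual layout. The names EncryptionCheck, containsEncryptEntry and looksEncrypted are made up for illustration. It is a heuristic, not a PDF parser: a file that merely contains the bytes "/Encrypt" somewhere else would be flagged too, and pdfinfo remains the more reliable check.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: guesses whether a PDF is encrypted
// by looking for the "/Encrypt" name in the raw file bytes.
class EncryptionCheck {

    // Returns true if the raw bytes contain the name token "/Encrypt".
    // ISO-8859-1 maps every byte to exactly one char, so binary data
    // in the file cannot break the decoding.
    static boolean containsEncryptEntry(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    // Reads the whole file into memory; fine for files of a few MB.
    static boolean looksEncrypted(Path pdf) throws IOException {
        return containsEncryptEntry(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EncryptionCheck file.pdf [more.pdf ...]");
            return;
        }
        for (String arg : args) {
            System.out.println(arg + " looks encrypted: " + looksEncrypted(Paths.get(arg)));
        }
    }
}
```

Files flagged this way can then be opened with an empty user password, as described in the linked answer.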