I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me trouble.
I'm using PDFBox 1.8.8, with Java 7.
PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));
It just prints
File: /foo/bar/mypdf.pdf readable: true size: 1267743
Then it terminates. Usually I use the writeText method and funnel the text through a stream, but the code above is simplified for this question. I've tried converting the PDF with pdftotext, and it works just like the others.
I get no exception, no nothing. Any ideas?
EDIT: Additional info: created with Acrobat Distiller 9.0.0 (Windows), format PDF-1.6. The other PDFs are versions 1.4 and 1.5.
It doesn't seem to contain exotic characters. I can select and copy text in the Evince PDF viewer.
EDIT2:
Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Any way to circumvent that? After all, pdftotext could get the text out.
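Since the file manager dialog and pdfinfo disagreed, here is a rough way to cross-check the encryption status from Java itself, using only the standard library. It is a heuristic sketch, not PDFBox's own check: it assumes that an encrypted PDF carries an /Encrypt key in its trailer (or cross-reference stream) dictionary and simply searches the raw bytes for that token. The class and method names (EncryptCheck, looksEncrypted) are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class EncryptCheck {

    // Heuristic only: an encrypted PDF points to its encryption dictionary
    // via an /Encrypt key. A plain byte search can report a false positive
    // if the same token happens to occur elsewhere in the file.
    static boolean looksEncrypted(File pdf) throws IOException {
        byte[] bytes = Files.readAllBytes(pdf.toPath());
        // ISO-8859-1 maps every byte to exactly one char, so no data is lost.
        String raw = new String(bytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EncryptCheck <file.pdf>");
            return;
        }
        File pdf = new File(args[0]);
        System.out.println(pdf + " looks encrypted: " + looksEncrypted(pdf));
    }
}
```

This reads the whole file into memory, which should be fine for a file of about 1.2 MB like the one above, but may not suit very large PDFs.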
What happens if you wrap System.out.println(stripper.getText(document)); in try { ... } catch (Throwable t) { t.printStackTrace(); }?
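To illustrate why catching Throwable rather than Exception can matter here: if the failure is an Error (for example a NoClassDefFoundError from a missing optional dependency), a catch (Exception e) block never sees it. The sketch below uses no PDFBox at all. extractText() is a hypothetical stand-in for stripper.getText(document), and the Error it throws is only an assumption about what might be going wrong, not a diagnosis.

```java
public class SilentFailureDemo {

    // Hypothetical stand-in for stripper.getText(document). It fails with an
    // Error (not an Exception), as can happen when a class is missing at runtime.
    static String extractText() {
        throw new NoClassDefFoundError("some/optional/Dependency");
    }

    // Returns a short description of which handler saw the failure.
    static String run() {
        try {
            try {
                return extractText();
            } catch (Exception e) {
                // Not reached: NoClassDefFoundError is not an Exception.
                return "caught as Exception: " + e;
            }
        } catch (Throwable t) {
            // Reached: Throwable is the common supertype of Exception and Error.
            return "caught as Throwable: " + t;
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

If the real code runs inside a framework or a thread that swallows errors, a handler like the outer one, combined with t.printStackTrace(), should at least make the cause visible.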