Parsing PDF file using Apache PDFBox

Question

I am trying to modify the contents of a PDF document using PDFBox. I used this example as it is, but observed that the text in my PDF file is split at the character level (or worse). For example, the string "EM? what it is:" is split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(This is what I get when printing each COSString in the above-mentioned code.) As far as I can see, the file contains only Latin characters, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

Joel Westberg · Accepted Answer · 2013-04-01 11:31:14Z

This is most likely an issue with how the PDF itself is formatted. Your particular PDF stores the text in separate fragments like this in order to get correct letter spacing or kerning. How the text is split varies greatly from PDF to PDF, depending on how the file was created.

Typically, I would suggest simply merging all the different tokens into one big content string.
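
A minimal sketch of that idea, under some loud assumptions: the fragments are modelled as plain Java strings standing in for the COSString values printed in the question, and the names FragmentMerge, merge and replaceFirst are hypothetical, not part of PDFBox. It joins the fragments, finds the target in the merged text, and writes the replacement back into the first fragment the match touches while blanking the rest of the match. Reading the tokens from the content stream and writing the new values back to the PDF is not shown, and any kerning adjustments between the replaced fragments would be lost.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentMerge {

    /** Concatenates the fragments, in order, into one searchable string. */
    static String merge(List<String> fragments) {
        return String.join("", fragments);
    }

    /**
     * Replaces the first occurrence of target in the merged text and writes
     * the result back into the fragment list (the list is modified in place).
     * Returns true if a replacement was made.
     */
    static boolean replaceFirst(List<String> fragments, String target, String replacement) {
        if (target.isEmpty()) {
            return false;
        }
        int start = merge(fragments).indexOf(target);
        if (start < 0) {
            return false;
        }
        int end = start + target.length();

        int offset = 0;          // index in the merged text where the current fragment begins
        boolean written = false; // the replacement is written only once
        for (int i = 0; i < fragments.size(); i++) {
            String fragment = fragments.get(i);
            int fragStart = offset;
            int fragEnd = offset + fragment.length();
            offset = fragEnd;
            if (fragEnd <= start || fragStart >= end) {
                continue; // this fragment does not overlap the match
            }
            // Keep whatever part of the fragment lies outside the match.
            String before = fragment.substring(0, Math.max(0, start - fragStart));
            String after = fragment.substring(Math.min(fragment.length(), end - fragStart));
            String middle = written ? "" : replacement;
            written = true;
            fragments.set(i, before + middle + after);
        }
        return true;
    }

    public static void main(String[] args) {
        // Fragments as printed in the question (assumed to be plain text).
        List<String> fragments =
                new ArrayList<>(List.of("E", "M?", " ", "w", "hat ", "it ", "is", ":"));
        System.out.println(merge(fragments));
        replaceFirst(fragments, "what it", "how it");
        System.out.println(fragments);
        System.out.println(merge(fragments));
    }
}
```

Because the fragment list is updated in place, several replacements can be applied one after another by calling replaceFirst repeatedly on the same list; each call re-merges the current fragments, which is simple but not the cheapest approach for large pages.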

3 Comments

Salil Over a year ago

Yes, I know that I can technically do that, but think about the headache: first create the merged string, then search in it, then go to each match position and replace it... Since I have multiple strings to replace, this adds unnecessary overhead. I tried 2-3 different files and got the same problem. I wonder how the example code works for others.

Joel Westberg Over a year ago

I would think there's code out there somewhere that has already solved this problem. I've never used PDFBox for modifying, personally, so I'm not sure.

Salil Over a year ago

I see. Thank you for your help though. It is technically correct. +1 for that. I am not marking it as 'accepted' as of now.
