I am trying to modify the contents of a PDF document using PDFBox. I used this example as it is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string "EM? what it is:" gets split into:

COSString{E}
COSString{M?}
COSString{ }
COSString{w}
COSString{hat }
COSString{it }
COSString{is}
COSString{:}

(This is what I get when printing the cosString in the above-mentioned code.) As far as I can see, the file contains only Latin characters, and the encoding is ISO-8859-1. Any ideas?

Regards,

Salil

1 Answer

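This is most likely a consequence of how the PDF was produced rather than a problem in PDFBox. Your particular PDF stores the text as separate string fragments so that it can adjust letter spacing or kerning between them (for example, in the array operand of the TJ operator). How text is split varies greatly from PDF to PDF, depending on the tool that created it.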

Typically, I would suggest simply merging all the different tokens into one big content string.
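
A minimal sketch of that idea, assuming the fragments have already been pulled out as plain Java strings (in PDFBox they would presumably come from the COSString objects the example prints). The class `FragmentMerger` and its methods are hypothetical names for illustration, not part of PDFBox. It merges the fragments, finds the first match in the merged text, and writes the replacement back into the fragments the match covered:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FragmentMerger {

    /** Concatenates the fragments in their original order. */
    static String merge(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String f : fragments) {
            sb.append(f);
        }
        return sb.toString();
    }

    /**
     * Replaces the first occurrence of search, even if it spans several
     * fragments. The replacement goes into the first fragment the match
     * touches; the matched parts of later fragments are cut out. The number
     * of fragments stays the same, so they can be written back in place.
     */
    static List<String> replaceAcross(List<String> fragments, String search, String replacement) {
        if (search.isEmpty()) {
            throw new IllegalArgumentException("search must not be empty");
        }
        List<String> result = new ArrayList<>(fragments);
        int matchStart = merge(fragments).indexOf(search);
        if (matchStart < 0) {
            return result; // nothing to replace
        }
        int matchEnd = matchStart + search.length();
        int offset = 0; // position of the current fragment in the merged text
        boolean replacementWritten = false;
        for (int i = 0; i < fragments.size(); i++) {
            String f = fragments.get(i);
            int fragStart = offset;
            int fragEnd = offset + f.length();
            offset = fragEnd;
            if (fragEnd <= matchStart || fragStart >= matchEnd) {
                continue; // fragment lies entirely outside the match
            }
            // Keep whatever part of the fragment is outside the match.
            String prefix = f.substring(0, Math.max(0, matchStart - fragStart));
            String suffix = f.substring(Math.min(f.length(), matchEnd - fragStart));
            if (!replacementWritten) {
                result.set(i, prefix + replacement + suffix);
                replacementWritten = true;
            } else {
                result.set(i, prefix + suffix);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The fragments from the question.
        List<String> fragments = Arrays.asList("E", "M?", " ", "w", "hat ", "it ", "is", ":");
        List<String> replaced = replaceAcross(fragments, "what it", "which one");
        System.out.println(merge(replaced)); // prints "EM? which one is:"
    }
}
```

Keeping the fragment count unchanged means each string can be written back to the token it came from. Any spacing adjustments between the fragments were tuned for the old text, though, so the result may need visual checking.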

3 Comments

Yes, I know that I can technically do that, but think about the headache: first create the merged string, then search in it, then go to each matching position and replace it... Since I have multiple strings to replace, this will cause unnecessary overhead. I tried 2-3 different files and got the same problem. I wonder how the example code works for others.
I would think there's code out there somewhere that has already solved this problem. I've never used PDFBox for modifying, personally, so I'm not sure.
I see. Thank you for your help though. It is technically correct. +1 for that. I am not marking it as 'accepted' as of now.
