1

I want to read an existing PDF file and extract not only the text but also the formatting information, such as font styles (bold, italic), paragraphs, images, and tables. Basically, I want to generate HTML that resembles the PDF.

Is there a code library for doing this? I am looking for an open source library.

Regards, Tina Agrawal

2
  • What about a PDF made from scanned images? Does it contain text? Commented May 21, 2010 at 10:17
  • The PDF contains text, images, and tables. It may have been produced by converting a Word document to PDF. Commented May 21, 2010 at 10:35

2 Answers

3

Try PDFBox or iText. Both are open source and can handle text, images, tables, etc.
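
Once a library gives you each text run together with the name of its font, a common heuristic is to infer bold and italic from that font name (for example `Helvetica-BoldOblique`). The sketch below shows only that mapping step in plain Java. The `Run` class and the sample font names are hypothetical, and the extraction of the runs from the PDF is assumed to have been done already with whichever library you pick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FontStyleHtml {

    /** Hypothetical holder: one text run plus the font name reported for it. */
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Heuristic: treat a font as bold if its name mentions "bold". */
    static boolean isBold(String fontName) {
        return fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Heuristic: treat a font as italic if its name mentions "italic" or "oblique". */
    static boolean isItalic(String fontName) {
        String n = fontName.toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders the runs as one HTML paragraph, wrapping styled runs in <b> and <i>. */
    static String toHtml(List<Run> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (Run r : runs) {
            String t = escape(r.text);
            if (isItalic(r.fontName)) {
                t = "<i>" + t + "</i>";
            }
            if (isBold(r.fontName)) {
                t = "<b>" + t + "</b>";
            }
            sb.append(t);
        }
        return sb.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Plain, ", "Helvetica"));
        runs.add(new Run("bold, ", "Helvetica-Bold"));
        runs.add(new Run("italic", "Helvetica-Oblique"));
        System.out.println(toHtml(runs));
    }
}
```

This heuristic misses fonts whose names do not describe their style, so treat it as a starting point only.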


0

If you want an exact replica of the page, you may need to create an image of the page and overlay invisible text on it. You can see some ideas of what is possible with PDF to HTML conversion on our blog at http://www.jpedal.org/PDFblog/2012/08/4-ways-to-convert-pdf-to-html5/.
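
As a rough illustration of the image-plus-invisible-text idea, the sketch below builds the HTML for one page: the rendered page image sits in a relatively positioned container, and each word is placed over it as a transparent, absolutely positioned span so that it stays selectable and searchable. The `Word` class, the image file name, and the pixel coordinates are all assumptions; rendering the page to an image and extracting word positions are taken as already done.

```java
import java.util.ArrayList;
import java.util.List;

final class InvisibleTextOverlay {

    /** Hypothetical holder: a word and its position in pixels from the page's top-left corner. */
    static final class Word {
        final String text;
        final int left;
        final int top;
        final int fontSizePx;

        Word(String text, int left, int top, int fontSizePx) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.fontSizePx = fontSizePx;
        }
    }

    /** Escapes characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * Builds one page: the image provides the exact appearance, the spans provide the text.
     * Coordinates are assumed to be already converted from PDF space (origin at the
     * bottom-left) to HTML space (origin at the top-left).
     */
    static String pageHtml(String imageUrl, int width, int height, List<Word> words) {
        StringBuilder sb = new StringBuilder();
        sb.append("<div style=\"position:relative;width:").append(width)
          .append("px;height:").append(height).append("px\">\n");
        sb.append("<img src=\"").append(escape(imageUrl)).append("\" width=\"")
          .append(width).append("\" height=\"").append(height).append("\" alt=\"\">\n");
        for (Word w : words) {
            sb.append("<span style=\"position:absolute;left:").append(w.left)
              .append("px;top:").append(w.top)
              .append("px;font-size:").append(w.fontSizePx)
              .append("px;color:transparent\">")
              .append(escape(w.text)).append("</span>\n");
        }
        return sb.append("</div>\n").toString();
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("Hello", 50, 40, 12));
        words.add(new Word("World", 90, 40, 12));
        System.out.print(pageHtml("page1.png", 600, 800, words));
    }
}
```

The trade-off is that the page looks identical to the PDF, but the output is heavier than real HTML text and does not reflow on small screens.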
