Read PDF through Java and get the HTML Content

Question

I want to read an existing PDF file and extract not only the text but also the formatting information, such as fonts (bold, italic), paragraphs, images and tables. Basically, I want to write HTML that resembles the PDF.

Is there a code library for doing this? I am looking for an open source library.

Regards, Tina Agrawal

What about a PDF made from scanned images? Does it contain text?
– Ingo, Commented May 21, 2010 at 10:17
The PDF contains all the text, images and tables. It may have been converted from a Word document.
– Tina Agrawal, Commented May 21, 2010 at 10:35

Mick MacCallum · Accepted Answer · 2012-10-06 06:31:15Z

3

Try PDFBox or iText. They are open source and can handle text, images, tables, etc.
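
As a rough illustration of the HTML-writing half of the problem, here is a minimal sketch that uses only the JDK. It assumes a PDF library has already given you runs of text together with the name of the font each run uses; the `Run` class and `StyledRunsToHtml` are hypothetical names made up for this example, not part of PDFBox or iText. Style is guessed from the font name, which is only a heuristic: many fonts carry "Bold" or "Italic" in their names, but not all do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: turns text runs (text plus font name) into simple HTML.
class StyledRunsToHtml {

    // One run of text in a single font; a PDF library would supply these in practice.
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // Escape the characters that are significant in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Heuristic (an assumption): the font name carries the style, e.g. "Arial-BoldItalicMT".
    static String toHtml(List<Run> runs) {
        StringBuilder html = new StringBuilder("<p>");
        for (Run run : runs) {
            String name = run.fontName.toLowerCase(Locale.ROOT);
            boolean bold = name.contains("bold");
            boolean italic = name.contains("italic") || name.contains("oblique");
            if (bold) html.append("<b>");
            if (italic) html.append("<i>");
            html.append(escape(run.text));
            if (italic) html.append("</i>");
            if (bold) html.append("</b>");
        }
        return html.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Plain, ", "Helvetica"),
                new Run("bold", "Helvetica-Bold"),
                new Run(" and ", "Helvetica"),
                new Run("bold italic", "Helvetica-BoldOblique"));
        // Prints: <p>Plain, <b>bold</b> and <b><i>bold italic</i></b></p>
        System.out.println(toHtml(runs));
    }
}
```

Paragraphs, images and tables need more work than this, because a PDF stores positioned glyphs and drawing commands rather than document structure.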

edited Oct 6, 2012 at 6:31

Mick MacCallum

answered Aug 22, 2012 at 11:46

Hatter Bush

mark stephens · Accepted Answer · 2012-08-22 12:27:27Z

0

If you want an exact version of the page, you may need to create an image of the page and put invisible text on top of it. You can see some ideas of what is possible with PDF to HTML conversion on our blog at http://www.jpedal.org/PDFblog/2012/08/4-ways-to-convert-pdf-to-html5/.
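
To make the image-plus-invisible-text idea concrete, here is a minimal sketch using only the JDK. It is not how JPedal does it; `PageOverlayHtml` and `Word` are hypothetical names, and the sketch assumes you already have a rendered page image and word positions converted to CSS pixels measured from the top-left corner (PDF coordinates start at the bottom-left by default, so a conversion step is needed first).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: page image as the visual layer, transparent text on top of it.
class PageOverlayHtml {

    // A word with its position in CSS pixels from the top-left corner (an assumption).
    static final class Word {
        final String text;
        final double x;
        final double y;
        final double fontSize;

        Word(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    // Escape characters that are significant in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String toHtml(String imageUrl, int width, int height, List<Word> words) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%dpx;height:%dpx\">\n", width, height));
        // The rendered page supplies the exact appearance.
        html.append(String.format(Locale.ROOT,
                "<img src=\"%s\" width=\"%d\" height=\"%d\" alt=\"\">\n",
                escape(imageUrl), width, height));
        // Transparent text stays selectable and searchable without being drawn.
        for (Word w : words) {
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpx;top:%.1fpx;"
                            + "font-size:%.1fpx;color:transparent\">%s</span>\n",
                    w.x, w.y, w.fontSize, escape(w.text)));
        }
        return html.append("</div>\n").toString();
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("Hello", 72.0, 90.5, 12.0),
                new Word("world", 104.0, 90.5, 12.0));
        System.out.print(toHtml("page1.png", 612, 792, words));
    }
}
```

The trade-off is that the result looks exact but the HTML carries no real structure: there are no headings, paragraphs or tables, only positioned words.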

answered Aug 22, 2012 at 12:27

mark stephens
