0

I am working with iText and need to extract the font color of particular titles in a PDF. Any idea how to do this?

  • @PradeepSimha A concise question like this doesn't need what the asker has tried, more often than not. In other words, there's not much that the OP could have tried without actually acquiring the answer. Commented Dec 21, 2012 at 13:51
  • The answer depends on how the titles were added to the PDF. iText was not really created for this kind of task. Commented Dec 21, 2012 at 13:54
  • @PradeepSimha I've explored TextRenderInfo and successfully extracted the font family and calculated the size; however, color is nowhere to be found in the documentation :( Commented Dec 21, 2012 at 14:05
  • @KlasLindbäck Would you have any literature regarding this? I've only found comments stating this is hard, but I haven't found why. Commented Dec 21, 2012 at 14:09
  • @Guevara I have seen explanations on the iText mailing list (it is available on nabble.com). The main problem with extracting information from a PDF is that there are so many ways to construct the same output. The PDF may be a number of images, or the title can (theoretically) be composed of one text segment per letter. The text extraction tool is fairly new and was created to extract just the text (because that was what the author needed). Commented Dec 21, 2012 at 15:02

2 Answers

3

After having spent the last 6 months with iTextSharp (.NET port of iText), I'll try to explain how you can achieve what you want. Although this is not a precise answer, it may very well lead you to a place where you could do additional homework to achieve it.

A PDF renderer keeps an in-memory "graphics state", which is roughly a set of values specifying the current color, line width, line style, etc. All rendering operations (including text rendering) use this graphics state to determine their output. For example, you can set the current color to blue and then draw a few lines, and all those lines will be blue, so you don't have to specify the color for each line-drawing operation.
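To make the idea concrete, here is a minimal sketch in Java. The classes ToyGraphicsState, ToyCanvas and ToyCanvasDemo are hypothetical names made up for illustration; they are not part of iText or iTextSharp, and a real graphics state holds many more values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not part of iText: the current color is stored once
// and every later drawing operation reads it from the state.
class ToyGraphicsState {
    String color = "black";   // PDF's initial color is black
    double lineWidth = 1.0;   // PDF's initial line width is 1
}

class ToyCanvas {
    final ToyGraphicsState state = new ToyGraphicsState();
    final List<String> output = new ArrayList<>();

    void setColor(String color) {
        state.color = color;
    }

    // The line operation carries no color of its own; it uses the state.
    void drawLine(int x1, int y1, int x2, int y2) {
        output.add("line (" + x1 + "," + y1 + ")-(" + x2 + "," + y2 + ") in " + state.color);
    }
}

class ToyCanvasDemo {
    public static void main(String[] args) {
        ToyCanvas canvas = new ToyCanvas();
        canvas.setColor("blue");
        canvas.drawLine(0, 0, 10, 0);
        canvas.drawLine(0, 5, 10, 5);
        canvas.output.forEach(System.out::println);
    }
}
```

Both lines come out blue even though neither drawLine call mentions a color, which is why a text extractor has to track the state itself to know what color a piece of text was drawn in.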

For coloring, we have two variables in the graphics state: the current stroking color and the current non-stroking color. The stroking color is used for outlines such as lines and borders (if you have used GDI+, this would roughly be a System.Drawing.Pen), whereas the non-stroking color is used for fill operations (in GDI+, a System.Drawing.Brush). Text is normally painted with the non-stroking color, since glyphs are filled shapes. This holds for the default text rendering mode (fill); other modes, set with the Tr operator, can stroke the glyph outlines with the stroking color instead.
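Which of the two colors actually paints a glyph depends on the text rendering mode defined in the PDF specification (set with the Tr operator, default 0). A rough sketch of that mapping follows; TextPaint and colorSourceFor are hypothetical names, not an iText API.

```java
// Hypothetical helper, not an iText API: maps the PDF text rendering mode
// (set by the Tr operator) to the color(s) that paint the glyphs.
class TextPaint {
    enum Source { NON_STROKING, STROKING, BOTH, NONE }

    static Source colorSourceFor(int renderMode) {
        switch (renderMode) {
            case 0: case 4:
                return Source.NON_STROKING; // fill (mode 4 also adds the text to the clipping path)
            case 1: case 5:
                return Source.STROKING;     // stroke (mode 5 also clips)
            case 2: case 6:
                return Source.BOTH;         // fill, then stroke (mode 6 also clips)
            case 3: case 7:
                return Source.NONE;         // invisible text (mode 7 only clips)
            default:
                throw new IllegalArgumentException("unknown text rendering mode: " + renderMode);
        }
    }
}
```

For ordinary titles the mode is almost always 0, so reading the non-stroking color is usually enough, but an extractor that wants to be safe should check the mode as well.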

Right, now that you know the theoretical part, here's what you need to do. Locate the PdfContentStreamProcessor class in the iText source code. There you'll find PopulateOperators(), which lists all the operators that iText can currently parse. There are many operators related to coloring, and their details cannot be summed up here (see the PDF specification), but in short: the CS and cs operators set the current stroking and non-stroking color spaces (many are supported, including RGB, Grayscale, CMYK, L*a*b and others), and the SC and sc operators set the current stroking and non-stroking colors. Again, there is a whole lot of detail about setting color spaces and then interpreting the stroking and non-stroking color values in light of the current color space, for which you should see the PDF specification. Plus, there are a couple of operators that push and pop the graphics state (q and Q), which can complicate things further.

In short, you'll need to add support for operators including CS, cs, G, g, RG, rg, K, k, SC, sc, SCN and scn. Most of them are not supported by iTextSharp at this point, so you have to write your own class for each of them (implementing the IContentOperator interface).
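To show the shape of the bookkeeping involved, here is a toy interpreter in Java that does not use iText at all. ToyColorTracker is a hypothetical class, and it makes strong simplifying assumptions: tokens are separated by whitespace, string operands contain no operator-like words, and only the non-stroking device color operators g, rg and k plus q, Q and Tj are handled. The stroking counterparts (G, RG, K) and the color-space operators (CS, cs, SC, sc, SCN, scn) are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical toy interpreter, not iText code. It tracks only the
// non-stroking color and records it each time text is shown with Tj.
class ToyColorTracker {
    // Raw color components: 1 value = gray, 3 = RGB, 4 = CMYK.
    private double[] fill = {0.0};                         // initial color: black in DeviceGray
    private final Deque<double[]> saved = new ArrayDeque<>();
    final List<String> textColors = new ArrayList<>();

    void process(String content) {
        List<Double> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            try {
                operands.add(Double.parseDouble(token));   // numbers are operands
                continue;
            } catch (NumberFormatException e) {
                // not a number, so treat the token as an operator (or ignore it)
            }
            switch (token) {
                case "q":  saved.push(fill.clone()); break;               // save graphics state
                case "Q":  if (!saved.isEmpty()) fill = saved.pop(); break; // restore it
                case "g":  setFill(operands, 1); break;                   // gray
                case "rg": setFill(operands, 3); break;                   // RGB
                case "k":  setFill(operands, 4); break;                   // CMYK
                case "Tj": textColors.add(Arrays.toString(fill)); break;  // text shown
                default:   break;                                         // everything else is ignored
            }
            operands.clear();
        }
    }

    private void setFill(List<Double> operands, int count) {
        if (operands.size() < count) return;               // malformed input: keep the current color
        double[] color = new double[count];
        for (int i = 0; i < count; i++) {
            color[i] = operands.get(operands.size() - count + i);
        }
        fill = color;
    }
}
```

Feeding it `q 1 0 0 rg BT (Title) Tj ET Q BT (Body) Tj ET` records `[1.0, 0.0, 0.0]` for the first text and `[0.0]` for the second, because Q restores the color that was current before q. In iTextSharp the same logic would live in one IContentOperator implementation per operator rather than in a single switch.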

You can get a lot of implementation help from here. Although this guy doesn't implement it in all its detail (which I can tell would be a LOT of work, especially handling all the color spaces PDF supports), this should give you a very good starting point.

Hope this helps.


0

PDF Clown (available natively for both Java and .NET) supports extraction of text style information (including text color and text rendering mode) and handles almost all the graphics operators out of the box (see TextInfoExtractionSample in its codebase).

This open source/free software library features a versatile content engine (see the ContentScanner class) capable of performing disparate tasks such as content parsing, content extraction, content editing, content rendering and printing (the last one is only partially developed at the moment).

Its object model is rich and cohesive, mirroring the official PDF Specification without its quirkiness. Just 2 base classes govern all the logic: PdfObject at the root of the primitive low-level PDF types (such as dictionaries, arrays, numbers...) and PdfObjectWrapper at the root of the specialized high-level PDF entities (such as pages, annotations, bookmarks...).

I'm its developer, so I could possibly be biased, but if you want to give it a try, I suggest you check it out from its SVN repository on sourceforge.net, as version 0.1.2 (currently under development) introduces lots of enhancements over the last release.
