7

I am trying to extract text, with all of its attributes, from a PDF using PDFBox. I get everything I want except the color. I tried several ways to get the font color (including Getting Text Colour with PDFBox), but none of them worked. I have now copied code from the PageDrawer class of PDFBox, but the RGB value is still not correct.

protected void processTextPosition(TextPosition text) {

    Composite com;
    Color col;
    switch (this.getGraphicsState().getTextState().getRenderingMode()) {
        case PDTextState.RENDERING_MODE_FILL_TEXT:
            com = this.getGraphicsState().getNonStrokeJavaComposite();
            int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
            int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
            int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
            int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
            float[] cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
            PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
            break;
        case PDTextState.RENDERING_MODE_STROKE_TEXT:
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
            break;
        case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
            // basic support for text rendering mode "invisible"
            Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
            float[] components = {Color.black.getRed(), Color.black.getGreen(), Color.black.getBlue()};
            Color c1 = new Color(nsc.getColorSpace(), components, 0f);
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            break;
        default:
            System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
    }
}

I am using the code above. The values I get are r = 0, g = 0, b = 0; the cosp array contains [0.0]; inside the pd object, array = null and colorSpace = null; and the RGB value is always -16777216. Please help me. Thanks in advance.

  • I see you are getting black; what color are you expecting? Commented Dec 24, 2012 at 12:29
  • Something other than black... he is expecting colors which correspond with the text color. After trying this solution, I only got black as well. Commented Apr 7, 2021 at 20:37

5 Answers

5
+150

I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with; maybe I could host it somewhere outside SO? My PDF uses a palish blue color, and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.

A couple of suggestions:

  1. Recall that each value returned is a float between 0 and 1. If a value is accidentally cast to int, it will nearly always end up as 0. The linked code multiplies by 255 to get a range of 0 to 255.
  2. As the commenter said, the most common color in a PDF file is black, which is 0 0 0.

That is all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox and, as I said, I pretty much followed the link you gave.
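To make the two suggestions concrete, here is a small sketch that uses only the JDK, no PDFBox. The class name ColorComponents and its helper methods are made up for illustration, and the float components are simply the values I reported above divided by 255. It shows why an int cast gives all zeros, how scaling by 255 recovers the 0 to 255 range, and that -16777216 is just the packed ARGB form of opaque black.

```java
import java.awt.Color;

// Hypothetical helper for illustration only; it is not part of PDFBox.
class ColorComponents {

    // Scales one color component from the [0, 1] range to the [0, 255] range.
    static int toByte(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    // Splits a packed ARGB int (as returned by java.awt.Color.getRGB()) into channels.
    static String describe(int argb) {
        int a = (argb >>> 24) & 0xFF;
        int r = (argb >>> 16) & 0xFF;
        int g = (argb >>> 8) & 0xFF;
        int b = argb & 0xFF;
        return "a=" + a + " r=" + r + " g=" + g + " b=" + b;
    }

    public static void main(String[] args) {
        // Assumed components of a pale blue, each in [0, 1].
        float[] rgb = {0.584f, 0.702f, 0.843f};

        // Casting to int truncates every component below 1.0 to 0.
        System.out.println((int) rgb[0] + " " + (int) rgb[1] + " " + (int) rgb[2]);

        // Scaling by 255 first keeps the information.
        System.out.println(toByte(rgb[0]) + " " + toByte(rgb[1]) + " " + toByte(rgb[2]));

        // Opaque black packs to 0xFF000000, which prints as -16777216.
        int packed = new Color(0, 0, 0).getRGB();
        System.out.println(packed + " -> " + describe(packed));
    }
}
```

So an RGB value of -16777216 does not mean the call failed; it means the graphics state really held black at that point, which is also the default fill color.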

EDIT

Based on my comments, here is perhaps a minimally invasive way of doing it for PDF files like color.pdf.

In PDFStreamEngine.java, inside the try block of the processOperator method, one can add:

if (operation.equals("RG")) {
   // stroking color space
   System.out.println(operation);
   System.out.println(arguments);
} else if (operation.equals("rg")) {
   // non-stroking color space
   System.out.println(operation);
   System.out.println(arguments);
} else if (operation.equals("BT")) {
   System.out.println(operation);    
} else if (operation.equals("ET")) {
   System.out.println(operation);           
}

This will show you the information; it is then up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf:

BT rg [COSInt(1), COSInt(0), COSInt(0)] RG [COSInt(1), COSInt(0), COSInt(0)] ET BT ET BT rg [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}] RG [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}] ET ......

You can see an empty BT ET section in the above output; this is a section which is marked DEVICEGRAY, so no rg or RG operator is printed for it. All the other sections give you [0,1] values for the R, G and B components.
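As a rough idea of what "process the color information for each section" could look like, here is a sketch that is independent of PDFBox. The class TextColorTracker and its onOperator method are hypothetical; you would call it from wherever you see the operators, for example from the println spots above, after converting the COS arguments to floats. It assumes one fill color per BT ... ET section and records whatever is current at ET, so a section that changes color halfway through would only report its last color. It also handles the g operator (non-stroking gray), which the patch above does not print.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of PDFBox.
class TextColorTracker {

    // Current non-stroking (fill) color as RGB in [0, 1]; black is the PDF default.
    private float[] fill = {0f, 0f, 0f};
    private boolean inText = false;
    private final List<float[]> sectionColors = new ArrayList<>();

    // Feed every operator of the content stream, in order, with its numeric operands.
    void onOperator(String op, float... operands) {
        switch (op) {
            case "rg": // non-stroking DeviceRGB color
                if (operands.length == 3) {
                    fill = operands.clone();
                }
                break;
            case "g": // non-stroking DeviceGray level, expanded to RGB
                if (operands.length == 1) {
                    fill = new float[] {operands[0], operands[0], operands[0]};
                }
                break;
            case "BT":
                inText = true;
                break;
            case "ET":
                if (inText) {
                    // Simplification: one color per section, taken at its end.
                    sectionColors.add(fill.clone());
                    inText = false;
                }
                break;
            default:
                break; // RG and everything else do not affect the fill color
        }
    }

    List<float[]> colors() {
        return sectionColors;
    }

    public static void main(String[] args) {
        TextColorTracker tracker = new TextColorTracker();
        // Operator sequence modeled on the output above; the "g 0" is an assumed gray section.
        tracker.onOperator("BT");
        tracker.onOperator("rg", 1f, 0f, 0f);
        tracker.onOperator("RG", 1f, 0f, 0f);
        tracker.onOperator("ET");
        tracker.onOperator("BT");
        tracker.onOperator("g", 0f);
        tracker.onOperator("ET");
        tracker.onOperator("BT");
        tracker.onOperator("rg", 0.573f, 0.816f, 0.314f);
        tracker.onOperator("ET");
        for (float[] c : tracker.colors()) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

Recording at ET is only the simplest choice. If your files switch color inside a text section, you would want to record the fill color when the text-showing operators are seen instead.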

10 Comments

But it is not working for me. I solved this issue by recreating the graphics object. I overrode all the methods such as public PDRectangle findCropBox(PDPage pg), public PDRectangle findMediaBox(PDPage pg), public PDRectangle getMediaBox(PDPage pg), private PDRectangle findParentCropBox(PDPageNode node), public int findRotation(PDPage pg), public Integer getRotation(PDPage pg), public PDRectangle getCropBox(PDPage pg), public PDPageNode getParent(PDPage pg), and then I recreated the graphics object in my class. Frankly, I don't know what I did, but it worked for me. I will check again with your guidelines.
I tried the code again, but the output is still: 25 Dec, 2012 2:20:01 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: BDC 25 Dec, 2012 2:20:10 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: EMC DeviceGray 0.0
I used the same code from the link stackoverflow.com/questions/5861471/…
@demongolem I am not able to get the color using the above code. It's not working.
@demongolem Thank you. We have done almost the same thing and we are extracting the color now. However, as we all know, editing the source code to arrive at this solution is not very elegant. I am accepting your answer. I hope that the PDFBox people will see it and give us a method to get the color information out.
5

I also ended up doing something like this. Pasting code below, hope it helps someone.

import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.ResourceLoader;
import org.apache.pdfbox.util.TextPosition;

// Written against the PDFBox 1.x API.
public class Parser extends PDFTextStripper {

    public Parser() throws IOException {
        // Use the PageDrawer operator set instead of the default text stripper set.
        super(ResourceLoader.loadProperties(
                "org/apache/pdfbox/resources/PageDrawer.properties", true));
        super.setSortByPosition(true);
    }

    public void parse(String path) throws IOException {
        PDDocument doc = PDDocument.load(path);
        try {
            List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                this.processStream(page, page.getResources(), page.getContents().getStream());
            }
        } finally {
            // Always release the document, even if a page fails to parse.
            doc.close();
        }
    }

    // Called once per glyph; prints the fill color in effect at that moment.
    @Override
    protected void processTextPosition(TextPosition text) {
        try {
            PDGraphicsState graphicsState = getGraphicsState();
            System.out.println("R = " + graphicsState.getNonStrokingColor().getJavaColor().getRed());
            System.out.println("G = " + graphicsState.getNonStrokingColor().getJavaColor().getGreen());
            System.out.println("B = " + graphicsState.getNonStrokingColor().getJavaColor().getBlue());
        } catch (IOException ioe) {
            // Report the failure instead of silently swallowing it.
            System.err.println("Could not read text color: " + ioe.getMessage());
        }
    }

    public static void main(String[] args) throws IOException {
        Parser p = new Parser();
        p.parse("/Users/apple/Desktop/123.pdf");
    }
}

3

I found some code in one of my maintenance programs.
I do not know whether it will work for you, but please try it. Also check out this link: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/common/class-use/PDStream.html

It may help you.

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }
}

1

With PDFBox version 2.0+ it is necessary to add these operators in the constructor of your overridden PDFTextStripper:

addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());

Only then will getGraphicsState() return proper information.

See https://pdfbox.apache.org/2.0/migration.html

0

Here is PdfBox - How to load color from text, which should be able to answer your question with a much simpler solution than the other answers :).
