The Context

I've been working on a program that takes a PDF, highlights some words (via a PDFBox text markup annotation) and saves the new PDF.

For this I extend the PDFTextStripper class and override the writeString() method to get the TextPositions of each word (box), so that I know exactly where the text is in the PDF document in terms of coordinates (the TextPosition object gives me the coordinates of each word box). Then, based on that, I draw a PDRectangle highlighting the word I want.

The Problem

It works perfectly for all the documents I've tried so far, except for one where the positions I'm getting from the TextPositions seem to be wrong, leading to misplaced highlights.

This is the original document:
https://pdfhost.io/v/b1Mcpoy~s_Thomson.pdf

This is the document with a highlight on the very first word box that writeString() gives me with setSortByPosition(false), which is MicroRNA:
https://pdfhost.io/v/V6INb4Xet_Thomson.pdf
It should highlight MicroRNA, but it highlights a blank space above it (pink HL rectangle).

This is the document with a highlight on the very first word box that writeString() gives me with setSortByPosition(true), which is Original:
https://pdfhost.io/v/Lndh.j6ji_Thomson.pdf
It should highlight Original, but it highlights a blank space at the very beginning of the PDF document (pink HL rectangle).

I suppose this PDF contains something that makes PDFBox struggle to get the right positions, or this may be some sort of bug in PDFBox.

Technical Specification:

PDFBox 2.0.17
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB, x86_64

Coordinates Values

So for instance for the start and end of the MicroRNA word box, the TextPosition coordinates writeString() gives me are:

M letter

endX = 59.533783
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 35.886597
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
unicode = M
direction = -1.0

A letter

endX = 146.34933
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 129.18181
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
fontSizePt = 23
unicode = A
direction = -1.0

These values result in the wrong HL annotation I shared above, while for all other PDF documents the result is very precise, and I've tested many different ones. I'm clueless here and I'm not an expert on PDF positioning. I've tried to use the PDFBox debugger tool, but I can't read its output properly. Any help here would be much appreciated. Let me know if I can provide more evidence. Thanks.
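
A quick sanity check on the values above, as a minimal sketch in plain Java. The class and method names are made up for illustration and are not part of PDFBox. Subtracting y from pageHeight reproduces endY, which suggests that y is measured from the top of the page while endY is measured from the bottom:

```java
// Hypothetical helper, not part of PDFBox: flips a top-down y into a bottom-up y.
final class TopDownCheck {
    static double flipY(double pageHeight, double yFromTop) {
        return pageHeight - yFromTop;
    }

    public static void main(String[] args) {
        // Values copied from the "M letter" dump above.
        double pageHeight = 781.96533;
        double y = 99.26935;
        double endY = 682.696;

        double flipped = flipY(pageHeight, y);
        // Compare with a tolerance, since the dumped values are rounded floats.
        System.out.println(Math.abs(flipped - endY) < 1e-3);
    }
}
```

This only shows how the dumped numbers relate to each other. It says nothing yet about which page box they are relative to.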

EDIT

Note that text extraction is working just fine.

My Code

First I create an array of coordinates with a few values from TextPosition object of the first and last character I want to HL:

private void extractHLCoordinates(TextPosition firstPosition, TextPosition lastPosition, int pageNumber) {
    double firstPositionX = firstPosition.getX();
    double firstPositionY = firstPosition.getY();
    double lastPositionEndX = lastPosition.getEndX();
    double lastPositionY = lastPosition.getY();

    double height = firstPosition.getHeight();
    double width = firstPosition.getWidth();
    int rotation = firstPosition.getRotation();

    double[] wordCoordinates = {firstPositionX, firstPositionY, lastPositionEndX, lastPositionY, pageNumber, 
    height, width, rotation};

    
    ...
}

Now it's drawing time based on the extracted coordinates:

for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {

    PDPage page = pdDocument.getPage(pageIndex);
    List<PDAnnotation> annotations = page.getAnnotations();

    int rotation;
    double pageHeight = page.getMediaBox().getHeight();
    double pageWidth  = page.getMediaBox().getWidth();
    
    // each CoordinatePoint obj holds the double array with the 
    // coordinates of each word I want to HL - see the previous method
    for (CoordinatePoint coordinate : coordinates) {
        double[] wordCoordinates = coordinate.getCoordinates();
        
        int pageNumber = (int) wordCoordinates[4];

        // if the current coordinates are not related to the current page, 
        //ignore them
        if (pageNumber == (pageIndex + 1)) {
            // rotation stored from the TextPosition: portrait, landscape...
            rotation = (int) wordCoordinates[7];

            // unpack the values stored by extractHLCoordinates()
            double firstPositionX = wordCoordinates[0];
            double firstPositionY = wordCoordinates[1];
            double lastPositionEndX = wordCoordinates[2];
            double lastPositionY = wordCoordinates[3];
            double height = wordCoordinates[5];

            double minX;
            double maxX;
            double minY;
            double maxY;
            
            if (rotation == 90) {

                double width = wordCoordinates[6];
                width = (pageHeight * width) / pageWidth;

                //defining coordinates of a rectangle
                maxX = firstPositionY;
                minX = firstPositionY - height;
                minY = firstPositionX;
                maxY = firstPositionX + width;
            } else {
                minX = firstPositionX;
                maxX = lastPositionEndX;
                minY = pageHeight - firstPositionY;
                maxY = pageHeight - lastPositionY + height;
            }
                    
            // Finally I draw the Rectangle
            PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);

            PDRectangle pdRectangle = new PDRectangle();
            pdRectangle.setLowerLeftX((float) minX);
            pdRectangle.setLowerLeftY((float) minY);
            pdRectangle.setUpperRightX((float) maxX);
            pdRectangle.setUpperRightY((float) ((float) maxY + height));

            txtMark.setRectangle(pdRectangle);

            // And the QuadPoints
            float[] quads = new float[8];
            quads[0] = pdRectangle.getLowerLeftX();  // x1
            quads[1] = pdRectangle.getUpperRightY() - 2; // y1
            quads[2] = pdRectangle.getUpperRightX(); // x2
            quads[3] = quads[1]; // y2
            quads[4] = quads[0];  // x3
            quads[5] = pdRectangle.getLowerLeftY() - 2; // y3
            quads[6] = quads[2]; // x4
            quads[7] = quads[5]; // y4

            txtMark.setQuadPoints(quads);
            ...
        }
    }
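
The QuadPoints assignment above could be factored into a small helper. This is only a sketch in plain Java: QuadPoints.fromRect is a hypothetical name, it leaves out the 2-unit vertical offset used above, and it keeps the same corner order as that code (upper-left, upper-right, lower-left, lower-right).

```java
// Hypothetical helper, not part of PDFBox.
final class QuadPoints {
    /** Builds the 8-value quad array from the corners of an axis-aligned rectangle. */
    static float[] fromRect(float llx, float lly, float urx, float ury) {
        return new float[] {
            llx, ury, // x1, y1: upper-left
            urx, ury, // x2, y2: upper-right
            llx, lly, // x3, y3: lower-left
            urx, lly  // x4, y4: lower-right
        };
    }

    public static void main(String[] args) {
        float[] quads = fromRect(1f, 2f, 3f, 4f);
        System.out.println(java.util.Arrays.toString(quads));
    }
}
```

The resulting array can be passed to setQuadPoints() in the same way as the quads array above.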
  • If the pdf was made of images, you shouldn't be able to use the text extraction. I'm not sure though if that's your issue. Commented Oct 14, 2020 at 23:32
  • Unfortunately you don't show your pivotal code, so it is unclear which pdfbox coordinate normalizations you have considered and which not. Have you for example considered the crop box normalization, cf. this answer? Commented Oct 15, 2020 at 6:45
  • The current version is 2.0.21. Commented Oct 15, 2020 at 7:46
  • Your QuadPoints coordinates are computed relative to the CropBox, but they need to be relative to the MediaBox. For this document the CropBox is smaller than the MediaBox, so the highlight is not in the correct position. Adjust x by CropBox.LLX - MediaBox.LLX and y by MediaBox.URY - CropBox.URY and the highlight will be in the right position. Commented Oct 15, 2020 at 11:56
  • No, always relative to MediaBox. But most of the documents have MediaBox=CropBox, so the difference I mentioned is 0. Commented Oct 15, 2020 at 13:39

1 Answer

Your QuadPoints coordinates are computed relative to the CropBox, but they need to be relative to the MediaBox. For this document the CropBox is smaller than the MediaBox, so the highlight is not in the correct position. Adjust x by CropBox.LLX - MediaBox.LLX and y by MediaBox.URY - CropBox.URY and the highlight will be in the right position.
The adjustment above works for pages with Rotate = 0. If Rotate != 0, further adjustments might be needed depending on how the coordinates are returned by PDFBox (I'm not very familiar with the PDFBox API).
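
As a rough illustration of the adjustment for Rotate = 0, here is a minimal sketch in plain Java. Box is a hypothetical stand-in for PDRectangle so that the example runs without PDFBox. The box sizes and text coordinates are made-up values, and the horizontal shift is assumed to be CropBox.LLX minus MediaBox.LLX.

```java
// Hypothetical stand-in for PDFBox's PDRectangle (lower-left and upper-right corners).
final class Box {
    final double llx, lly, urx, ury;

    Box(double llx, double lly, double urx, double ury) {
        this.llx = llx; this.lly = lly; this.urx = urx; this.ury = ury;
    }

    double height() { return ury - lly; }
}

// Hypothetical helper, not part of PDFBox.
final class CropBoxShift {
    /**
     * Converts top-down, CropBox-relative text coordinates into a
     * bottom-up rectangle {minX, minY, maxX, maxY}. Rotate = 0 only.
     */
    static double[] toMediaBoxRect(double x, double yFromTop, double endX,
                                   double height, Box mediaBox, Box cropBox) {
        double dx = cropBox.llx - mediaBox.llx; // horizontal shift
        double dy = mediaBox.ury - cropBox.ury; // vertical shift
        double minY = mediaBox.height() - yFromTop - dy;
        return new double[] {x + dx, minY, endX + dx, minY + height};
    }

    public static void main(String[] args) {
        Box media = new Box(0, 0, 600, 800);  // made-up page size
        Box crop = new Box(10, 10, 590, 790); // made-up, smaller CropBox
        double[] r = toMediaBoxRect(35, 100, 60, 14, media, crop);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

When the CropBox equals the MediaBox, both shifts are 0 and the rectangle is the same as without the adjustment.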

OP EDIT

Posting the changes I've made to my code here in case they help others. Note that I haven't tried anything for rotation == 90 yet. I'll update this once I have that piece.

Before

...
if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX;
    maxX = lastPositionEndX;
    minY = pageHeight - firstPositionY;
    maxY = pageHeight - lastPositionY + height;
}
...

After

...

PDRectangle mediaBox = page.getMediaBox();
PDRectangle cropBox = page.getCropBox();

if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    maxX = lastPositionEndX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    minY = pageHeight - firstPositionY - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
    maxY = pageHeight - lastPositionY + height - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
}
...