Mispositioned textboxes in Reading doc, pdf files using Apache POI and Apache PDFBox

Question

I am trying to read and process .doc, .docx, and .pdf files in Java by converting each file into a single string, using Apache POI (for .doc and .docx) and Apache PDFBox (for .pdf).
This works fine until the document contains text boxes. If the layout is like this:

paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2

The text boxes seem to be appended at the end instead of appearing where they belong, i.e. between the paragraphs. This happens with both .doc and .pdf files, so both libraries, POI and PDFBox, show the same problem.

The code for reading a PDF file is:

    void pdf(String file) throws IOException {
        //Initialise file
        File myFile = new File(file);
        PDDocument pdDoc = null;
        try {
            //Load PDF
            pdDoc = PDDocument.load(myFile);
            //Create extractor
            PDFTextStripper pdf = new PDFTextStripper();
            //Extract text
            output = pdf.getText(pdDoc);
        }
        finally {
            if(pdDoc != null)
                //Close document
                pdDoc.close();
        }
    }

And the code for a .doc file is:

    void doc(String file) throws IOException {
        //initialise file
        File myFile = new File(file);
        //open the input stream in try-with-resources so it is always closed
        try (FileInputStream fis = new FileInputStream(myFile)) {
            //open document
            HWPFDocument document = new HWPFDocument(fis);
            //create extractor
            WordExtractor extractor = new WordExtractor(document);
            //get text from document
            output = extractor.getText();
        }
    }

You'll probably find the text boxes are anchored to a paragraph and you are inadvertently moving or losing the anchors. If you can get iText to give you information about the anchor of the text box, then perhaps you can preserve or reset it.
– Paul Jowett, Commented Jul 30, 2012 at 4:28
Is there a reason why you're using the underlying libraries directly? Using Apache Tika, which wraps these libraries and produces consistent text, would likely be much simpler.
– Gagravarr, Commented Jan 5, 2015 at 13:14

impeto · Accepted Answer · 2012-10-06 01:58:29Z

3

For PDFBox, enable position sorting on the PDFTextStripper before calling getText: pdf.setSortByPosition(true);
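
To illustrate why sorting by position can help, here is a minimal sketch that uses only the Java standard library. It is not PDFBox's implementation: the `Fragment` class, the `inReadingOrder` method, and the coordinates are hypothetical stand-ins, and the sketch assumes that y grows downward and that each fragment sits on its own line. The idea is that text can be stored in a file in a different order than it appears on the page, so ordering fragments by their coordinates recovers the visual reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical stand-in for a piece of text with a position on the page. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumption: y grows downward, as on a screen

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments top-to-bottom, then left-to-right, separated by spaces. */
    static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stored order mimics the question: paragraphs first, text boxes last.
        List<Fragment> stored = List.of(
                new Fragment("paragraph 1", 50, 100),
                new Fragment("paragraph 2", 50, 300),
                new Fragment("paragraph 3", 50, 500),
                new Fragment("textbox 1", 50, 200),
                new Fragment("textbox 2", 50, 400));
        // prints "paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3"
        System.out.println(inReadingOrder(stored));
    }
}
```

With real PDFs the stripper works on individual text positions rather than whole paragraphs, so multi-column layouts or overlapping boxes may still come out differently than expected.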


Diptman · Answer · 2018-07-29 05:02:00Z

0

Try the code below for PDF files. You can try a similar approach for .doc files as well.

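    // Imports assume PDFBox 2.x package names
    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
    import org.apache.pdfbox.text.PDFTextStripper;

    void extractPdfTexts(String file) {
        File myFile = new File(file);
        String output;
        // try-with-resources closes the document automatically
        try (PDDocument pdDocument = PDDocument.load(myFile)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // emit text in page-position order instead of content-stream order
            pdfTextStripper.setSortByPosition(true);
            output = pdfTextStripper.getText(pdDocument);
            System.out.println(output);
        } catch (InvalidPasswordException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }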


1 Comment

mkl Over a year ago

The relevant difference to the original code is the setSortByPosition(true) call, and that call had already been pointed out in impeto's answer.
