0

I am trying to read and process .doc, .docx, and .pdf files in Java by converting each one into a single string, using Apache POI (for .doc and .docx) and Apache PDFBox (for .pdf).
This works fine until a document contains text boxes. If the layout is like this:

paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2

It seems to be adding the text boxes at the end instead of where they belong, i.e. between the paragraphs. This happens with both doc and pdf files, so both libraries, POI and PDFBox, show the same problem.

The code for reading a pdf file is:

    void pdf(String file) throws IOException {
        //Initialise file
        File myFile = new File(file);
        PDDocument pdDoc = null;
        try {
            //Load PDF
            pdDoc = PDDocument.load(myFile);
            //Create extractor
            PDFTextStripper pdf = new PDFTextStripper();
            //Extract text
            output = pdf.getText(pdDoc);
        }
        finally {
            if(pdDoc != null)
                //Close document
                pdDoc.close();
        }
    }

And the code for a doc file is:

    void doc(String file) throws IOException {
        //Initialise file
        File myFile = new File(file);
        //Create file input stream; try-with-resources closes it afterwards
        try (FileInputStream fis = new FileInputStream(myFile)) {
            //Open document
            HWPFDocument document = new HWPFDocument(fis);
            //Create extractor
            WordExtractor extractor = new WordExtractor(document);
            //Get text from document
            output = extractor.getText();
        }
    }

2
  • You'll probably find the text boxes are anchored to a paragraph and you are inadvertently moving or losing the anchors. If you can get iText to give you information about the anchor of the text box, then perhaps you can preserve or reset it. Commented Jul 30, 2012 at 4:28
  • Is there a reason why you're using the underlying libraries directly? Using Apache Tika would likely be much simpler: it wraps these libraries and produces consistent text. Commented Jan 5, 2015 at 13:14

2 Answers

3

For PDFBox, enable position sorting on the stripper before calling getText: pdf.setSortByPosition(true);
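
To illustrate why sorting by position can help, here is a small conceptual sketch using only the Java standard library. It is not PDFBox's implementation. The `Fragment` class, its coordinates, and both method names are hypothetical and exist only for this example. It assumes, as in the question, that the text boxes are stored after the paragraphs even though they sit between them on the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Conceptual sketch only: this is NOT how PDFBox is implemented internally.
final class ReadingOrderSketch {

    // Hypothetical fragment: a run of text plus the position where it is drawn.
    static final class Fragment {
        final String text;
        final double top;   // distance from the top of the page (assumed units)
        final double left;  // distance from the left edge of the page

        Fragment(String text, double top, double left) {
            this.text = text;
            this.top = top;
            this.left = left;
        }
    }

    // Joins fragments in the order they are stored, without looking at positions.
    static String inStoredOrder(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    // Joins fragments top-to-bottom, then left-to-right.
    static String inPositionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.top)
                .thenComparingDouble(f -> f.left));
        return inStoredOrder(sorted);
    }

    public static void main(String[] args) {
        // Assumed storage order: paragraphs first, text boxes last.
        List<Fragment> stored = Arrays.asList(
                new Fragment("paragraph 1", 100, 50),
                new Fragment("paragraph 2", 300, 50),
                new Fragment("paragraph 3", 500, 50),
                new Fragment("textbox 1", 200, 80),
                new Fragment("textbox 2", 400, 80));

        System.out.println(inStoredOrder(stored));
        // prints: paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2
        System.out.println(inPositionOrder(stored));
        // prints: paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
    }
}
```

Whether position sorting gives the right result for a real file depends on its layout. Multi-column pages, for example, may come out interleaved, so it is worth checking the output on representative documents.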

0

Try the code below for pdf. You can try a similar approach for doc as well.

    void extractPdfTexts(String file) {
        File myFile = new File(file);
        String output;
        //try-with-resources closes the document automatically
        try (PDDocument pdDocument = PDDocument.load(myFile)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            //Order the extracted text by its position on the page
            pdfTextStripper.setSortByPosition(true);
            output = pdfTextStripper.getText(pdDocument);
            System.out.println(output);
        } catch (InvalidPasswordException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

1 Comment

The relevant difference to the original code is the setSortByPosition(true) call, and that call had already been pointed out in impeto's answer.
