I am trying to read and process .doc, .docx, and .pdf files in Java by converting each one into a single string, using Apache POI (for .doc and .docx) and Apache PDFBox (for .pdf).
It works fine until it encounters textboxes.
If the format is like this:
paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3
Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2
It seems to be appending the textboxes at the end instead of placing them where they belong, i.e. between the paragraphs. The problem occurs with both .doc and .pdf files, so both libraries, POI and PDFBox, show the same behaviour.
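To make the expected ordering concrete, here is a minimal sketch in plain Java with no POI or PDFBox calls. The `Fragment` class and its `position` field are hypothetical: they stand in for whatever document-order information the extractors would need to provide, and are not part of either library's API. The sketch only shows the result I am after, assuming such positions were available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrderSketch {

    // Hypothetical container: a piece of text plus its position in the document.
    static final class Fragment {
        final int position;
        final String text;

        Fragment(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    // Sorts a copy of the fragments by position and joins their text with spaces.
    static String join(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt(f -> f.position));
        return sorted.stream()
                .map(f -> f.text)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The order I currently get back: all paragraphs first, then all textboxes.
        List<Fragment> extracted = Arrays.asList(
                new Fragment(1, "paragraph 1"),
                new Fragment(3, "paragraph 2"),
                new Fragment(5, "paragraph 3"),
                new Fragment(2, "textbox 1"),
                new Fragment(4, "textbox 2"));

        // Prints the fragments interleaved in document order.
        System.out.println(join(extracted));
    }
}
```

In other words, the text itself is extracted correctly; what gets lost is where each textbox sits relative to the paragraphs.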
The code for reading a PDF file is:
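    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PDFBox 2.x package; in 1.8 the class is org.apache.pdfbox.util.PDFTextStripper
    import org.apache.pdfbox.text.PDFTextStripper;

    // 'output' is a String field of the enclosing class
    void pdf(String file) throws IOException {
        // Initialise file
        File myFile = new File(file);
        PDDocument pdDoc = null;
        try {
            // Load PDF
            pdDoc = PDDocument.load(myFile);
            // Create extractor
            PDFTextStripper stripper = new PDFTextStripper();
            // Extract text
            output = stripper.getText(pdDoc);
        } finally {
            // Close document
            if (pdDoc != null) {
                pdDoc.close();
            }
        }
    }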
And the code for a .doc file is:
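    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    // 'output' is a String field of the enclosing class
    void doc(String file) throws IOException {
        // Initialise file
        File myFile = new File(file);
        // Create file input stream; try-with-resources closes it even if parsing fails
        try (FileInputStream fis = new FileInputStream(myFile)) {
            // Open document
            HWPFDocument document = new HWPFDocument(fis);
            // Create extractor
            WordExtractor extractor = new WordExtractor(document);
            // Get text from document
            output = extractor.getText();
        }
    }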