
I'm trying to read a .doc file with Java through the Apache POI library. Here is my code:

FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();

And I have this exception:

java.io.IOException: Unable to read entire header; 162 bytes read; expected 512 bytes
at org.apache.poi.poifs.storage.HeaderBlock.alertShortRead(HeaderBlock.java:226)
at org.apache.poi.poifs.storage.HeaderBlock.readFirst512(HeaderBlock.java:207)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at MicrosoftWordParser.getDocString(MicrosoftWordParser.java:277)
at MicrosoftWordParser.main(MicrosoftWordParser.java:86)

My file is not corrupted; I can open it with Microsoft Word.

I'm using POI 3.9 (the latest stable version).

Do you have any idea how to solve the problem?

Thank you.

5 Answers


readFirst512() reads the first 512 bytes of your InputStream and throws an exception if fewer bytes are available. The message says only 162 bytes could be read, so I think the file being opened is not big enough to be read by POI.
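To see whether that is what is happening, a small pre-check can be run on the file before it is handed to POI. The sketch below is only an illustration using the standard library: the class name DocPreflight and the method looksLikeOle2 are made up for this example and are not part of POI. It assumes the file is a binary .doc, which is stored as an OLE2 container with a 512-byte header that begins with the signature bytes D0 CF 11 E0 A1 B1 1A E1.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Apache POI.
final class DocPreflight {
    // Signature found at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // POI needs at least this many bytes for the header block.
    private static final int HEADER_SIZE = 512;

    /** Returns true if the file is at least 512 bytes long and starts with the OLE2 signature. */
    static boolean looksLikeOle2(Path path) throws IOException {
        if (Files.size(path) < HEADER_SIZE) {
            return false;
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(path))) {
            in.readFully(head);
        }
        return Arrays.equals(head, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocPreflight <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        // Print the absolute path so you can confirm which file is really being opened.
        System.out.println(path.toAbsolutePath() + ": " + Files.size(path)
                + " bytes, OLE2 header: " + looksLikeOle2(path));
    }
}
```

If this reports 162 bytes for the path your parser opens, the program is not reading the 276 KB document you open in Word. A passing check only shows that the file is an OLE2 container; it does not prove that it is a Word document.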


1 Comment

Thank you. But what am I supposed to do?

It is probably not a correct Word file. Is it really only 162 bytes long? Check in your filesystem.

I'd recommend creating a new Word file using Word or LibreOffice, and then trying to read it with your program.

1 Comment

My file size is 276 KB. I can open it with Word, so it's not corrupted.

You should try this program:

package file_opration;

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
  public static void main(String[] args) {
    File file = new File("filepath location");
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file)) {
      HWPFDocument document = new HWPFDocument(fis);
      WordExtractor extractor = new WordExtractor(document);
      String[] fileData = extractor.getParagraphText();
      for (String paragraph : fileData) {
        if (paragraph != null)
          System.out.println(paragraph);
      }
    } catch (Exception exep) {
      // Report the failure instead of silently swallowing it
      exep.printStackTrace();
    }
  }
}

1 Comment

The same exception is thrown :(

Ahh, you've got a File, and then you're spending loads of memory buffering the whole thing by hiding it behind an InputStream... Don't! If you have a File, give that to POI. Only give POI an InputStream if that's all you have.

Your code should be something like:

 NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
 HWPFDocument document = new HWPFDocument(fs.getRoot());

That'll be quicker and use less memory than reading it through an InputStream, and if there are problems with the file you should normally get slightly more helpful error messages too.



A 162-byte MS Word .doc is probably an "owner file": a temporary file that Word creates to signal that the document is locked/owned.

They have a .doc file extension but they are not MS Word Docs.
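If the parser walks a directory and opens every .doc it finds, one way to avoid these files is to filter them out first. The sketch below is only an illustration: OwnerFileFilter, isLikelyOwnerFile and realDocs are hypothetical names, and it relies on the assumption that Word names its owner files with a "~$" prefix, which is the usual convention but worth confirming on your own files.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class OwnerFileFilter {
    /** Assumption: Word owner (lock) files have names starting with "~$". */
    static boolean isLikelyOwnerFile(File f) {
        return f.getName().startsWith("~$");
    }

    /** Lists the .doc files in a directory, leaving out likely owner files. */
    static List<File> realDocs(File dir) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or it could not be read
        }
        for (File f : children) {
            boolean isDoc = f.isFile() && f.getName().toLowerCase().endsWith(".doc");
            if (isDoc && !isLikelyOwnerFile(f)) {
                result.add(f);
            }
        }
        return result;
    }
}
```

Only the files returned by realDocs would then be passed on to HWPFDocument. Closing the document in Word also removes the owner file in the normal case.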

