9

I am trying to read doc and docx files. Here is the code:

  static String distination="E:\\         
  static String docFileName="Requirements.docx";
 public static void main(String[] args) throws FileNotFoundException, IOException {
    // TODO code application logic here
    ReadFile rf= new ReadFile();
    rf.ReadFileParagraph(distination+docFileName);


  }
  public void ReadFileParagraph(String path) throws FileNotFoundException, IOException
    {
        FileInputStream fis;
        File file = new File(path);
        fis=new FileInputStream(file.getAbsolutePath());
           String filename=file.getName();

        String fileExtension=fileExtension(path);
        if(fileExtension.equals("doc"))
        {
             HWPFDocument document=new HWPFDocument(fis);
             WordExtractor DocExtractor = new WordExtractor(document);
             ReadDocFile(DocExtractor,filename);

        }
        else if(fileExtension.equals("docx"))
        {

            XWPFDocument documentX = new XWPFDocument(fis);            
            List<XWPFParagraph> pera =documentX.getParagraphs();
            ReadDocXFile(pera,filename);
        }
        else
        {
            System.out.println("format does not match");
        }

    }
    public void ReadDocFile(WordExtractor extractor,String filename)
    {

        for (String paragraph : extractor.getParagraphText()) {
            System.out.println("Paragraph: "+paragraph);
        }
    }
    public void ReadDocXFile(List<XWPFParagraph> extractor,String filename)
    {

        for (XWPFParagraph paragraph : extractor) {
          System.out.println("Question: "+paragraph.getParagraphText());
        }

    }
    public String fileExtension(String filename)
    {

       String extension = filename.substring(filename.lastIndexOf(".") + 1, filename.length());
       return extension;
    }

This code throws an exception when I try to read a docx file:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 2 more
Java Result: 1

Another problem occurs when I read doc files: some files are read fine, but for others I get an exception like this:

    Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The document is too old - Word 95 or older. Try HWPFOldDocument instead?
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1

I saw at http://poi.apache.org/hwpf/index.html that the POI API supports Word 6 and Word 95. Can anybody suggest a solution to these two problems?

4
  • 2
    The second one pretty much tells you what is wrong. I don't know the POI API, but HWPFDocument can only read Word documents newer than Word 95, so you should use HWPFOldDocument in your code instead. Commented Jul 12, 2013 at 14:26
  • I tried that. But when I write HWPFOldDocument document=new HWPFOldDocument(fis); it says "no suitable constructor found for HWPFOldDocument". I also couldn't find any documentation on HWPFOldDocument. Commented Jul 12, 2013 at 14:43
  • 2
    The first result that popped up when I Googled HWPFOldDocument: poi.apache.org/apidocs/org/apache/poi/hwpf/HWPFOldDocument.html Commented Nov 11, 2015 at 2:48
  • 1
    I think for the first exception you probably need the Apache XMLBeans jar file. Add it to your classpath and try again. Commented Jan 11, 2016 at 8:36

2 Answers 2

3

The solution to Problem 1 is to add the required core Maven dependencies:

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.15</version>
        </dependency>
        <!-- For .DOCX FILES -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.15</version>
        </dependency>
        <!-- For .DOC FILES (same version as the other POI artifacts) -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.15</version>
        </dependency>
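
To confirm that the dependencies actually ended up on the runtime classpath, a small check like the one below may help. It is only a sketch: the class name `ClasspathCheck` and the helper `isOnClasspath` are made up for illustration, and the class names probed are the ones that appear in the question's code and stack traces.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.poi.hwpf.HWPFDocument",           // used for .doc files
            "org.apache.poi.xwpf.usermodel.XWPFDocument", // used for .docx files
            "org.apache.xmlbeans.XmlException"            // the class missing in the first stack trace
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the last entry is reported as missing, the XMLBeans jar is not reaching the classpath, which would match the NoClassDefFoundError in the question.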

For Problem 2: judging from the original source code, HWPFDocument refuses documents that are too old (Word 95 or earlier):

  /**
   * This constructor loads a Word document from a specific point
   *  in a POIFSFileSystem, probably not the default.
   * Used typically to open embeded documents.
   *
   * @param directory The DirectoryNode that contains the Word document.
   * @throws IOException If there is an unexpected IOException from the passed
   *         in POIFSFileSystem.
   */
  public HWPFDocument(DirectoryNode directory) throws IOException
  {
    // Load the main stream and FIB
    // Also handles HPSF bits
    super(directory);

    // Is this document too old for us?
    if(_fib.getFibBase().getNFib() < 106) {
        throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
    }

Source code for HWPFDocument
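
The version check quoted above boils down to a single threshold comparison on the nFib value. As a rough, POI-free sketch (the class `WordFormatCheck`, the constant name, and the method `suggestedLoader` are hypothetical, and the returned strings are just the POI class names mentioned in this thread):

```java
public class WordFormatCheck {

    // Threshold copied from the HWPFDocument constructor quoted above
    static final int MIN_NFIB_FOR_HWPF = 106;

    /** Returns the name of the POI class that the quoted check points to for a given nFib. */
    public static String suggestedLoader(int nFib) {
        // Below the threshold, HWPFDocument throws OldWordFileFormatException
        // and its message suggests HWPFOldDocument instead.
        return nFib < MIN_NFIB_FOR_HWPF ? "HWPFOldDocument" : "HWPFDocument";
    }

    public static void main(String[] args) {
        System.out.println(suggestedLoader(105));
        System.out.println(suggestedLoader(106));
    }
}
```

In real code the same decision is usually made by catching OldWordFileFormatException and retrying with the older loader, rather than by reading nFib directly.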


0

Regarding your first issue, I guess you need to reference the dependencies in your project.

Namely, I guess:

poi-ooxml-schemas xmlbeans, which is in poi-ooxml-schemas-version-yyyymmdd.jar

(from the Apache POI page).

Here is the Apache XMLBeans page.

I'm not able to list every library you require, but you can probably figure it out through Maven.
