8

I have a Word document (.docx file).

As you can see in the Word document, there are a number of questions with bullet points. Right now I am trying to extract each paragraph from the file using Apache POI. Here is my current code:

    public static String readDocxFile(String fileName) {
        try {
            File file = new File(fileName);
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument document = new XWPFDocument(fis);

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            String whole = "";
            for (XWPFParagraph para : paragraphs) {
                System.out.println(para.getText());
                whole += "\n" + para.getText();
            }
            fis.close();
            document.close();
            return whole;
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
    }

The problem with the above method is that it prints each line instead of each paragraph. The bullet points are also missing from the extracted string whole, which is returned as plain text.

Can anyone explain what I am doing wrong? Please also suggest a better approach if you have one.

  • What are you trying to achieve as an end result? Commented Feb 5, 2018 at 11:15
  • @hovanessyan I am trying to get each paragraph or question as a separate string. Basically, I am trying to convert this docx file into an array of strings where each string is a paragraph. Commented Feb 5, 2018 at 11:36
  • There could be multiple ways to achieve an end result - that's why I am asking what's the desired outcome. You're writing a program to solve a problem, not to have an array full of strings - what is the problem you're trying to solve? Commented Feb 5, 2018 at 11:42
  • Which version of Apache POI do you refer to in your classpath / project? Commented Feb 6, 2018 at 12:53
  • For reference: poi.apache.org/changes.html Commented Feb 6, 2018 at 12:59

2 Answers

1

Your code is correct. I ran it on my system and it returns every paragraph. I think the problem lies in how the content was written in the docx file: whenever I write content in a bullet point and press the Enter key, that breaks the current bullet point, and the code above treats the broken-off line as a separate paragraph.

Below is a code sample that may be useful to you. I am using a Set data structure to ignore duplicate questions from the docx.

The Apache POI dependency is:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.7</version>
</dependency>

Code sample:

package com;

import java.io.File;
import java.io.FileInputStream;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class App {

    public static void main(String... strings) throws Exception {
        Set<String> bulletPoints = fileExtractor();
        bulletPoints.forEach(System.out::println);
    }

    public static Set<String> fileExtractor() throws Exception {
        // LinkedHashSet drops duplicate paragraphs but keeps document order.
        Set<String> bulletPoints = new LinkedHashSet<>();
        File file = new File("/home/deskuser/Documents/query.docx");
        // try-with-resources closes the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(file)) {
            XWPFDocument document = new XWPFDocument(fis);
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                String text = para.getText();
                System.out.println(text);
                // Skip empty paragraphs.
                if (text != null && !text.isEmpty()) {
                    bulletPoints.add(text);
                }
            }
            return bulletPoints;
        } catch (Exception e) {
            e.printStackTrace();
            throw new Exception("error while extracting file.", e);
        }
    }
}

2 Comments

Thanks for the answer. I have tried it, but it does the same as my code above. Have you tried testing with the provided file?
I have tested the code with your provided file, and it works as Apache POI is designed to. But if you want to separate each question along with its answer, you have to make some changes in your docs when writing your questions and answers, and then separate them programmatically while reading.
-1

I couldn't find which version of Apache POI you are using. If it's the latest version (3.17), the XWPFParagraph object used in your code has a getNumFmt() method. According to the Apache POI documentation (https://poi.apache.org/apidocs/org/apache/poi/xwpf/usermodel/XWPFParagraph.html), this method returns the string "bullet" if the paragraph starts with a bullet. So regarding the second point of your question (what happens to the bullets), you can resolve it with something like the following:

import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class TestPoi {

    private static final String BULLET = "•";

    private static final String NEWLINE = "\n";

    public static void main(String... args) {
        String test = readDocxFile("/home/william/Downloads/anesthesia.docx");
        System.out.println(test);
    }

    public static String readDocxFile(String fileName) {
        File file = new File(fileName);
        // try-with-resources closes the stream and the document even on failure.
        try (FileInputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis)) {

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            StringBuilder whole = new StringBuilder();
            for (XWPFParagraph para : paragraphs) {
                // Re-add the bullet character that getText() does not include.
                if ("bullet".equals(para.getNumFmt())) {
                    whole.append(BULLET);
                }
                whole.append(para.getText());
                whole.append(NEWLINE);
            }
            return whole.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
    }
}

Regarding your first point, what is the expected output? I ran your code with the provided docx and apart from the missing bullets you mentioned, it looked okay stepping through with the debugger.
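
If the goal is one string per question, as the comments on the question suggest, one possible approach is to start a new entry at every paragraph that carries a numbering format and to append the un-numbered paragraphs that follow to that entry. The following is only a minimal sketch of that grouping step, written in plain Java without POI so it can run on its own. The names ParagraphGrouper, Para and groupByListItem are hypothetical, and the numFmt field stands in for the value of para.getNumFmt(), which I am assuming is null for paragraphs that are not list items. Whether this matches your document depends on how the questions were typed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphGrouper {

    /** Hypothetical stand-in for the two values read from each XWPFParagraph. */
    static final class Para {
        final String numFmt; // assumed: result of para.getNumFmt(), null if not a list item
        final String text;   // result of para.getText()

        Para(String numFmt, String text) {
            this.numFmt = numFmt;
            this.text = text;
        }
    }

    /**
     * Starts a new entry at every list item and appends the following
     * non-list paragraphs to it. Blank paragraphs are skipped.
     */
    static List<String> groupByListItem(List<Para> paragraphs) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;
        for (Para p : paragraphs) {
            String text = p.text == null ? "" : p.text.trim();
            if (text.isEmpty()) {
                continue;
            }
            if (p.numFmt != null || current == null) {
                // A list item (or the very first paragraph) opens a new entry.
                if (current != null) {
                    entries.add(current.toString());
                }
                current = new StringBuilder(text);
            } else {
                // A plain paragraph continues the entry that is currently open.
                current.append('\n').append(text);
            }
        }
        if (current != null) {
            entries.add(current.toString());
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Para> paragraphs = Arrays.asList(
            new Para("bullet", "First question?"),
            new Para(null, "continuation of the first question"),
            new Para(null, ""),
            new Para("bullet", "Second question?"));
        for (String entry : groupByListItem(paragraphs)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

With POI you would build the input list inside the loop over document.getParagraphs(), passing para.getNumFmt() and para.getText() for each paragraph. Testing numFmt against null rather than against "bullet" is a deliberate choice here so that numbered lists are grouped the same way.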
