I have used the following code to extract text from .odt files:

public class OpenOfficeParser {

    StringBuffer TextBuffer;

    public OpenOfficeParser() {}

    // Process text elements recursively
    public void processElement(Object o) {

        if (o instanceof Element) {

            Element e = (Element) o;
            String elementName = e.getQualifiedName();

            if (elementName.startsWith("text")) {

                if (elementName.equals("text:tab")) // add tab for text:tab
                    TextBuffer.append("\\t");
                else if (elementName.equals("text:s")) // add space for text:s
                    TextBuffer.append(" ");
                else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();

                    while (iterator.hasNext()) {

                        Object child = iterator.next();
                        // If the child is a Text node, append its text
                        if (child instanceof Text) {
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        }
                        else
                            processElement(child); // Recursively process the child element
                    }
                }
                if (elementName.equals("text:p"))
                    TextBuffer.append("\\n");
            }
            else {
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);
                }
            }
        }
    }

    public String getText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();

        // Unzip the OpenOffice document
        ZipFile zipFile = new ZipFile(fileName);
        Enumeration entries = zipFile.entries();
        ZipEntry entry;

        while (entries.hasMoreElements()) {
            entry = (ZipEntry) entries.nextElement();

            if (entry.getName().equals("content.xml")) {

                TextBuffer = new StringBuffer();
                SAXBuilder sax = new SAXBuilder();
                Document doc = sax.build(zipFile.getInputStream(entry));
                Element rootElement = doc.getRootElement();
                processElement(rootElement);
                break;
            }
        }

        System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
        return TextBuffer.toString();
    }
}

My problem occurs when I use the string returned by the getText() method. I ran the program on an .odt file; here is part of the extracted text:

(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

So I tried this

System.out.println( TextBuffer.toString().split("\\n")); 

The output I received was:

substring: [Ljava.lang.String;@505bb829

I also tried this:

System.out.println( TextBuffer.toString().trim() );

But the printed string did not change.

Why does this happen? How can I parse that string correctly? And if I wanted to store each substring that ends with "\n\n" in array[i], how could I do that?

Edit: Sorry, I made a mistake in the example because I forgot that split() returns an array. The real problem is that it returns an array with a single element, so what I'm asking is why this:

System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));

has no effect on the string from my example.

Also this:

    System.out.println( TextBuffer.toString().trim() );

has no effect on the original string; it just prints the original string.

To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example.

My original string:

    (no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

After parsing, I would print each element of the array, and the output should be:

line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.....
  • I don't get your question, but split() returns an array, not a String. Try this: System.out.println(Arrays.toString(TextBuffer.toString().split("\\n"))); Commented Jun 6, 2013 at 18:09
  • Please don't name variables with a leading uppercase character; it's confusing to Java devs who follow standard Java naming conventions. Commented Jun 6, 2013 at 18:10
  • Are you sure there isn't an existing library which could do the dirty job for you? Commented Jun 6, 2013 at 18:12
  • What do you need, exactly? Commented Jun 6, 2013 at 18:17
  • Try System.out.println(TextBuffer.toString().indexOf("\n")); When you join strings, interesting things can happen with the backslashes... Commented Jun 6, 2013 at 18:47

1 Answer

If I understood your question correctly, I would do something like this:

String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";

// Split on real line breaks: the regex \n matches a line feed character
List<String> al = new ArrayList<String>(Arrays.asList(str.split("\\n")));

al.removeAll(Arrays.asList("", null)); // remove empty or null strings

for (int i = 0; i< al.size(); i++) {
    System.out.println("Line " + i + " : " + al.get(i).trim());
}

Output

Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
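
Note that the snippet above assumes the string contains real line breaks. In the question, processElement() appends "\\n", which in Java source is a backslash followed by the letter n, so the extracted text probably contains that two-character sequence rather than a line feed. That would explain why split("\\n") returns a single element and why trim() changes nothing. Below is a minimal sketch under that assumption; the class name SplitDemo and the helper splitOnLiteralBackslashN are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {

    // Splits on the two-character sequence backslash + 'n' (not a real line break),
    // trims each piece and drops the empty ones.
    static List<String> splitOnLiteralBackslashN(String text) {
        List<String> lines = new ArrayList<>();
        // The Java literal "\\\\n" is the regex \\n: a literal backslash followed by 'n'.
        for (String part : text.split("\\\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: this mimics what processElement() builds by appending "\\n".
        String extracted = "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";

        // The regex \n matches a line feed, which this string does not contain,
        // so the result is a single-element array.
        System.out.println(extracted.split("\\n").length);

        List<String> lines = splitOnLiteralBackslashN(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

Alternatively, if you can change the parser, appending "\n" and "\t" (single backslash) in processElement() would put real line breaks and tabs into the buffer, and the split("\\n") version above would then work as is.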