I have used the following code to extract text from .odt files:
public class OpenOfficeParser {
    StringBuffer TextBuffer;
    public OpenOfficeParser() {}
    //Process text elements recursively
    public void processElement(Object o) {
        if (o instanceof Element) {
            Element e = (Element) o;
            String elementName = e.getQualifiedName();
            if (elementName.startsWith("text")) {
                if (elementName.equals("text:tab")) // add tab for text:tab
                    TextBuffer.append("\\t");
                else if (elementName.equals("text:s")) // add space for text:s
                    TextBuffer.append(" ");
                else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();
                    while (iterator.hasNext()) {
                        Object child = iterator.next();
                        //If Child is a Text Node, then append the text
                        if (child instanceof Text) {
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        }
                        else
                            processElement(child); // Recursively process the child element
                    }
                }
                if (elementName.equals("text:p"))
                    TextBuffer.append("\\n");
            }
            else {
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);
                }
            }
        }
    }
    public String getText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();
        //Unzip the openOffice Document
        ZipFile zipFile = new ZipFile(fileName);
        Enumeration entries = zipFile.entries();
        ZipEntry entry;
        while(entries.hasMoreElements()) {
            entry = (ZipEntry) entries.nextElement();
            if (entry.getName().equals("content.xml")) {
                TextBuffer = new StringBuffer();
                SAXBuilder sax = new SAXBuilder();
                Document doc = sax.build(zipFile.getInputStream(entry));
                Element rootElement = doc.getRootElement();
                processElement(rootElement);
                break;
            }
        }
        System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
        return TextBuffer.toString();
    }
}
My problem occurs when I use the string returned by the getText() method.
I ran the program on an .odt file; here is part of the extracted text:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
So I tried this:
System.out.println( TextBuffer.toString().split("\\n"));
the output I received was:
substring: [Ljava.lang.String;@505bb829
I also tried this:
System.out.println( TextBuffer.toString().trim() );
but the printed string did not change.
Why does this happen? How can I parse that string correctly? And if I wanted to store each substring that ends with "\n\n" in array[i], how could I do that?
Edit:
Sorry, I made a mistake in the example: I forgot that split() returns an array.
The real problem is that it returns an array with a single element, so what I'm asking is why this:
System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
has no effect on the string from my example.
Also this:
System.out.println( TextBuffer.toString().trim() );
has no effect on the original string either; it just prints the original string.
To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example:
My original string:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
After parsing, I would print each element of the array, and the output should be:
line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.....
split() returns an array, not a String. Try this:
System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
System.out.println(TextBuffer.toString().indexOf("\n"));
When you join strings, interesting things can happen with the backslashes...
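To make the backslash point concrete: processElement() appends "\\n", which in Java is a backslash followed by the letter n, not a line break. If the extracted text really contains those two characters (the sample output suggests it does), then split("\\n") looks for a real newline, finds none, and returns the whole string as a single element; trim() likewise has no whitespace to remove at the ends. The sketch below assumes that diagnosis and splits on the literal two-character sequence instead. The class name EscapedNewlineDemo and the helper splitOnLiteralBackslashN are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedNewlineDemo {

    // Hypothetical helper: splits on the two-character sequence backslash + 'n'
    // (not a real line break), trims each piece and drops the empty ones.
    static List<String> splitOnLiteralBackslashN(String text) {
        List<String> lines = new ArrayList<>();
        // The regex \\n matches a literal backslash followed by 'n';
        // written as a Java string literal that becomes "\\\\n".
        for (String part : text.split("\\\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same shape as the sample in the question: literal backslash-n pairs.
        String extracted = "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to";

        // A result of -1 means the string holds no real newline character.
        System.out.println(extracted.indexOf("\n"));

        // Splitting on a real newline leaves everything in one element.
        System.out.println(extracted.split("\\n").length);

        List<String> lines = splitOnLiteralBackslashN(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

The other option is to fix the source of the problem: append "\n" and "\t" (single backslash) in processElement(), so the buffer holds real line breaks and tabs. After that change, split("\\n") or split("\n") should behave the way the question expects.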