Parse string, using default methods

Question

I have used the following code to extract text from .odt files:

public class OpenOfficeParser {

StringBuffer TextBuffer;

public OpenOfficeParser() {}

//Process text elements recursively
public void processElement(Object o) {

    if (o instanceof Element) {

        Element e = (Element) o;
        String elementName = e.getQualifiedName();

        if (elementName.startsWith("text")) {

            if (elementName.equals("text:tab")) // add tab for text:tab
                TextBuffer.append("\\t");
            else if (elementName.equals("text:s"))  // add space for text:s
                TextBuffer.append(" ");
            else {
                List children = e.getContent();
                Iterator iterator = children.iterator();

                while (iterator.hasNext()) {

                    Object child = iterator.next();
                    //If Child is a Text Node, then append the text
                    if (child instanceof Text) { 
                        Text t = (Text) child;
                        TextBuffer.append(t.getValue());
                    }
                    else
                    processElement(child); // Recursively process the child element                   
                }                   
            }
            if (elementName.equals("text:p"))
                TextBuffer.append("\\n");                   
        }
        else {
            List non_text_list = e.getContent();
            Iterator it = non_text_list.iterator();
            while (it.hasNext()) {
                Object non_text_child = it.next();
                processElement(non_text_child);                   
            }
        }               
    }
}

public String getText(String fileName) throws Exception {
    TextBuffer = new StringBuffer();

    //Unzip the openOffice Document
    ZipFile zipFile = new ZipFile(fileName);
    Enumeration entries = zipFile.entries();
    ZipEntry entry;

    while(entries.hasMoreElements()) {
        entry = (ZipEntry) entries.nextElement();

        if (entry.getName().equals("content.xml")) {

            TextBuffer = new StringBuffer();               
            SAXBuilder sax = new SAXBuilder();
            Document doc = sax.build(zipFile.getInputStream(entry));
            Element rootElement = doc.getRootElement();
            processElement(rootElement);
            break;
        }
    }    


    System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
    return TextBuffer.toString();
}
}

My problem occurs when I use the string returned by the getText() method. I ran the program and extracted some text from an .odt file; here is a piece of the extracted text:

(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

So I tried this:

System.out.println( TextBuffer.toString().split("\\n"));

the output I received was:

substring: [Ljava.lang.String;@505bb829

I also tried this:

System.out.println( TextBuffer.toString().trim() );

but the printed string did not change.

Why does this happen? What can I do to parse that string correctly? And if I wanted to add each substring that ends with "\n\n" to array[i], how could I do that?

Edit: Sorry, I made a mistake in the example because I forgot that split() returns an array. The problem is that it returns an array with only one element, so what I'm asking is why doing this:

System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));

has no effect on the string I wrote in the example.

Also this:

    System.out.println( TextBuffer.toString().trim() );

has no effect on the original string; it just prints the original string.

To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example.

My original string:

    (no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

After parsing, I would print each element of the array, and the output should be:

line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.....

I don't get your question, but split() returns an array, not a String. Try this: System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
– acdcjunior, Commented Jun 6, 2013 at 18:09
Please don't name variables with a leading uppercase character; it's confusing to Java devs who follow standard Java naming conventions.
– Dave Newton, Commented Jun 6, 2013 at 18:10
Are you sure there isn't an existing library that could do the dirty job for you?
– fge, Commented Jun 6, 2013 at 18:12
Try System.out.println(TextBuffer.toString().indexOf("\n")); When you join strings, interesting things can happen with the backslashes...
– Balint Bako, Commented Jun 6, 2013 at 18:47
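
The checks suggested in these comments can be tried in a short sketch. It assumes that the extracted text contains a literal backslash followed by n (two characters), which is what processElement() in the question appends with "\\n", rather than a real line feed. The class name SplitDiagnostics, the helper hasRealNewline, and the sample string are hypothetical and used only for illustration.

```java
import java.util.Arrays;

class SplitDiagnostics {

    // True only if the text contains a real line-feed character (U+000A).
    static boolean hasRealNewline(String text) {
        return text.indexOf('\n') >= 0;
    }

    public static void main(String[] args) {
        // Assumed sample: "\\n" in Java source is backslash + 'n', not a line feed.
        String extracted = "(no hi virtual x oy)\\n\\n house cat \\n open it";

        // The regex \n matches a line feed only, so nothing is split here.
        String[] parts = extracted.split("\\n");

        System.out.println(parts);                     // array type and hash code, not the contents
        System.out.println(Arrays.toString(parts));    // the contents: one unsplit element
        System.out.println(parts.length);              // 1
        System.out.println(hasRealNewline(extracted)); // false
    }
}
```

If hasRealNewline returns false for the real extracted text, the string holds no line feeds at all, which would explain why split("\\n") returns a single element.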

Smit · Accepted Answer · 2013-06-06 19:56:49Z

If I understood your question correctly, I would do something like this:

// Requires: java.util.ArrayList, java.util.Arrays, java.util.List
String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";

// Split on line feeds and copy into a mutable list
List<String> al = new ArrayList<String>(Arrays.asList(str.split("\\n")));

al.removeAll(Arrays.asList("", null)); // remove empty or null strings

for (int i = 0; i < al.size(); i++) {
    System.out.println("Line " + i + " : " + al.get(i).trim());
}

Output

Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
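
The snippet above splits a string that contains real line feeds. The parser in the question appends "\\n", which is a backslash followed by n, so the text returned by getText() probably contains that two-character sequence instead. Under that assumption, a minimal sketch that splits on the literal sequence could look like this. The class name LiteralNewlineSplitter and the method splitLines are hypothetical, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LiteralNewlineSplitter {

    // Splits on the two-character sequence backslash + 'n', trims each piece,
    // and drops pieces that are empty after trimming.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        // Pattern.quote makes the regex engine treat the backslash literally.
        for (String piece : text.split(Pattern.quote("\\n"))) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample mirroring the question: literal backslash + 'n' separators.
        String extracted = "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";
        List<String> lines = splitLines(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

An alternative is to change the parser so that it appends "\n" and "\t" instead of "\\n" and "\\t"; the split("\\n") call shown above then works unchanged.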

edited Jun 6, 2013 at 19:56

answered Jun 6, 2013 at 19:49

Smit
