
Using Jericho, I need to parse something like this:

<html>
<div class="title">
    Spoon bows
    <br/>
    <span>
        A Matrix scene.
        <br/>
        Matrix 1
    </span>
</div>
</html>

I want to extract only "Spoon bows", but the following code returns the text of the entire <div>, including the nested <span>:

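List<Element> list = item.getAllElementsByClass("title");
// Guard against an empty result: get(0) would throw on an empty list
if (list != null && !list.isEmpty()) {
    Element title = list.get(0);
    // Extracts the text of the whole element, nested child elements included
    String text = title.getContent().getTextExtractor().toString();
}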
  • Sorry for the unformatted code snippet, I somehow can't get it right, though I use 4 spaces and such... Commented Mar 20, 2012 at 21:52
  • The text editor has a "code" formatter. It will automatically indent your code by 4 spaces. Commented Oct 25, 2012 at 6:32

2 Answers

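This helper should do it. It takes the source of the element's content and removes the source of each direct child element, leaving only the element's own text (surrounding whitespace included):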

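private String getTextContent(Element elem) {
    // Raw source of everything between the element's start and end tags
    String text = elem.getContent().toString();

    // Strip the source of every direct child element from that content
    final List<Element> children = elem.getChildElements();
    for (Element child : children) {
        text = text.replace(child.toString(), "");
    }
    return text;
}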

1 Comment

This will break for some cases, e.g. <a>A text<b>A text</b></a>
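
One way to avoid depending on string matching at all is to cut the child elements out by character offset instead of by their text. The sketch below is not Jericho code: `OwnText`, `Span`, and `ownText` are hypothetical names used only for illustration, and the offsets are passed in by hand. With Jericho, the equivalent offsets would presumably come from `getBegin()` and `getEnd()` on the element's content segment and on each child element. It also assumes the children are listed in document order and do not overlap.

```java
import java.util.Arrays;
import java.util.List;

public class OwnText {

    /** Half-open character range [begin, end) within the source text. */
    public static final class Span {
        final int begin;
        final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Returns the text inside `content` that is not covered by any child span.
     * Assumes `children` are direct children in document order, non-overlapping.
     */
    public static String ownText(String source, Span content, List<Span> children) {
        StringBuilder sb = new StringBuilder();
        int pos = content.begin;
        for (Span child : children) {
            // Keep the text between the previous child and this one
            sb.append(source, pos, child.begin);
            pos = child.end;
        }
        // Keep whatever follows the last child
        sb.append(source, pos, content.end);
        // Collapse runs of whitespace and trim the ends
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<a>A text<b>A text</b></a>";
        Span content = new Span(html.indexOf(">") + 1, html.indexOf("</a>"));
        Span b = new Span(html.indexOf("<b>"), html.indexOf("</b>") + "</b>".length());
        System.out.println(ownText(html, content, Arrays.asList(b)));
    }
}
```

Because only positions are used, text that happens to repeat inside and outside a child element cannot confuse the result.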

Maybe you could iterate over the child elements of your title node.

Take a look at this question: How to iterate over plain text segments with the Jericho HTML parser
