How to parse text without nested html elements using Jericho?

Question

Using Jericho, I need to parse something like this:

<html>
<div class="title">
    Spoon bows
    <br/>
    <span>
        A Matrix scene.
        <br/>
        Matrix 1
    </span>
</div>
</html>

I want to parse "Spoon bows", but I get the whole content within the <div> tag using the following code:

List<Element> list = item.getAllElementsByClass("title");
if (list != null && !list.isEmpty()) {
    Element title = list.get(0);
    // Extracts the text of the whole <div>, including the nested <span>.
    String text = title.getContent().getTextExtractor().toString();
}

Sorry for the unformatted code snippet; I can't get it right, even though I indent with 4 spaces.
– AndaluZ, Commented Mar 20, 2012 at 21:52
The text editor has a "code" formatter. It will automatically indent your code by 4 spaces.
– Soviut, Commented Oct 25, 2012 at 6:32

heejong · Accepted Answer · 2012-11-13 07:07:58Z

6

This should help you:

private String getTextContent(Element elem) {
    String text = elem.getContent().toString();

    final List<Element> children = elem.getChildElements();
    for (Element child : children) {
        text = text.replace(child.toString(), "");
    }
    return text;
}
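
Because the method above removes children by string matching, its result depends on what else in the content happens to match a child's source text. A minimal sketch of an offset-based alternative follows, in plain Java so it runs without Jericho. The class name OwnTextSketch, the method ownText, and the int[][] span representation are hypothetical. With Jericho, each span could presumably be derived from child.getBegin() and child.getEnd() minus elem.getContent().getBegin(), assuming getChildElements() returns the children in document order.

```java
final class OwnTextSketch {

    /**
     * Returns the parts of content that lie outside the given child spans.
     * Each span is {begin, end}, half-open, relative to the start of content.
     * Assumption: spans are sorted by begin and do not overlap.
     */
    static String ownText(String content, int[][] childSpans) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] span : childSpans) {
            sb.append(content, pos, span[0]); // text before this child
            pos = span[1];                    // skip the child itself
        }
        sb.append(content, pos, content.length()); // text after the last child
        return sb.toString();
    }

    public static void main(String[] args) {
        String content = "Spoon bows<br/><span>A Matrix scene.<br/>Matrix 1</span>";
        int br = content.indexOf("<br/>");
        int span = content.indexOf("<span>");
        int[][] children = {
            {br, br + "<br/>".length()},
            {span, content.length()}
        };
        System.out.println(ownText(content, children)); // prints "Spoon bows"
    }
}
```

Cutting by position removes exactly the ranges the parser reported for the children and nothing else, so identical text elsewhere in the content is left untouched.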

edited Nov 13, 2012 at 7:07

answered Nov 13, 2012 at 2:44


This will break for some cases, e.g. <a>A text<b>A text</b></a>
– Mene, Commented over a year ago

Community · Accepted Answer · 2017-05-23 11:55:56Z

1

Maybe you could iterate over the child elements of your title node.

Take a look at this question: How to iterate over plain text segments with the Jericho HTML parser

edited May 23, 2017 at 11:55


answered Oct 25, 2012 at 6:31

mkhelif

