Parsing html in Jsoup

Question

I am trying to parse HTML tags using jsoup, and I am new to it. I need to parse the tags, get the text inside each one, and apply the style named in its class attribute.

I am creating a SpannableStringBuilder so that I can create substrings, apply styles to them, and append them together with the text that has no style.

    String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";

    SpannableStringBuilder text = new SpannableStringBuilder();
    if (value.contains("</span>")) {
        Document document = Jsoup.parse(value);
        Elements elements = document.getElementsByTag("span");
        if (elements != null) {
            int i = 0;
            int start = 0;
            for (Element ele : elements) {
                String styleName = type + "." + ele.attr("class");
                text.append(ele.text());
                int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
                text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
                text.append(ele.nextSibling().toString());
                start = text.length();
                i++;
            }
        }
        return text;
    }

I am not sure how to parse the strings that are not inside any tag, such as "There are" and "workers from the".

I need output such as:

- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>

@KrystianG: thanks for that. From a node, how can I get the text with the HTML stripped, like the text "two"?
– user2234, Commented Jan 10, 2020 at 0:04

Krystian G · Accepted Answer · 2020-01-10 12:27:43Z

Full answer: you can get the text outside the tags by calling childNodes(), which returns a List<Node>. Note that I'm selecting body because your HTML fragment has no parent element, and jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is of type TextNode and you can get its content using toString().
Otherwise you can cast it to Element and get the text using element.text().

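    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
    Document doc = Jsoup.parse(str);
    // jsoup wraps the fragment in <html><body>, so the top-level nodes are children of <body>
    Element body = doc.selectFirst("body");
    List<Node> childNodes = body.childNodes();
    for (int i = 0; i < childNodes.size(); i++) {
        Node node = childNodes.get(i);
        if (node instanceof TextNode) {
            // plain text that sits outside any tag
            System.out.println(i + " -> " + node.toString());
        } else {
            // a <span>: text() returns its content without the markup
            Element element = (Element) node;
            System.out.println(i + " -> " + element.text());
        }
    }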

output:

0 -> 
There are 
1 -> two
2 ->  workers from the 
3 -> Front of House

By the way: I don't know how to get rid of the first line break before There are.
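
One possible way around that line break, offered as a sketch rather than a confirmed fix: Node.toString() returns the node's outer HTML, which passes through jsoup's pretty-printer, and that is likely where the extra newline comes from. TextNode.getWholeText() returns the raw text instead. The class name ChildNodeText and the helper splitIntoSegments are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ChildNodeText {

    /** Hypothetical helper: returns the text of each direct child of body, in document order. */
    static List<String> splitIntoSegments(String html) {
        Document doc = Jsoup.parse(html);
        List<String> segments = new ArrayList<>();
        for (Node node : doc.body().childNodes()) {
            if (node instanceof TextNode) {
                // getWholeText() gives the raw text, without the formatting
                // that toString() (outer HTML) may add
                segments.add(((TextNode) node).getWholeText());
            } else if (node instanceof Element) {
                // text() strips the markup and trims the result
                segments.add(((Element) node).text());
            }
            // other node types (comments, etc.) are skipped in this sketch
        }
        return segments;
    }

    public static void main(String[] args) {
        String str = "There are <span class='newStyle'> two </span> workers from the "
                + "<span class='oldStyle'>Front of House</span>";
        List<String> segments = splitIntoSegments(str);
        for (int i = 0; i < segments.size(); i++) {
            // brackets make leading and trailing spaces visible
            System.out.println(i + " -> [" + segments.get(i) + "]");
        }
    }
}
```

The text nodes keep their surrounding spaces here, which matters when the pieces are appended back together. If collapsed whitespace is preferred, TextNode.text() could be used in place of getWholeText().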

edited Jan 10, 2020 at 12:27

answered Jan 10, 2020 at 12:17

Krystian G

1 Comment

user2234 Over a year ago

Thanks, I used Parser.xmlParser() to avoid the tags that jsoup adds.
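
A minimal sketch of what the XML-parser approach from this comment might look like; the commenter's actual code is not shown, so this is an assumption. With Parser.xmlParser(), jsoup does not add <html> or <body>, so the top-level nodes are direct children of the Document. The Segment class and the parseSegments helper are hypothetical names introduced only to carry the class attribute alongside each piece of text.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class XmlFragmentSegments {

    /** Hypothetical holder: a piece of text and the class of the tag it came from (null for untagged text). */
    static final class Segment {
        final String text;
        final String styleClass;

        Segment(String text, String styleClass) {
            this.text = text;
            this.styleClass = styleClass;
        }
    }

    static List<Segment> parseSegments(String fragment) {
        // The XML parser leaves the fragment unwrapped: no <html>/<body> is inserted
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        List<Segment> segments = new ArrayList<>();
        for (Node node : doc.childNodes()) {
            if (node instanceof TextNode) {
                // text outside any tag: no style to apply
                segments.add(new Segment(((TextNode) node).getWholeText(), null));
            } else if (node instanceof Element) {
                Element element = (Element) node;
                // tagged text: remember the class so a style can be looked up later
                segments.add(new Segment(element.text(), element.attr("class")));
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String str = "There are <span class='newStyle'> two </span> workers from the "
                + "<span class='oldStyle'>Front of House</span>";
        for (Segment segment : parseSegments(str)) {
            System.out.println("[" + segment.text + "] class=" + segment.styleClass);
        }
    }
}
```

On Android, each Segment could then be appended to the SpannableStringBuilder from the question, with a TextAppearanceSpan set over the appended range whenever styleClass is not null.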
