I am trying to parse HTML input using jsoup (v1.18.1), extract the elements, and replace the following characters in each attribute value:

  • > with &gt;

  • < with &lt;

The method I'm feeding the resulting document into cannot have these symbols inside attribute values.

Below is the code I'm using:

        Elements elements = htmlDocument.getAllElements();

        // Process each element's attributes
        for (Element element : elements) {
            // Iterate over all attributes of the element
            for (Attribute attribute : element.attributes()) {
                String originalValue = attribute.getValue();
                // Escape only '>' and '<' characters
                String escapedValue = escapeSpecificHtmlChars(originalValue);
                // Update the attribute with the escaped value
                element.attr(attribute.getKey(), escapedValue);
            }
        }

    private String escapeSpecificHtmlChars(final String input) {
        if (StringUtils.isBlank(input)) {
            return input;
        }
        // Replace only '>' with '&gt;' and '<' with '&lt;'
        return input.replace(">", "&gt;")
                .replace("<", "&lt;");
    }

Let's say the element is <span role="text" aria-label=">test value>">Test value!</span>

Attribute aria-label has the value ">test value>"

escapedValue would be "&gt;test value&gt;"

But when I set the attribute using element.attr(attribute.getKey(), escapedValue);, the attribute value becomes "&amp;gt;test value&amp;gt;"

I want the escapedValue to stay as is when I set it as the attribute value.
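
The double escaping can be reproduced without jsoup. The sketch below is a simplified model, not jsoup's actual implementation: serializeAttributeValue is a hypothetical stand-in for an HTML serializer, and it assumes only that the serializer escapes & and " when writing attribute values, which is consistent with the output shown above.

```java
// Simplified, jsoup-free model of the double escaping described above.
// serializeAttributeValue is a hypothetical helper, NOT a jsoup method.
class DoubleEscapeSketch {

    // Same replacement as in the question (null check instead of StringUtils).
    static String escapeSpecificHtmlChars(String input) {
        if (input == null) {
            return null;
        }
        return input.replace(">", "&gt;").replace("<", "&lt;");
    }

    // Assumption: a serializer escapes '&' and '"' when it writes an
    // attribute value, so any '&' already stored in the value is escaped again.
    static String serializeAttributeValue(String storedValue) {
        return storedValue.replace("&", "&amp;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String original = ">test value>";
        String escaped = escapeSpecificHtmlChars(original);

        // Value stored on the attribute after manual escaping
        System.out.println(escaped);                          // &gt;test value&gt;

        // Value as it appears once the serializer escapes the '&'
        System.out.println(serializeAttributeValue(escaped)); // &amp;gt;test value&amp;gt;
    }
}
```

Under this model, the manually escaped value already contains a literal &, so the serializer escapes it a second time. Whether the raw < and > would also be escaped on output depends on the serializer, so that is worth checking against the actual jsoup output before adding a manual step.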

Any help would be appreciated!

  • Are you sure your replacement is needed? It looks like it is done automatically when you set the attribute. Commented Sep 12, 2024 at 20:27
  • @talex Yes, the replacement is needed. As mentioned, the resulting Document is being fed into another method which errors out if '<' or '>' symbols are present inside attributes. Commented Sep 12, 2024 at 20:31
  • Please show the document you are parsing with jsoup, then show what you get as parsed content. And/or show what you construct as a document and what jsoup would send to the next parser. I second @talex and believe you are doing too much. Commented Sep 12, 2024 at 20:58
  • What format is the input of the next process expecting? Commented Sep 12, 2024 at 23:29
  • I think you misunderstood talex’s question. We all agree the characters need to be escaped. The question is: do you need to escape them? It is JSoup’s job, not yours, to make sure element content is escaped in the underlying HTML document. Commented Sep 13, 2024 at 13:32
