5

I'd like to replace all the tag-looking parts of a String that are not valid HTML tags. A tag-looking part is anything enclosed in <> brackets, e.g. <[email protected]> or <hello>, whereas <br>, <div>, and so on have to be kept.

Do you have any idea how to achieve this?

Any help is appreciated!

cheers,

balázs

2
  • replace or remove? Please show expected output. Commented Jan 14, 2011 at 13:49
  • "one two three <blabla> four <text> five <div class="bold">six</div>" to "one two three four five <div class="bold">six</div>" - so replace to an empty String. Commented Jan 14, 2011 at 13:58

4 Answers

9

You can use JSoup to clean HTML.

String cleaned = Jsoup.clean(html, Whitelist.relaxed());

You can either use one of the predefined Whitelists or create a custom one that specifies which HTML elements and attributes are allowed through the cleaner. Everything else is removed. (In jsoup 1.14.2 and later the class is named Safelist; Whitelist was deprecated and subsequently removed.)


Your specific example would be:

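import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// relaxed() already allows <div>; the class attribute has to be allowed explicitly
String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
System.out.println(cleaned);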

Output:

one two three  four  five 
<div class="bold">
 six
</div>

0

Have a look at the java.util.Scanner class: you can set a delimiter and then check whether each token matches an HTML tag or not. You will have to build an array of the tag names that should be left alone.
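The idea of checking each bracketed token against a hand-built list can be sketched roughly as follows. This is only a minimal sketch, and it uses java.util.regex rather than Scanner because a regex makes it easy to capture the candidate tag name. The class name TagFilter, the method stripUnknownTags, and the short ALLOWED list are hypothetical and would need to be extended with whatever tags you want to keep.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFilter {
    // Hypothetical, deliberately short allowlist; extend it with the tags you need.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("a", "b", "br", "div", "i", "p", "span"));

    // "<", an optional "/", the candidate tag name (no whitespace, brackets or slash),
    // then anything else up to the closing ">".
    private static final Pattern TAG_LIKE = Pattern.compile("<(/?)([^\\s<>/]*)([^<>]*)>");

    static String stripUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            // Keep the whole token if its name is allowed, otherwise replace it with "".
            String replacement = ALLOWED.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
        System.out.println(stripUnknownTags(html));
    }
}
```

Note that this only looks at the tag name. It treats anything between < and > as tag-looking, so text such as "a < b and c > d" would also lose its bracketed part, and it does not validate attributes at all.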

3 Comments

I did not want to build the array myself; I was rather looking for an already existing enum, similar to download.oracle.com/javase/1.4.2/docs/api/javax/swing/text/html/…
Something similar to this post, then -> stackoverflow.com/questions/240546/…
Yes, I've also seen that. My problem is similar, except that I don't want to strip the HTML tags but keep them.
0

You may also want to include end tags in your comparison algorithm. That means looking for the forward slash that marks an HTML end tag and stripping it before you compare the tag name.
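As a rough illustration of that normalization step, the helper below reduces a tag-looking token to a bare, lower-cased name before it is compared against a list of known tags. It is a sketch only: the class TagNames and the method nameOf are hypothetical names, and the code assumes the token it receives already starts with "<" and ends with ">".

```java
import java.util.Locale;

class TagNames {
    // Returns the bare, lower-cased element name of a token such as "</DIV>" or "<br/>".
    // Assumption: the token starts with '<' and ends with '>'.
    static String nameOf(String tag) {
        // Drop the enclosing angle brackets.
        String inner = tag.substring(1, tag.length() - 1).trim();
        // Strip the forward slash that marks an end tag.
        if (inner.startsWith("/")) {
            inner = inner.substring(1).trim();
        }
        // The name ends at the first whitespace (attributes follow)
        // or at a slash (self-closing form such as <br/>).
        int end = 0;
        while (end < inner.length()
                && !Character.isWhitespace(inner.charAt(end))
                && inner.charAt(end) != '/') {
            end++;
        }
        return inner.substring(0, end).toLowerCase(Locale.ROOT);
    }
}
```

With this in place, "<div class=\"bold\">" and "</div>" both normalize to the same name, so a single lookup covers start and end tags.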


0

If you are doing this in order to display untrusted data on a web page, simply removing invalid tags is not enough. Take a look at OWASP AntiSamy.

1 Comment

Thanks for the hint, I'm going to have a look at it, but this time I'd simply like to remove them. No more, no less.
