Delete specific HTML tags in String

Question

I want to delete HTML tags (that are defined in an array) from a string. My approach:

public String cleanHTML(String unsafe,String[] blacklist){
   String safe = "";
   for(String s:blacklist){
      safe =unsafe.replaceAll("\\<.{0,1}"+s+".*?>", "");
   }

   return safe;}

To test my function I use the following main method:

public static void main(String a[]){
    StringParser sp = new StringParser();
    String[] blacklist = new String[]{"img","a"};

    System.out.println( sp.cleanHTML("<p class='p1'>paragraph</p><img></img>< this is not html > <A HREF='#'>Link</A><a link=''>another link</a> <![CDATA[<sender>John Doe</sender>]]>",blacklist));

}

Output:

<p class='p1'>paragraph</p><img></img>< this is not html > <A href='#'>Link</A> <![CDATA[<sender>John Doe</sender>]]>another link

As you can see, it only strips the tags around "another link". So I basically have two questions: 1.) how can I get my regex to replace every < a > regardless of whether it is lower or upper case, and 2.) how can I get my code to delete every blacklisted tag, not only the last one in the array?

Thanks in advance.

Better to use an HTML parser than a regex.

– Fildor, Commented Jan 20, 2015 at 13:27

unsafe.replaceAll does not modify unsafe.

– molbdnilo, Commented Jan 20, 2015 at 13:31

Oh, guess I should concentrate a bit more. Thank you.

– teair, Commented Jan 20, 2015 at 13:40

Thomas · Accepted Answer · 2015-01-20 13:47:28Z

4

1.) how can I get my regex to replace every < a > regardless of whether it is lower or upper case

As already said by others, it would be best to use some HTML parser/cleaner since HTML doesn't fit regular expressions too well.

However, if you still want to use regular expressions and can make some assumptions (e.g. that the HTML is well-formed), you might want to use something like this expression:

(?i)</?(?:p|img|a).*?>

The expression is case-insensitive ((?i)), and the lazy quantifier .*? makes it match as little as possible. However, this has problems if an attribute contains a closing bracket, e.g. <a href="whatever" title=">>>"> would not be matched correctly. You could try to match pairs of quotation marks as well, but as you can see the expression gets ever more complicated. That's one reason why regular expressions don't fit HTML that well.
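To make that concrete, here is a minimal sketch that compiles the expression above once and applies it. The class name TagRegexDemo and the method name strip are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
final class TagRegexDemo {

    // The expression from the answer, compiled once and reused.
    // There is no word boundary after the tag name, so it would also
    // match tags that merely start with p, img or a, such as <pre>.
    static final Pattern TAGS = Pattern.compile("(?i)</?(?:p|img|a).*?>");

    // Removes every match of TAGS from the given text.
    static String strip(String html) {
        return TAGS.matcher(html).replaceAll("");
    }
}
```

Calling TagRegexDemo.strip("<A HREF='#'>Link</A><img></img>") returns "Link". For the problematic input <a href="whatever" title=">>>">x</a> the lazy match stops at the first > inside the title attribute, so the leftover >>">x remains in the result.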

2.) how can I get my code to delete every blacklisted tag, not only the last one in the array?

You need to operate on the intermediate result instead of on the initial parameter value:

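String intermediate = unsafe;
for (String s : blacklist) {
  // Apply each replacement to the previous result, not to the original input.
  // The pattern itself is unchanged from the question.
  intermediate = intermediate.replaceAll("\\<.{0,1}" + s + ".*?>", "");
}
String safe = intermediate; // maybe do some additional checks here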
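Putting both parts together, the method from the question might look roughly like the sketch below. The wrapper class name StringParserSketch is hypothetical, and Pattern.quote and the \b word boundary are additions of mine rather than something the question asked for: the first escapes regex metacharacters in a tag name, the second keeps "a" from also matching tags such as <article>.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only cleanHTML and its parameters come from the question.
final class StringParserSketch {

    static String cleanHTML(String unsafe, String[] blacklist) {
        String result = unsafe;
        for (String tag : blacklist) {
            // (?i) : case-insensitive, so <A> and <a> are both matched
            // </?  : opening or closing tag
            // \b   : (assumption) tag name must end here, so "a" skips <article>
            // .*?> : lazily consume attributes up to the first '>'
            String regex = "(?i)</?" + Pattern.quote(tag) + "\\b.*?>";
            // Work on the running result so every blacklisted tag is removed.
            result = result.replaceAll(regex, "");
        }
        return result;
    }
}
```

With the blacklist {"img", "a"}, the test string from the question should come out with the <img>, <A> and <a> tags removed, while <p>, the CDATA section and "< this is not html >" stay untouched. The caveat about > inside attribute values still applies.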

Of course, if the blacklist is large, you might want to work on a StringBuffer instead, so that you don't build a new intermediate String for every tag.

Another option, as I already demonstrated above, might be to add all those tags as alternation options, i.e. (?:a|img|p|br) etc., but if that list becomes too big it might also decrease performance.
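A sketch of the alternation approach, assuming the blacklist arrives as an array like in the question: the pattern is built once from the array and then applied in a single pass. The names BlacklistPattern, compile and strip are made up for this example, and the \b and Pattern.quote parts are my own additions. Whether this is faster than the loop would have to be measured.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class BlacklistPattern {

    // Builds one case-insensitive pattern such as (?i)</?(?:\Qimg\E|\Qa\E)\b.*?>
    static Pattern compile(String[] blacklist) {
        if (blacklist.length == 0) {
            // An empty alternation would match every tag, so reject it.
            throw new IllegalArgumentException("blacklist must not be empty");
        }
        StringJoiner alternatives = new StringJoiner("|", "(?i)</?(?:", ")\\b.*?>");
        for (String tag : blacklist) {
            alternatives.add(Pattern.quote(tag)); // escape regex metacharacters
        }
        return Pattern.compile(alternatives.toString());
    }

    // Removes all blacklisted tags in one pass over the input.
    static String strip(Pattern tags, String html) {
        return tags.matcher(html).replaceAll("");
    }
}
```

Usage would be something like Pattern tags = BlacklistPattern.compile(new String[]{"img", "a"}); followed by BlacklistPattern.strip(tags, input). The compiled Pattern can be kept around and reused for every input string.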

edited Jan 20, 2015 at 13:47

answered Jan 20, 2015 at 13:29

Thomas


3 Comments

teair Over a year ago

Thanks for your answer. Would a StringBuffer increase the performance when there are ~25-30 banned tags?

Thomas Over a year ago

@gruntswilldie You'd need to profile that, but instead of creating 25-30 intermediate string objects you'd operate on a single StringBuffer, which would at least save some memory, especially if the input string is somewhat larger.

teair Over a year ago

Guess I'm going to try my regex on a StringBuffer then, thank you.
