
I want to delete HTML tags (defined in an array) from a string. My approach:

public String cleanHTML(String unsafe, String[] blacklist) {
   String safe = "";
   for (String s : blacklist) {
      safe = unsafe.replaceAll("\\<.{0,1}" + s + ".*?>", "");
   }

   return safe;
}

To test my function I use the following main method:

public static void main(String a[]){
    StringParser sp = new StringParser();
    String[] blacklist = new String[]{"img","a"};

    System.out.println( sp.cleanHTML("<p class='p1'>paragraph</p><img></img>< this is not html > <A HREF='#'>Link</A><a link=''>another link</a> <![CDATA[<sender>John Doe</sender>]]>",blacklist));

}

Output:

<p class='p1'>paragraph</p><img></img>< this is not html > <A HREF='#'>Link</A>another link <![CDATA[<sender>John Doe</sender>]]>

As you can see, it only strips the tags around "another link". So I basically have two questions: 1.) How can I get my regex to replace every <a> regardless of whether it is lower or upper case, and 2.) how can I get my code to delete every blacklisted tag, not only the last one in the array?

Thanks in advance.

  • Better use an HTML parser than regex. Commented Jan 20, 2015 at 13:27
  • unsafe.replaceAll does not modify unsafe. Commented Jan 20, 2015 at 13:31
  • Oh, guess I should concentrate a bit more. Thank you. Commented Jan 20, 2015 at 13:40

1 Answer


1.) How can I get my regex to replace every <a> regardless of whether it is lower or upper case?

As already said by others, it would be best to use some HTML parser/cleaner since HTML doesn't fit regular expressions too well.

However, if you still want to use regular expressions and are willing to make some assumptions (e.g. that the HTML is well-formed), you might want to use something like this expression:

(?i)</?(?:p|img|a)\b.*?>

The expression is case-insensitive ((?i)), the reluctant quantifier .*? makes it match as little as possible, and the word boundary \b keeps the a alternative from also matching longer tag names such as <abbr>. However, it still has problems if an attribute contains a closing bracket, e.g. <a href="whatever" title=">>>"> would not be matched correctly. You could try to match pairs of quotation marks as well, but as you can see the expression gets ever more complicated. That's one reason why regular expressions don't fit HTML that well.
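To make the closing-bracket problem concrete, here is a minimal sketch that runs the simple expression (with a \b word boundary after the tag name) against the title=">>>" example, next to one possible quote-aware variant. The class name TagRegexDemo, the helper strip and the QUOTE_AWARE pattern are assumptions made for illustration, not a vetted HTML tokenizer.

```java
import java.util.regex.Pattern;

class TagRegexDemo {
    // The simple expression: reluctant match up to the first '>' after the tag name.
    static final Pattern SIMPLE =
            Pattern.compile("(?i)</?(?:p|img|a)\\b.*?>");

    // Hypothetical quote-aware variant: inside the tag, consume either a
    // character that is neither a quote nor '>', or a complete "..." / '...' value.
    static final Pattern QUOTE_AWARE =
            Pattern.compile("(?i)</?(?:p|img|a)\\b(?:[^>\"']|\"[^\"]*\"|'[^']*')*>");

    // Removes every match of the given pattern from html.
    static String strip(Pattern tags, String html) {
        return tags.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a href=\"whatever\" title=\">>>\">text</a>";
        System.out.println(strip(SIMPLE, html));      // prints >>">text
        System.out.println(strip(QUOTE_AWARE, html)); // prints text
    }
}
```

The quote-aware pattern still assumes that quotes are balanced inside each tag; comments, CDATA sections and scripts would need further special cases, which is where a real parser starts to pay off.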

How can I get my code to delete every blacklisted tag, not only the last one in the array?

You need to operate on the intermediate result instead of on the initial parameter value:

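String intermediate = unsafe;
for (String s : blacklist) {
  // feed the result of the previous pass into the next one;
  // (?i) ignores case, /? allows closing tags, \b ends the tag name
  intermediate = intermediate.replaceAll("(?i)</?" + s + "\\b.*?>", "");
}
String safe = intermediate; // maybe do some additional checks here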
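Put together as a complete method and run against the test string from the question, the loop could look like the sketch below. The class name BlacklistLoop and the static modifier are assumptions for brevity (the question uses an instance method on StringParser), and the pattern is the case-insensitive form with a word boundary rather than the question's original one.

```java
class BlacklistLoop {
    // Static for brevity; the question's version is an instance method.
    static String cleanHTML(String unsafe, String[] blacklist) {
        String intermediate = unsafe;
        for (String tag : blacklist) {
            // Each pass starts from the previous pass's result, not from unsafe.
            intermediate = intermediate.replaceAll("(?i)</?" + tag + "\\b.*?>", "");
        }
        return intermediate;
    }

    public static void main(String[] args) {
        String input = "<p class='p1'>paragraph</p><img></img>< this is not html > "
                + "<A HREF='#'>Link</A><a link=''>another link</a> "
                + "<![CDATA[<sender>John Doe</sender>]]>";
        System.out.println(cleanHTML(input, new String[] {"img", "a"}));
        // prints: <p class='p1'>paragraph</p>< this is not html > Linkanother link <![CDATA[<sender>John Doe</sender>]]>
    }
}
```

Both the upper-case and lower-case anchors are gone and the img tags are removed as well, while the p tag and the CDATA content are left alone because they are not on the blacklist.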

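Of course, if there's a large blacklist, you might want to work on a StringBuffer instead (for example via Matcher.appendReplacement and appendTail), since each replaceAll call creates a new String.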
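One possible reading of the StringBuffer suggestion is sketched below, using Matcher.appendReplacement and Matcher.appendTail from java.util.regex. The class name BufferCleaner is made up for illustration, and the sketch still allocates one buffer per blacklisted tag, so whether it actually saves memory or time over replaceAll is something to profile rather than assume.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BufferCleaner {
    static String cleanHTML(String unsafe, String[] blacklist) {
        // Matcher accepts any CharSequence, so the buffer from one pass
        // can be matched directly in the next pass without toString().
        CharSequence current = unsafe;
        for (String tag : blacklist) {
            Pattern p = Pattern.compile("(?i)</?" + Pattern.quote(tag) + "\\b.*?>");
            Matcher m = p.matcher(current);
            StringBuffer out = new StringBuffer(current.length());
            while (m.find()) {
                // Copies the text before the match and replaces the match with "".
                m.appendReplacement(out, "");
            }
            m.appendTail(out); // copies whatever follows the last match
            current = out;
        }
        return current.toString();
    }
}
```

On Java 9 and later, appendReplacement and appendTail also accept a StringBuilder, which avoids StringBuffer's synchronization.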

Another option, as I already demonstrated above, might be to add all those tags as alternation options, i.e. (?:a|img|p|br) etc., but if that list becomes too big it might also decrease performance.
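A minimal sketch of the alternation approach, assuming the blacklist holds plain tag names: the array is joined into one (?:...) group, compiled once, and applied in a single pass. The class name AlternationCleaner and its two methods are hypothetical; Pattern.quote is used so that a name containing regex metacharacters cannot change the expression.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationCleaner {
    // Builds one case-insensitive pattern covering every blacklisted tag.
    static Pattern compile(String[] blacklist) {
        if (blacklist.length == 0) {
            // An empty group would match any tag name, so refuse it.
            throw new IllegalArgumentException("blacklist must not be empty");
        }
        String alternation = Arrays.stream(blacklist)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("</?(?:" + alternation + ")\\b.*?>",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes all blacklisted tags in one pass over the input.
    static String cleanHTML(String unsafe, Pattern tags) {
        return tags.matcher(unsafe).replaceAll("");
    }
}
```

Usage would be along the lines of Pattern tags = AlternationCleaner.compile(new String[] {"img", "a"}); followed by AlternationCleaner.cleanHTML(input, tags), keeping the compiled pattern around when many strings are cleaned with the same blacklist.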


3 Comments

Thanks for your answer. Would a StringBuffer increase the performance when there are ~25-30 banned tags?
@gruntswilldie you'd need to profile that, but instead of creating 25-30 intermediate String objects you'd operate on a single StringBuffer, which would at least save some memory, especially if the input string is somewhat larger.
Guess I'm going to try my regex on a StringBuffer then, thank you.
