Regular expression to remove HTML tags

Question

I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.

I do not know regular expression syntax at all and fumbled through this. Can someone with regex knowledge please provide me with a pattern that will work?

Here is my code:

  string sPattern = @"<\/?!?(img|a)[^>]*>";
  Regex rgx = new Regex(sPattern);
  Match m = rgx.Match(sSummary);
  string sResult = "";
  if (m.Success)
   sResult = rgx.Replace(sSummary, "", 1);

I am looking to remove the first occurrence of the <a> and <img> tags.

"I am using ... Regular Expresion to remove html tags" there's your problem. Use an HTML parser instead. — Welbog
– Welbog, Commented Sep 24, 2010 at 20:24
possible duplicate of RegEx match open tags except XHTML self-contained tags -- in spite of the title, this is an exact dupe. Promise. — egrunin
– egrunin, Commented Sep 24, 2010 at 20:25
Since other people can't see the possible use-case for this, here's mine... a) working within a code sandbox (Salesforce) where it is difficult, if not impossible, to include and maintain a 3rd-party library b) only trying to strip tags out of an email body for a cleaner email-to-case description (i.e. - no security issues involved) c) the stripHtmlTags() method did not do a sufficient job of removing the extra tags — Ixalmida
– Ixalmida, Commented Aug 2, 2018 at 16:02

Johs · Accepted Answer · 2014-06-23 12:30:59Z

29

To turn this:

'<td>mamma</td><td><strong>papa</strong></td>'

into this:

'mamma papa'

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ')

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ')

then trim away leading and trailing spaces with:

.trim();

Meaning that your remove-tags function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}

edited Jun 23, 2014 at 12:30

answered Jun 22, 2014 at 10:27

Johs


4 Comments

user280109 Over a year ago

this is a great answer, how would you change it, if you wanted to strip out all tags including the text content of the tags? just leaving behind text that was not inside tags?

user280109 Over a year ago

ahhh i figured it out, i came up with: function removeTags(string){ return string.replace(/<[^>]*>.*?(<[^>]*>)?/g, ' ') .replace(/\s{2,}/g, ' ') .trim(); }

Claudia Over a year ago

This is trivially broken, and should never be used for any reason. If you really want to sanitize HTML, use something that's actually aware of the HTML grammar. Try it against this input, which loads a 1px GIF, then assuming jQuery is present, loads a script:

<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAA‌QABAAACAkQBADs=" onload="$.getScript('evil.js');1<2>3">

. It won't correctly remove that element, even though it is supposed to.

Johs Over a year ago

Isiah, the "any reason" part of your warning does not quite fit this question. Using a regexp to remove something from a string sounded to me like a very controlled environment. If the task were content scraping or something more dynamic, as your example suggests, I agree that the solution is not some improvised regexp. Here's an article about a "bazillion different packages" for a start.
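The limitation raised in the comments above can be reproduced with a short script. This is only a sketch: it repeats removeTags from the answer, and the second input is a made-up, shortened stand-in for the commenter's example, chosen so that a ">" appears inside a quoted attribute value.

```javascript
// removeTags as given in the answer above.
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}

// Simple, well-behaved markup: works as described in the answer.
const simple = removeTags('<td>mamma</td><td><strong>papa</strong></td>');
console.log(simple); // mamma papa

// Illustrative input with a ">" inside a quoted attribute value.
// [^>]* stops at that first ">", so the tail of the tag survives.
const tricky = removeTags('<img src="x.gif" onload="f();1<2>3">');
console.log(tricky); // 3">
```

For input you fully control, the first case is usually all that matters; the second case is why the pattern should not be treated as a sanitizer.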

JaredPar · Accepted Answer · 2010-09-24 21:05:57Z

28

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.

Here is a link to a blog post I wrote awhile back which goes into more details about this problem.

http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That being said, here's a solution that should fix this particular problem. It in no way is a perfect solution though.

var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) { 
  sResult = m.Groups["content"].Value;
}

edited Sep 24, 2010 at 21:05

answered Sep 24, 2010 at 20:26

JaredPar


2 Comments

LilMoke Over a year ago

Jared, this seems to throw an exception when I try it. Also, will this remove the text between the tags? I essentially want to remove the first occurrence of the a, p and img tags from the string.

JaredPar Over a year ago

@Tony, fixed a bug in the regex. Should compile now
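To make the behaviour from the question concrete, here is a minimal JavaScript sketch (JavaScript is used only so it can be run quickly; the question itself is C#). The first pattern is the one from the question. The second pattern, elementPattern, and the sample string are hypothetical illustrations, not something proposed in this thread, and the second pattern only copes with an element whose content is plain text.

```javascript
// Pattern from the question, written as a JavaScript regex literal.
const tagPattern = /<\/?!?(img|a)[^>]*>/;

const summary = 'Intro <a href="blah">blah</a> outro';

// Without the g flag, replace() touches only the first match,
// much like Regex.Replace(input, "", 1) in the question.
const firstTagOnly = summary.replace(tagPattern, "");
console.log(firstTagOnly); // Intro blah</a> outro

// Hypothetical variant: after the opening tag, optionally consume plain
// text content and the matching closing tag (\1 repeats the tag name).
const elementPattern = /<(img|a)\b[^>]*>(?:[^<]*<\/\1>)?/;
const firstElement = summary.replace(elementPattern, "");
console.log(firstElement); // Intro  outro
```

The closing tag is left behind in the first case because the opening tag is the first match and only one replacement is made.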

Vadim Tofan · Accepted Answer · 2014-12-09 19:29:20Z

To also remove the spaces between tags, you can use the following method, which combines a regex with a trim of the spaces at the start and end of the input HTML:

    public static string StripHtml(string inputHTML)
    {
        const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
        inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();

        string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);

        return noHTML;
    }

So for the following input:

      <p>     <strong>  <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del>   test text  </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>

The output will be only the text without spaces between html tags or space before or after html: " test text test 1 test 2 test 3 ".

Please notice that the spaces before test text are from the <del> test text </del> html and the space after test 3 is from the <span ...> test 3 </span> html.
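The same alternation can be tried outside .NET. The sketch below is a rough JavaScript port, assuming the entity-decoding step (WebUtility.HtmlDecode) can be skipped; stripHtml and the sample string are names and data invented for this illustration.

```javascript
// Same alternation as HTML_MARKUP_REGEX_PATTERN above, with the g flag:
// a tag plus the whitespace that separates it from the next tag, or a lone tag.
const HTML_MARKUP = /<[^>]+>\s+(?=<)|<[^>]+>/g;

function stripHtml(inputHtml) {
  // Entity decoding from the C# version is intentionally left out of this sketch.
  return inputHtml.trim().replace(HTML_MARKUP, "");
}

const sample = '  <p>   <strong> <em> test text </em></strong></p><p><span> test 1 </span></p>  ';
const stripped = stripHtml(sample);
console.log(JSON.stringify(stripped)); // " test text  test 1 "
```

Whitespace between two tags is removed, while whitespace that sits next to text inside an element is kept, which is why the result still starts and ends with a space.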

Community · Accepted Answer · 2020-06-20 09:12:55Z

5

Strip off HTML Elements (with/without attributes)

/<\/?[\w\s]*>|<.+[\W]>/g

This will strip off all HTML elements and leave behind the text. This works well even for malformed HTML elements (i.e. elements that are missing closing tags)

Reference and example (Ex.10)

edited Jun 20, 2020 at 9:12

CommunityBot


answered Jul 4, 2018 at 16:27

Niket Pathak


8 Comments

GaryP Over a year ago

This helped me. How about if I need to search/replace specific tag? Like span?

Niket Pathak Over a year ago

For targeting span tags, you could either modify the accepted answer to suit your needs or use /<\/?[span]*>|<.+[\W]>/g

GaryP Over a year ago

The regex also matches <a> tags.

Niket Pathak Over a year ago

oh yes, you are right. There is a small error in the regex given in the comments, which will also cause it to match all <a> tags. To rectify, all you need to do is, remove the square brackets around the word span. i.e. /<\/?span*>|<.+[\W]>/g

Mitar Over a year ago

This fails on

<h2 class="ofscreen">Webontwikkeling leren</h2><h1>Regular Expressions</h1><p>"Alle onderdelen van MDN (documenten en de website zelf) worden gemaakt door een open gemeenschap."</p><br/>

, it selects the whole string.
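The failure reported in the last comment comes from the greedy .+ in the second alternative. A small sketch with a shorter, made-up input that shows the same effect:

```javascript
// Pattern from the answer above.
const stripPattern = /<\/?[\w\s]*>|<.+[\W]>/g;

// Tags without attributes are handled by the first alternative.
const plain = "<h1>Title</h1><p>Body</p>".replace(stripPattern, "");
console.log(plain); // TitleBody

// Illustrative input: the attribute forces the second alternative, whose
// greedy .+ runs on to the last "non-word character followed by >",
// which here is the "/>" of <br/> at the very end.
const greedy = '<p class="a">one</p><br/>'.replace(stripPattern, "");
console.log(JSON.stringify(greedy)); // ""
```

In the second case the text "one" is removed together with the markup, so the pattern is only dependable when no tag carries attributes or when nothing later in the string ends in a non-word character followed by ">".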


Tran Anh Hien · Accepted Answer · 2016-12-30 02:47:04Z

4

You can use:

Regex.Replace(source, "<[^>]*>", string.Empty);

answered Dec 30, 2016 at 2:47

Tran Anh Hien


Dave Clemmer · Accepted Answer · 2013-04-28 19:28:25Z

3

So the HTML parser everyone's talking about is Html Agility Pack.

If it is clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.

edited Apr 28, 2013 at 19:28

Dave Clemmer


answered Sep 24, 2010 at 20:36

Rei Miyasaka


MasterPiece · Accepted Answer · 2018-12-31 11:26:05Z

If you need to find only the opening tags you can use the following regex, which will capture the tag type as $1 (a or img) and the content (including closing tag if there is one) as $2:

(?:<(a|img)(?:\s[^>]*)?>)((?:(?!<\1)[\s\S])*)

If you also have a closing tag, you should use the following regex, which will capture the tag type as $1 (a or img) and the content as $2:

(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)

Basically, you just need to use a replace function with one of the regexes above and return $2 to get what you wanted.

Short explanation about the query:

( ) - is used for capturing whatever matches the regex inside the brackets. The order of the capturing is the order of: $1, $2 etc.
?: - is used after an opening bracket "(" for not capturing the content inside the brackets.
\1 - is copying capture number 1, which is the tag type. I had to capture the tag type so closing tag will be consistent to the opening one and not something like: <img src=""> </a>.
\s - is white space, so after opening tag <img there will be at least 1 white space in case there are attributes (so it won't match <imgs> for example).
[^>]* - is looking for anything but the chars inside, which in this case is >, and * means for unlimited times.
?! - is looking for anything but the string inside, kinda similar to [^>] just for string instead of single chars.
[\s\S] - is used almost like . but allow any whitespaces (which will also match in case there are new lines between the tags). If you are using regex "s" flag, then you can use . instead.

Example of using with closing tag: https://regex101.com/r/MGmzrh/1

Example of using without closing tag: https://regex101.com/r/MGmzrh/2

Regex101 also has some explanation of what I did :)
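As a rough illustration of the "replace and return $2" step, here is a JavaScript sketch using the with-closing-tag pattern from this answer. The variable names and the sample string are assumptions made for the example.

```javascript
// The "with closing tag" pattern from this answer, as a JavaScript literal.
// No g flag, so only the first matching element is unwrapped.
const withClosingTag = /(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)/;

const input = 'See <a href="blah">blah</a> and <b>more</b>';

// "$2" in the replacement string stands for the second capture group,
// that is, the content between the opening and closing tag.
const unwrapped = input.replace(withClosingTag, "$2");
console.log(unwrapped); // See blah and <b>more</b>
```

Replacing with an empty string instead of "$2" would drop the content as well as the tags.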

A_Var · Accepted Answer · 2010-09-24 20:40:14Z

2

You can use already existing libraries to strip off the html tags. One good one being Chilkat C# Library.

answered Sep 24, 2010 at 20:40

A_Var


1 Comment

LilMoke Over a year ago

This is all well and good, but I not only need to remove the tag, I need to remove everything between the tags.

nurealam siddiq · Accepted Answer · 2019-08-30 20:16:55Z

2

This piece of code shows how to remove a specific HTML tag together with its content:

import re
string = '<a href="blah">blah</a>'
# remember, sub takes 3 arguments: pattern, replacement, input string
replaced_string = re.sub(r'<a.*href="blah">.*</a>', '', string)

Output is an empty string.

answered Aug 30, 2019 at 20:16

nurealam siddiq


Seph Reed · Accepted Answer · 2021-03-08 05:03:36Z

If all you're trying to do is remove the tags (and not figure out where the closing tag is), I'm really not sure why people are so fraught about it.

This Regex seems to handle anything I can throw at it:

<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>

To break it down:

<([\w\-/]+) - match the beginning of the opening or closing tag. if you want to handle invalid stuff, you can add more here
( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* - this bit matches attributes [0, N] times (* at the end)
- +[\w\-]+ - is space(s) followed by an attribute name
- (=(('[^']*')|("[^"]*")))? - not all attributes have assignment (?)
  - ('[^']*')|("[^"]*") - of the attributes that do have assignment, the value is a string with either single or double quotes. It's not allowed to skip over a closing quote to make things work
*> - the whole thing ends with any number of spaces, then the closing bracket

Obviously this will mess up if someone throws super invalid html at it, but it works for anything valid I've come up with yet. Test it out here:

const regex = /<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>/g;

const byId = (id) => document.getElementById(id);

function replace() {
console.log(byId("In").value)
  byId("Out").innerText = byId("In").value.replace(regex, "CUT");
}

Write your html here: <br>
<textarea id="In" rows="8" cols="50"></textarea><br>
<button onclick="replace()">Replace all tags with "CUT"</button><br>
<br>
Output:
<div id="Out"></div>
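For a quick check outside the browser, the same regex can be run on a few strings directly. This is only a sketch: stripTags and the sample inputs are invented here, and the last case shows one kind of markup (an unquoted attribute value) that the pattern does not cover.

```javascript
// Regex from this answer, used to delete tags instead of replacing them with "CUT".
const tagRegex = /<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>/g;

const stripTags = (html) => html.replace(tagRegex, "");

// A ">" inside a quoted attribute value does not end the tag early.
const quoted = stripTags('<img src="a.gif" onload="1<2>3">text');
console.log(quoted); // text

// Single-quoted values and closing tags are handled as well.
const link = stripTags("<a href='x'>blah</a>");
console.log(link); // blah

// Unquoted attribute values fall outside the pattern, so the opening tag stays.
const unquoted = stripTags("<a href=x>blah</a>");
console.log(unquoted); // <a href=x>blah
```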

Mayank Gupta · Accepted Answer · 2015-04-02 06:52:00Z

1

Remove an image's src from the string, using a regular expression in C# (the image is found by its id):

string PRQ = "<td valign=\"top\" style=\"width: 400px;\" align=\"left\"><img id=\"llgo\" src=\"http://test.Logo.png\" alt=\"logo\"></td>";

var regex = new Regex("(<img(.+?)id=\"llgo\"(.+?))src=\"([^\"]+)\"");

// Keep group 1 (the tag text before src); the src attribute itself is dropped.
PRQ = regex.Replace(PRQ, match => match.Groups[1].Value);

edited Apr 2, 2015 at 6:52

answered Apr 2, 2015 at 6:26

Mayank Gupta


fatnlazycat · Accepted Answer · 2017-08-16 13:36:20Z

1

Why not try a reluctant quantifier? htmlString.replaceAll("<\\S*?>", "")

(It's Java but the main thing is to show the idea)
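One thing to keep in mind with this idea is that \S cannot cross the spaces inside a tag that has attributes. The sketch below ports the pattern to JavaScript and compares it with a reluctant .*? on a made-up sample string; the variable names are illustrative.

```javascript
const sample = '<b>bold</b> <a href="x">link</a>';

// \S*? (the pattern from this answer): reluctant, but it cannot cross
// the space inside <a href="x">, so that opening tag is left in place.
const nonSpace = sample.replace(/<\S*?>/g, "");
console.log(nonSpace); // bold <a href="x">link

// .*? is also reluctant and may cross spaces (though not line breaks).
const anyChar = sample.replace(/<.*?>/g, "");
console.log(anyChar); // bold link
```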

answered Aug 16, 2017 at 13:36

fatnlazycat


Rakesh Chaudhari · Accepted Answer · 2018-09-14 13:36:46Z

1

Simple way,

String html = "<a>Rakes</a> <p>paroladasdsadsa</p> My Name Rakes";

html = html.replaceAll("(<[\\w]+>)(.+?)(</[\\w]+>)", "$2");

System.out.println(html);

answered Sep 14, 2018 at 13:36

Rakesh Chaudhari


Wiktor Stribiżew · Accepted Answer · 2019-07-08 14:13:42Z

1

Here is the extension method I've been using for quite some time.

public static class StringExtensions
{
     public static string StripHTML(this string htmlString, string htmlPlaceHolder) {
         const string pattern = @"<.*?>";
         string sOut = Regex.Replace(htmlString, pattern, htmlPlaceHolder, RegexOptions.Singleline);
         sOut = sOut.Replace("&nbsp;", String.Empty);
         sOut = sOut.Replace("&amp;", "&");
         sOut = sOut.Replace("&gt;", ">");
         sOut = sOut.Replace("&lt;", "<");
         return sOut;
     }
}

edited Jul 8, 2019 at 14:13

Wiktor Stribiżew


answered Oct 17, 2013 at 16:31

ShawnCamp


Mohammad Farhadi · Accepted Answer · 2023-07-26 12:09:12Z

My friends, I used these patterns and solved my problem with any tags.

🚩 Be cautious, it's not recommended to use with nested HTML tags:

Regular way:

const str = "<h1>You are awesome!</h1>";
const nestedStr = `<p class="wrapper"><span class="you">You </span><h1 id="awesome">are awesome!</h1></p>`;

console.log("Original --> " + str);
console.log("Replaced version --> " + str.replace(/(<([^>]+)>)/gi, ""));

console.log("---------------------------------------------------");

console.log("Original Nested --> " + nestedStr);
console.log("Replaced Nested version --> " + nestedStr.replace(/(<([^>]+)>)/gi, ""));

The new and safe way:

const str = "<h1>You are awesome!</h1>";
const nestedStr = `<p class="wrapper"><span class="you">You </span><h1 id="awesome">are awesome!</h1></p>`;
    
const betterClearHTMLTags = (strToSanitize) => {
   let myHTML = new DOMParser().parseFromString(strToSanitize, 'text/html');
   return myHTML.body.textContent || '';
}

console.log("Original --> " + str);
console.log("Replaced version --> " + betterClearHTMLTags(str));

console.log("---------------------------------------------------");

console.log("Original Nested --> " + nestedStr);
console.log("Replaced Nested version --> " + betterClearHTMLTags(nestedStr));

The main article: dev.to/alvisonhunter

sln · Accepted Answer · 2025-01-28 23:15:42Z

1

Regex to remove ALL Html
No Atomic, no Assert

https://regex101.com/r/YOYd6R/1

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

answered Jan 28 at 23:15

sln


Breakskater · Accepted Answer · 2010-09-24 20:38:18Z

0

Here's an extension method I created using a simple regular expression to remove HTML tags from a string:

/// <summary>
/// Converts an HTML string to plain text, and replaces all br tags with line breaks.
/// </summary>
// Must be declared inside a static class to work as an extension method.
public static string ToPlainText(this string s)
{
    // Preserve line breaks before the remaining tags are stripped.
    s = s.Replace("<br>", "\r\n");
    s = s.Replace("<br />", "\r\n");
    s = s.Replace("<br/>", "\r\n");

    // Remove every remaining tag.
    s = Regex.Replace(s, "<[^>]*>", string.Empty);

    return s;
}

Hope that helps.

answered Sep 24, 2010 at 20:38

Breakskater


8 Comments

Rei Miyasaka Over a year ago

There's more than just <br> that ends with a slash; in fact, technically, any element can end with a slash -- and it might not necessarily be with one or no spaces following it or trailing it. This is also valid: 

Breakskater Over a year ago

Those lines are just there to preserve line breaks, if needed. Otherwise, they may be removed.

Julien Roncaglia Over a year ago

Nice, where are you using this ? on a public web site ? entering '<script src="evil.com/evil.js" ' (notice no ">" character) is enough to exploit it :D

Breakskater Over a year ago

Rei, it will remove You haven't even tested it

Breakskater Over a year ago

VirtualBlackFox, yes I am using it on a Public web site, and quite effectively. '<script src="evil.com/evil.js" ' is malformed and will not run, so that is a moot point.


DevWL · Accepted Answer · 2021-09-09 10:03:02Z

0

Select everything except what's in there:

(?:<span.*?>|<\/span>|<p.*?>|<\/p>)

edited Sep 9, 2021 at 10:03

answered Sep 9, 2021 at 7:16

DevWL
