Remove HTML tags from a string except <a> in ASP.NET

Question

How can I clean a string, leaving only the plain text and the <a> elements?

Example:

<table><tr><td>Hello my web is <a href="http://www.myweb.com">Myweb</a>, <span>Follow my blog!</span></td></tr></table>

Results:

Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!

Thanks,

If you're trying to do this via RegEx (as per your tag) then remember this: Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1. RegEx can only match regular languages, and HTML is not a regular language.
– freefaller, Commented Apr 24, 2014 at 11:58
@freefaller looks like you got there with the "for the love of god, NO" advice before me. :)
– Mike H-R, Commented Apr 24, 2014 at 12:01

Brad Christie · Accepted Answer · 2014-04-24 12:24:22Z

VERY VERY hacky (and really shouldn't be used in production), but:

C#

Regex.Replace(input, @"<[^>]+?\/?>", m => {
    // here you can exclude specific tags such as `<a>` or maybe `<b>`, etc.
    return Regex.IsMatch(m.Value, @"^<a\b|\/a>$") ? m.Value : String.Empty;
});

Basically, it just removes every HTML tag except <a ...> and </a>.

Note: this DOES NOT:

Validate whether a tag was opened, closed, or nested correctly.
Validate whether the <> are actually HTML tags (maybe your input has < or > in the text itself?)
Handle "nested" <> tags (e.g. <img src="http://placeholde.it/100" alt="foo<Bar>"/> will leave a remainder of "/> in the output string).
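To make the behaviour and the last caveat concrete, here is a minimal sketch that wraps the two patterns from the snippet above in a console program. The class name StripDemo and the method name StripAllButAnchors are made up for this illustration; only the regular expressions come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripDemo
{
    // Same two patterns as the snippet above; the names here are hypothetical.
    public static string StripAllButAnchors(string input)
    {
        return Regex.Replace(input, @"<[^>]+?\/?>", m =>
            Regex.IsMatch(m.Value, @"^<a\b|\/a>$") ? m.Value : String.Empty);
    }

    static void Main()
    {
        // The example from the question.
        string html = "<table><tr><td>Hello my web is <a href=\"http://www.myweb.com\">Myweb</a>, <span>Follow my blog!</span></td></tr></table>";
        Console.WriteLine(StripAllButAnchors(html));
        // Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!

        // A '>' inside an attribute value ends the match early,
        // so the tail of the tag is left behind.
        Console.WriteLine(StripAllButAnchors("<img src=\"x.png\" alt=\"foo<Bar>\"/>"));
        // "/>
    }
}
```

The second call shows why this should be treated as a quick hack: the pattern has no notion of quoted attribute values, so it stops at the first > it sees.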

Here's the same thing turned into a helper method:

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Mimics PHP's strip_tags (http://www.php.net/strip_tags).

/// <summary>
/// Removes all HTML tags from the string and returns the stripped result.
/// If supplied, tags matching <paramref name="allowedTags"/> are left untouched.
/// </summary>
/// <param name="input">The input string.</param>
/// <param name="allowedTags">Tags to remain in the original input.</param>
/// <returns>Transformed input string.</returns>
static String StripTags(String input, params String[] allowedTags)
{
    if (String.IsNullOrEmpty(input)) return input;
    MatchEvaluator evaluator = m => String.Empty;
    if (allowedTags != null && allowedTags.Length > 0)
    {
        Regex reAllowed = new Regex(String.Format(@"^<(?:{0})\b|\/(?:{0})>$", String.Join("|", allowedTags.Select(x => Regex.Escape(x)).ToArray())));
        evaluator = m => reAllowed.IsMatch(m.Value) ? m.Value : String.Empty;
    }
    return Regex.Replace(input, @"<[^>]+?\/?>", evaluator);
}

// StripTags(input) -- all tags are removed
// StripTags(input, "a") -- all tags but <a> are removed
// StripTags(input, new[]{ "a" }) -- same as above

leskovar · Answer · 2014-04-24 12:01:30Z

This code removes all tags except the <a> tag.

        Regex r = new Regex(@"(?!</a>)(<\w+>|</\w+>)");
        var removedTags = r.Replace(inputString, "");
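A quick way to see what this pattern does and does not match is to run it on a few inputs. The sketch below assumes .NET's regex engine; the class name LookaheadDemo, the method name Strip, and the test strings are hypothetical, and only the pattern is taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Pattern copied from the answer above; the names here are hypothetical.
    static readonly Regex TagPattern = new Regex(@"(?!</a>)(<\w+>|</\w+>)");

    public static string Strip(string input)
    {
        return TagPattern.Replace(input, "");
    }

    static void Main()
    {
        // Attribute-free tags are removed. The <a href=...> tag contains
        // a space, so <\w+> never matches it and it is left in place.
        Console.WriteLine(Strip("<td>Hello <a href=\"http://www.myweb.com\">Myweb</a>, <span>blog</span></td>"));
        // Hello <a href="http://www.myweb.com">Myweb</a>, blog

        // A bare <a> is removed, because the lookahead only protects </a>.
        Console.WriteLine(Strip("<a>link</a>"));
        // link</a>

        // Opening tags that carry attributes are skipped entirely.
        Console.WriteLine(Strip("<span class=\"x\">hi</span>"));
        // <span class="x">hi
    }
}
```

So the pattern works on the question's example, but only because every tag to be removed there has no attributes.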

FYI you could compress that to (?!</?a>)</?\w+>. But your regex removes <a href> and I don't believe it should.
– Robin, Commented over a year ago

Mike H-R · Answer · 2014-04-24 11:59 (edited by Community on 2017-05-23 12:11:13Z)

First off, you can't use regexes to parse HTML.

Just do a global replace on something like </?table>|</?tr>|</?td>, extended with any other tags you don't want, and replace the matches with the empty string "".
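The global replace described in this answer could be sketched roughly as follows. The class name BlacklistDemo and the method name RemoveListedTags are hypothetical, and "span" is added to the list only so that the question's example comes out clean.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlacklistDemo
{
    // Tags to remove, written as one alternation. Extend the list as needed.
    // Only attribute-free tags such as <td> or </td> are matched.
    static readonly Regex Unwanted = new Regex(@"</?(?:table|tr|td|span)>");

    public static string RemoveListedTags(string input)
    {
        return Unwanted.Replace(input, "");
    }

    static void Main()
    {
        string html = "<table><tr><td>Hello my web is <a href=\"http://www.myweb.com\">Myweb</a>, <span>Follow my blog!</span></td></tr></table>";
        Console.WriteLine(RemoveListedTags(html));
        // Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!
    }
}
```

Unlike the other answers, this removes only the tags you list, so anything not on the list (including <a>) is left alone. A tag with attributes, such as <td class="x">, would also be left alone by this exact pattern.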
