How can I clean a string, leaving only the plain text and the <a> elements?

Example:

<table><tr><td>Hello my web is <a href="http://www.myweb.com">Myweb</a>, <span>Follow my blog!</span></td></tr></table>

Expected result:

Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!

Thanks,

  • If you're trying to do this via RegEx (as per your tag) then remember this: Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1. RegEx can only match regular languages, and HTML is not a regular language. Commented Apr 24, 2014 at 11:58
  • @freefaller looks like you got there with the "for the love of god, NO" advice before me. :) Commented Apr 24, 2014 at 12:01

3 Answers

VERY VERY hacky (and really shouldn't be used in production), but:

C#

Regex.Replace(input, @"<[^>]+?\/?>", m => {
    // here you can exclude specific tags such as `<a>` or maybe `<b>`, etc.
    return Regex.IsMatch(m.Value, @"^<a\b|\/a>$") ? m.Value : String.Empty;
});

Basically, it removes every HTML tag except <a ...> and </a>, so the link markup and all plain text survive.
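To see it run end to end, here is a minimal sketch that wraps the same regex and evaluator in a method and feeds it the string from the question. The class and method names (KeepAnchorsDemo, KeepAnchors) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeepAnchorsDemo
{
    // Same pattern and evaluator as the snippet above; only the wrapper is new.
    public static string KeepAnchors(string input)
    {
        return Regex.Replace(input, @"<[^>]+?\/?>", m =>
            // Keep the match only if it looks like <a ...> or </a>.
            Regex.IsMatch(m.Value, @"^<a\b|\/a>$") ? m.Value : String.Empty);
    }

    public static void Main()
    {
        string input = "<table><tr><td>Hello my web is <a href=\"http://www.myweb.com\">Myweb</a>, <span>Follow my blog!</span></td></tr></table>";
        Console.WriteLine(KeepAnchors(input));
        // Prints: Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!
    }
}
```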

Note: this DOES NOT:

  • Validate if a tag was opened/closed/nested correctly.
  • Validate if the <> are actually HTML tags (maybe your input has < or > in the text itself?)
  • Handle "nested" <> tags. (e.g. <img src="http://placeholde.it/100" alt="foo<Bar>"/> will leave a remainder of "/> in the output string)
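The last two caveats are easy to reproduce. The sketch below reuses the same regex in a hypothetical KeepAnchors helper (the names are assumptions for illustration) and runs it on the <img> example plus a plain-text comparison:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripLimitationsDemo
{
    // Same pattern and evaluator as in the answer; the wrapper is illustrative.
    public static string KeepAnchors(string input)
    {
        return Regex.Replace(input, @"<[^>]+?\/?>", m =>
            Regex.IsMatch(m.Value, @"^<a\b|\/a>$") ? m.Value : String.Empty);
    }

    public static void Main()
    {
        // A ">" inside an attribute value ends the match early,
        // so the tail of the tag is left behind.
        string img = "<img src=\"http://placeholde.it/100\" alt=\"foo<Bar>\"/>";
        Console.WriteLine(KeepAnchors(img)); // Prints: "/>

        // "<" and ">" in ordinary text are mistaken for a tag,
        // so everything between them is silently dropped.
        string text = "if a < b and c > d";
        Console.WriteLine(KeepAnchors(text)); // "b and c" is gone
    }
}
```

If your input can contain either case, that is a strong hint to reach for a real HTML parser instead.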

Here's the same thing turned into a helper method:

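// Mimics PHP's strip_tags (http://www.php.net/strip_tags).
// Requires: using System; using System.Linq; using System.Text.RegularExpressions;

/// <summary>
/// Removes all HTML tags from the string and returns the stripped result.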
/// If supplied, tags matching <paramref name="allowedTags"/> will be left untouched.
/// </summary>
/// <param name="input">The input string.</param>
/// <param name="allowedTags">Tags to remain in the original input.</param>
/// <returns>Transformed input string.</returns>
static String StripTags(String input, params String[] allowedTags)
{
    if (String.IsNullOrEmpty(input)) return input;
    MatchEvaluator evaluator = m => String.Empty;
    if (allowedTags != null && allowedTags.Length > 0)
    {
        Regex reAllowed = new Regex(String.Format(@"^<(?:{0})\b|\/(?:{0})>$", String.Join("|", allowedTags.Select(x => Regex.Escape(x)).ToArray())));
        evaluator = m => reAllowed.IsMatch(m.Value) ? m.Value : String.Empty;
    }
    return Regex.Replace(input, @"<[^>]+?\/?>", evaluator);
}

// StripTags(input) -- all tags are removed
// StripTags(input, "a") -- all tags but <a> are removed
// StripTags(input, new[]{ "a" }) -- same as above

1 Comment

better answer than mine.

This code removes all tags except the <a> tag.

        Regex r = new Regex(@"(?!</a>)(<\w+>|</\w+>)");
        var removedTags = r.Replace(inputString, "");

1 Comment

FYI, you could compress that to (?!</?a>)</?\w+>. But your original regex removes a bare <a> opening tag, and I don't believe it should.
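Note that neither pattern here matches a tag that carries attributes, because <\w+> cannot get past the first space. A minimal sketch of one attribute-aware variant, assuming the goal is still to keep only <a> tags. The pattern and the names (AnchorOnlyDemo, RemoveNonAnchorTags) are illustrative assumptions, not something taken from the answer:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorOnlyDemo
{
    // Matches any opening or closing tag, with or without attributes,
    // unless the tag name is exactly "a" (so <abbr> is still removed).
    private static readonly Regex NonAnchorTag =
        new Regex(@"</?(?!a\b)[a-z][^>]*>", RegexOptions.IgnoreCase);

    public static string RemoveNonAnchorTags(string input)
    {
        return NonAnchorTag.Replace(input, String.Empty);
    }

    public static void Main()
    {
        string input = "<table><tr><td>Hello my web is <a href=\"http://www.myweb.com\">Myweb</a>, <span>Follow my blog!</span></td></tr></table>";
        Console.WriteLine(RemoveNonAnchorTags(input));
        // Prints: Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!
    }
}
```

It is still a regex, so a ">" inside an attribute value or a stray "<" in the text will trip it up.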

First off, you can't use regexes to parse HTML.

Just do a global replace on something like </?table>|</?tr>|</?td>, adding any other tags you don't want, and replace the matches with the empty string "".
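In C# that could look roughly like the sketch below. It builds the alternation from a list of unwanted tag names. The \b[^>]* part is an addition so that tags with attributes are removed too, and the names (BlacklistDemo, RemoveTags) are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlacklistDemo
{
    public static string RemoveTags(string input, params string[] unwantedTags)
    {
        // Nothing to remove: an empty alternation would match every tag.
        if (unwantedTags == null || unwantedTags.Length == 0) return input;

        // Builds e.g. </?(?:table|tr|td|span)\b[^>]*>
        string names = String.Join("|", unwantedTags.Select(t => Regex.Escape(t)));
        string pattern = @"</?(?:" + names + @")\b[^>]*>";
        return Regex.Replace(input, pattern, String.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "<table><tr><td>Hello my web is <a href=\"http://www.myweb.com\">Myweb</a>, <span>Follow my blog!</span></td></tr></table>";
        Console.WriteLine(RemoveTags(input, "table", "tr", "td", "span"));
        // Prints: Hello my web is <a href="http://www.myweb.com">Myweb</a>, Follow my blog!
    }
}
```

The trade-off is that any tag you forget to list stays in the output, whereas the keep-list approaches remove everything they don't recognize.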
