How can I remove HTML tags from a string using regex?

Question

I am fetching data from MySQL, but the issue is that HTML tags, i.e.

&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb;<br>;li;ul;&nbsp;

are also being fetched with my data. I just need "LARGE" and "Lamb" from the line above. How can I separate/remove HTML tags from a string?

Given that the HTML is invalid by virtue of being partially encoded, you really really REALLY don't want to try to come up with a regex that can handle this. Regexes to manipulate HTML are bad enough. Regexes that can manipulate BROKEN HTML are beyond the purview of even gods like Alan Turing.
– Marc B, Commented Jan 12, 2015 at 20:11
@xxbbcc  is no html, or not something that will build up to a DOM. It will become the literal text , without any semantic meaning to HTML.
– CodeCaster, Commented Jan 12, 2015 at 20:19
@CodeCaster As I said, I'm aware of that - I didn't mis-read his sample. However, even there, he has   so the sample would be parsed into several nodes as it is. If he properly decodes the HTML before parsing, it becomes similar input with more HTML nodes in it.
– xxbbcc, Commented Jan 12, 2015 at 20:27
For the OP's reference, here's why you cannot use regex: stackoverflow.com/a/1732454/682404
– xxbbcc, Commented Jan 12, 2015 at 20:35

Mitch · Accepted Answer · 2015-01-12 20:25:57Z

2

I am going to assume that the HTML is intact, perhaps something like the following:

<ul><li><p>LARGE</p><p>Lamb<br></li></ul>&nbsp;

In which case, I would use HtmlAgilityPack to get the content without having to resort to regex.

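using System.Linq;
using HtmlAgilityPack;

var html = "<ul><li><p>LARGE</p><p>Lamb</p><br></li></ul>&nbsp;";
var hap = new HtmlDocument();
hap.LoadHtml(html);

// InnerText concatenates every text node; DeEntitize decodes entities such as &nbsp;
string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
// text is now "LARGELamb " (the trailing character comes from &nbsp;)

// Select each text node separately to keep the pieces apart
string[] lines = hap.DocumentNode.SelectNodes("//text()")
    .Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
// lines is { "LARGE", "Lamb", " " }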

edited Jan 12, 2015 at 20:25

answered Jan 12, 2015 at 20:19


2 Comments

naeemshah1 Over a year ago

But Mitch, it can't remove a span tag. For example, I have a span containing Any Pizza and I want to get only the "Any Pizza" string.

Mitch Over a year ago

@naeemshah1, did you try it? It worked as you asked, lines will be { "Any Pizza" } in that example.

mybirthname · Accepted Answer · 2015-01-12 20:28:05Z

1

Assuming you fix your HTML elements first, something like this should work:

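    // Requires: using System; using System.Collections.Generic; using System.Linq;
    //           using System.Net; using HtmlAgilityPack;
    static void Main(string[] args)
    {
        // Decode the entity-encoded markup back into real HTML first
        string html = WebUtility.HtmlDecode("&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb&lt;/p&gt;");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect every <p> element in the document
        List<HtmlNode> paragraphNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "p").ToList();

        foreach (HtmlNode node in paragraphNodes)
        {
            // Prints the markup inside each <p>
            Console.WriteLine(node.InnerHtml);
        }
    }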

You need the HTML Agility Pack. You can add a reference to it from the NuGet Package Manager Console like this:

Install-Package HtmlAgilityPack

edited Jan 12, 2015 at 20:28

answered Jan 12, 2015 at 20:21


4 Comments

naeemshah1 Over a year ago

PM> Install-Package HtmlAgilityPack Install-Package : File contains corrupted data. At line:1 char:16 + Install-Package <<<< HtmlAgilityPack + CategoryInfo : NotSpecified: (:) [Install-Package], FileFormatException + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PowerShell.Commands.InstallPackageCommand

mybirthname Over a year ago

@naeemshah1 Check this link; you can download it from there too: htmlagilitypack.codeplex.com

naeemshah1 Over a year ago

But mybirthname, it can't remove a span tag. For example, I have a span containing Any Pizza and I want to get only the "Any Pizza" string.

mybirthname Over a year ago

@naeemshah1 Stack Overflow's policy is one problem per question. You asked how to extract the text, and I gave you the answer. If you have another problem, create a new question and mark this one as finished.

naeemshah1 · Accepted Answer · 2015-01-13 13:56:12Z

0

Try this:

// Requires: using System.Text.RegularExpressions;

// Erase HTML tags from a string.
public static string StripHtml(string target)
{
    // Matches "<", one non-whitespace character, then everything up to the next ">"
    Regex stripHtmlExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    return stripHtmlExpression.Replace(target, string.Empty);
}

Call it like this:

string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
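
To see where this pattern helps and where it does not, here is a minimal sketch that runs it against three inputs. The class name StripHtmlDemo and the test strings are assumptions made for illustration; the extra RegexOptions flags from the snippet above are left out because this pattern contains no letters, dots, or anchors for them to act on.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern itself comes from the answer above.
public static class StripHtmlDemo
{
    private static readonly Regex TagPattern =
        new Regex("<\\S[^><]*>", RegexOptions.Compiled);

    public static string StripHtml(string target)
    {
        return TagPattern.Replace(target, string.Empty);
    }

    public static void Main()
    {
        // Literal tags are removed.
        Console.WriteLine(StripHtml("<div><span>hello world!</span></div>")); // hello world!

        // Entity-encoded tags contain no '<' or '>', so nothing matches.
        Console.WriteLine(StripHtml("&lt;p&gt;LARGE&lt;/p&gt;")); // &lt;p&gt;LARGE&lt;/p&gt;

        // Tags go, but entities such as &nbsp; are left in place.
        Console.WriteLine(StripHtml("Lamb<br>&nbsp;")); // Lamb&nbsp;
    }
}
```

So for the partially encoded string in the question, the input would need to be decoded before a pattern like this can do anything useful.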

answered Jan 13, 2015 at 13:56


GMalla · Accepted Answer · 2015-01-12 20:46:04Z

Assuming that:

the original string is always going to be in that specific format, and that
you cannot add the HTMLAgilityPack,

here is a quick and dirty way of getting what you want:

    static void Main(string[] args)
    {
        // Split original string on the 'separator' string.
        string originalString = "&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb;<br>;li;ul;&nbsp;";
        string[] sSeparator = new string[] { "&lt;/p&gt;&lt;p&gt;" };
        string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);

        // Prepare to filter the 'prefix' and 'postfix' strings
        string prefix = "&lt;p&gt;";
        string postfix = ";<br>;li;ul;&nbsp;";
        int prefixLength = prefix.Length;
        int postfixLength = postfix.Length;

        // Iterate over the split string and clean up
        string s = string.Empty;
        for (int i = 0; i < splitString.Length; i++)
        {
            s = splitString[i];
            if (s.Contains(prefix))
            {
                s = s.Remove(s.IndexOf(prefix), prefixLength);

            }
            if (s.Contains(postfix))
            {
                s = s.Remove(s.IndexOf(postfix), postfixLength);
            }

            splitString[i] = s;
            Console.WriteLine(splitString[i]);
        }

        Console.ReadLine();
    }

Eduardo Freitas · Accepted Answer · 2016-11-25 15:22:46Z

0

// Requires: using System.Web; using System.Text.RegularExpressions;
// Decode entities such as &lt; and &gt; back into literal HTML characters
String sResult = HttpUtility.HtmlDecode(sData);
// Remove HTML tags delimited by <>
String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);
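
A self-contained variant of this decode-then-strip idea, as a rough sketch rather than a definitive implementation: it swaps in WebUtility.HtmlDecode from System.Net so no System.Web reference is needed, and replaces each tag with a newline so that adjacent text runs stay separate. The names TagStripDemo and ExtractTextRuns are hypothetical, and the input is assumed to be well-formed once decoded; stray debris such as the ";li;ul;" in the question's sample would survive it.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any answer in this thread.
public static class TagStripDemo
{
    public static string[] ExtractTextRuns(string encodedHtml)
    {
        // Turn &lt;p&gt; into <p>, &nbsp; into U+00A0, and so on.
        string decoded = WebUtility.HtmlDecode(encodedHtml);

        // Replace each tag with a newline so "LARGE" and "Lamb" do not fuse together.
        string withBreaks = Regex.Replace(decoded, "<[^>]*>", "\n");

        return withBreaks
            .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // Normalise non-breaking spaces, then drop surrounding whitespace.
            .Select(piece => piece.Replace('\u00A0', ' ').Trim())
            .Where(piece => piece.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string[] runs = ExtractTextRuns("&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb&lt;/p&gt;&nbsp;");
        Console.WriteLine(string.Join("|", runs)); // LARGE|Lamb
    }
}
```

This still inherits every weakness of regex-based tag stripping raised in the comments on the question, so a parser such as HTML Agility Pack remains the safer choice for anything beyond simple, trusted markup.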

answered Nov 25, 2016 at 15:22
