1

I am fetching data from MySQL, but the issue is that HTML tags, e.g.

&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb;<br>;li;ul;&nbsp;

are also being fetched with my data. I just need "LARGE" and "Lamb" from the above line. How can I separate/remove HTML tags from a string?

14
  • 7
    Given that the HTML is invalid by virtue of being partially encoded, you really really REALLY don't want to try to come up with a regex that can handle this. Regexes to manipulate HTML are bad enough. Regexes that can manipulate BROKEN HTML are beyond the purview of even gods like Alan Turing. Commented Jan 12, 2015 at 20:11
  • 2
    Good luck. Your source data is screwed. Start again. Commented Jan 12, 2015 at 20:15
  • 1
    @xxbbcc &lt;p&gt; is not HTML, or at least not something that will build up to a DOM. It will become the literal text <p>, without any semantic meaning in HTML. Commented Jan 12, 2015 at 20:19
  • 2
    @CodeCaster As I said, I'm aware of that - I didn't mis-read his sample. However, even there, he has <br> so the sample would be parsed into several nodes as it is. If he properly decodes the HTML before parsing, it becomes similar input with more HTML nodes in it. Commented Jan 12, 2015 at 20:27
  • 1
    For the OP's reference, here's why you cannot use regex: stackoverflow.com/a/1732454/682404 Commented Jan 12, 2015 at 20:35

5 Answers

2

I am going to assume that the HTML is intact, perhaps something like the following:

<ul><li><p>LARGE</p><p>Lamb<br></li></ul>&nbsp;

In which case, I would use HtmlAgilityPack to get the content without having to resort to regex.

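// Requires the HtmlAgilityPack package, plus:
// using System.Linq; using HtmlAgilityPack;
var html = "<ul><li><p>LARGE</p><p>Lamb</p><br></li></ul>&nbsp;";
var hap = new HtmlDocument();
hap.LoadHtml(html);

// All text in the document, with entities such as &nbsp; decoded
string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
// text is now "LARGELamb " (the trailing character is the decoded &nbsp;)

// One entry per text node
string[] lines = hap.DocumentNode.SelectNodes("//text()")
    .Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
// lines is { "LARGE", "Lamb", " " }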

2 Comments

But Mitch, it can't remove a span tag. For example, I have <span style=line-height: 1.6em;>Any Pizza</span> and I want to get only the "Any Pizza" string.
@naeemshah1, did you try it? It worked as you asked, lines will be { "Any Pizza" } in that example.
1

Assuming that you are going to fix your HTML elements, you can decode the string and then read the <p> nodes:

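    // Requires: using System; using System.Collections.Generic; using System.Linq;
    //           using System.Net; using HtmlAgilityPack;
    static void Main(string[] args)
    {
        // Decode the entity-encoded markup into real HTML first.
        string html = WebUtility.HtmlDecode("&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb&lt;/p&gt;");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect every <p> element in the document.
        List<HtmlNode> paragraphNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "p").ToList();

        foreach (HtmlNode node in paragraphNodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }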

You need the HTML Agility Pack for this. You can add a reference to it from the NuGet Package Manager Console like this:

Install-Package HtmlAgilityPack  

4 Comments

PM> Install-Package HtmlAgilityPack Install-Package : File contains corrupted data. At line:1 char:16 + Install-Package <<<< HtmlAgilityPack + CategoryInfo : NotSpecified: (:) [Install-Package], FileFormatException + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PowerShell.Commands.InstallPackageCommand
@naeemshah1 Check this link. You can download it from there too: htmlagilitypack.codeplex.com
But msbirthname, it can't remove a span tag. For example, I have <span style=line-height: 1.6em;>Any Pizza</span> and I want to get only the "Any Pizza" string.
@naeemshah1 Stack Overflow's policy is one problem per question. You asked how to extract the text from <p>text</p> and I gave you the answer. If you have another problem, create a new question and mark this one as answered.
0

Try this:

// Requires: using System.Text.RegularExpressions;
// Erase HTML tags from a string. Entity-encoded tags such as &lt;p&gt; are not touched.
public static string StripHtml(string target)
{
    // Regular expression for HTML tags: '<', one non-whitespace character, then everything up to the next '>'
    Regex stripHtmlExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    return stripHtmlExpression.Replace(target, string.Empty);
}

Call it like this:

string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
// strippedString is "hello world!"
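
As a quick check of what this pattern does and does not handle, here is a minimal self-contained sketch. The class name `StripHtmlDemo` and the field name `TagPattern` are made up for illustration; the pattern is the one from this answer, and the inputs come from the comments and the question. It suggests that the span case raised in the comments is stripped as expected, while entity-encoded markup passes through unchanged because it contains no literal `<`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Same pattern as in the answer: '<', a non-whitespace character, then everything up to '>'.
    private static readonly Regex TagPattern = new Regex("<\\S[^><]*>", RegexOptions.Compiled);

    public static string StripHtml(string target)
    {
        return TagPattern.Replace(target, string.Empty);
    }

    public static void Main()
    {
        // A tag with an unquoted attribute value is still matched as one tag.
        Console.WriteLine(StripHtml("<span style=line-height: 1.6em;>Any Pizza</span>"));
        // prints: Any Pizza

        // Entity-encoded markup contains no literal '<', so nothing is removed.
        Console.WriteLine(StripHtml("&lt;p&gt;LARGE&lt;/p&gt;"));
        // prints: &lt;p&gt;LARGE&lt;/p&gt;
    }
}
```

So for data stored in encoded form, the string would have to be decoded before this method is applied.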

Comments

0

Assuming that:

  • the original string is always going to be in that specific format, and that
  • you cannot add the HTMLAgilityPack,

here is a quick and dirty way of getting what you want:

    static void Main(string[] args)
    {
        // Split original string on the 'separator' string.
        string originalString = "&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb;<br>;li;ul;&nbsp;";
        string[] sSeparator = new string[] { "&lt;/p&gt;&lt;p&gt;" };
        string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);

        // Prepare to filter the 'prefix' and 'postfix' strings
        string prefix = "&lt;p&gt;";
        string postfix = ";<br>;li;ul;&nbsp;";
        int prefixLength = prefix.Length;
        int postfixLength = postfix.Length;

        // Iterate over the split string and clean up
        string s = string.Empty;
        for (int i = 0; i < splitString.Length; i++)
        {
            s = splitString[i];
            if (s.Contains(prefix))
            {
                s = s.Remove(s.IndexOf(prefix), prefixLength);

            }
            if (s.Contains(postfix))
            {
                s = s.Remove(s.IndexOf(postfix), postfixLength);
            }

            splitString[i] = s;
            Console.WriteLine(splitString[i]);
        }

        Console.ReadLine();
    }

Comments

0
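// Requires: using System.Web; using System.Text.RegularExpressions;
// Decode entities such as &lt; and &gt; into literal characters (sData is the string read from the database)
String sResult = HttpUtility.HtmlDecode(sData);
// Remove HTML tags delimited by <>
String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);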
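
The two steps above (decode, then strip) can be sketched as a small runnable program. This is only an illustration under a few assumptions: the input is the well-formed encoded sample used in another answer rather than the broken string from the question, `WebUtility.HtmlDecode` from `System.Net` is swapped in for `HttpUtility.HtmlDecode` so that no reference to System.Web is needed, and `HtmlTextExtractor` and `ExtractText` are hypothetical names. It also splits on the tags instead of deleting them, so "LARGE" and "Lamb" come back as separate strings.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Decodes entities, then splits on anything that looks like a tag
    // and keeps only the non-blank pieces of text in between.
    public static string[] ExtractText(string encodedHtml)
    {
        string decoded = WebUtility.HtmlDecode(encodedHtml);
        return Regex.Split(decoded, "<[^>]*>")
            .Select(piece => piece.Trim())
            .Where(piece => piece.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string[] parts = ExtractText("&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb&lt;/p&gt;");
        Console.WriteLine(string.Join(", ", parts));
        // prints: LARGE, Lamb
    }
}
```

As the comments on the question point out, a regex like this is fragile on malformed markup, so this is a fallback for simple, predictable input rather than a general HTML parser.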

Comments
