I use this method to convert HTML to plain text, but it has a bug with heading tags (<h1>, <h2>, <h3>, ...).

Method:

public string HtmlToPlainText(string htmlText)
{
    //const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<"; // matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)"; // matches any character between '<' and '>', even when the end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>"; // matches: <br>, <br/>, <br />, <BR>, <BR/>, <BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    //var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = htmlText;
    // Decode HTML-specific characters
    text = System.Net.WebUtility.HtmlDecode(text);
    // Remove tag whitespace / line breaks
    //text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    return text;
}

This is my HTML text:

<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>

This is my result:

This is a simple title Lorem ipsum dolor sit amet consectetur,
adipisicing elit.

The result should be:

This is a simple title

Lorem ipsum dolor sit amet consectetur, adipisicing elit.

I think the error comes from the strip-formatting step. How can I solve it?

  • You shouldn't use regex to extract data from html. Commented Dec 25, 2021 at 13:05
  • Did you mean <br /> instead of </br>? Commented Dec 25, 2021 at 13:07
  • Does this answer your question? RegEx match open tags except XHTML self-contained tags Commented Dec 25, 2021 at 13:07
  • Why did you disclose that your question comes from a solution posted here? Commented Dec 25, 2021 at 13:17
  • Does this answer your question? How do you convert Html to plain text? Commented Dec 25, 2021 at 13:27

1 Answer

Parsing HTML is not an easy task, even for a subset of HTML. Regex may feel like a good fit for it, but it is not: to parse HTML, you should use an HTML parser. In C#, AngleSharp and HtmlAgilityPack are the most common choices. Here is an example with AngleSharp:

using System;
using AngleSharp;
using AngleSharp.Html.Parser;

class MyClass {
    static void Main() {
        //Use the default configuration for AngleSharp
        var config = Configuration.Default;

        //Create a new context for evaluating webpages with the given config
        var context = BrowsingContext.New(config);

        //Source to be parsed
        var source = @"<h3> This is a simple title </h3>
</br>
<p>Lorem ipsum <b> dolor sit </b> amet consectetur, <i>adipisicing elit.</i> </p>
";

        //Create a parser to specify the document to load (here from our fixed string)
        var parser = context.GetService<IHtmlParser>();
        var document = parser.ParseDocument(source);

        //Do something with document like the following
        Console.WriteLine(document.DocumentElement.TextContent);
    }
}
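
Note that TextContent simply concatenates the text nodes, so the line breaks you get are whatever whitespace the source markup happens to contain. To control the layout, you can walk the parsed tree yourself. The following is only a minimal sketch of that idea, not a complete converter: the class name PlainText, the method FromHtml, and the BlockTags list are made up for illustration, the set of tags treated as blocks is an assumption you would likely need to extend, and "\n" is used instead of Environment.NewLine to keep the cleanup simple.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

// Hypothetical helper, not part of AngleSharp.
static class PlainText
{
    // Assumption: elements that should start and end a paragraph of their own.
    static readonly HashSet<string> BlockTags = new HashSet<string>
    {
        "h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "li"
    };

    public static string FromHtml(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        var sb = new StringBuilder();
        Walk(document.Body, sb);

        var text = Regex.Replace(sb.ToString(), @"[ \t]+", " "); // collapse runs of spaces
        text = Regex.Replace(text, @" ?\n ?", "\n");             // drop spaces around line breaks
        text = Regex.Replace(text, @"\n{3,}", "\n\n");           // at most one blank line in a row
        return text.Trim();
    }

    static void Walk(INode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Whitespace in the source (including newlines) counts as a single space.
                sb.Append(Regex.Replace(child.TextContent, @"\s+", " "));
            }
            else if (child is IElement element)
            {
                if (element.LocalName == "br")
                {
                    sb.Append('\n');
                    continue;
                }

                var isBlock = BlockTags.Contains(element.LocalName);
                if (isBlock) sb.Append("\n\n"); // blank line before a block
                Walk(element, sb);
                if (isBlock) sb.Append("\n\n"); // blank line after a block
            }
        }
    }
}
```

With the source string from the example above, Console.WriteLine(PlainText.FromHtml(source)) should print the title and the paragraph separated by one blank line. Inline tags such as <b> and <i> add no breaks because they are not in BlockTags.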
