
I need to parse sections from a string of HTML. For example:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>

Parsing the quote section should return:

<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>

Currently I'm using a regular expression to grab the content inside [section=quote]...[/section], but since the sections are entered using a WYSIWYG editor, the section tags themselves get wrapped in a paragraph tag, so the parsed result is:

</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>

The Regular Expression I'm using currently is:

\[section=(.+?)\](.+?)\[/section\]

And I'm also doing some additional cleanup prior to parsing the sections:

protected string CleanHtml(string input) {
    // remove whitespace
    input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    // remove empty p elements
    input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
    return input;
}

Can anyone provide a regular expression that achieves what I'm looking for, or am I wasting my time trying to do this with regex? I've seen references to the HTML Agility Pack. Would that be a better fit for something like this?

[Update]

Thanks to Oscar I have used a combination of the HTML Agility pack and Regex to parse the sections. It still needs a bit of refining but it's nearly there.

public void ParseSections(string content)
{
    this.SourceContent = content;
    this.NonSectionedContent = content;

    content = CleanHtml(content);

    if (!sectionRegex.IsMatch(content))
        return;

    var doc = new HtmlDocument();
    doc.LoadHtml(content);

    bool flag = false;
    string sectionName = string.Empty;
    var sectionContent = new StringBuilder();
    var unsectioned = new StringBuilder();

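    var paragraphs = doc.DocumentNode.SelectNodes("//p");
    if (paragraphs == null)
        return; // SelectNodes returns null when nothing matches

    foreach (var n in paragraphs) {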
        if (startSectionRegex.IsMatch(n.InnerText)) { 
            flag = true;
            sectionName = startSectionRegex.Match(n.InnerText).Groups[1].Value.ToLowerInvariant();
            continue;
        }
        if (endSectionRegex.IsMatch(n.InnerText)) {
            flag = false;
            this.Sections.Add(sectionName, sectionContent.ToString());
            sectionContent.Clear();
            continue;
        }

        if (flag)
            sectionContent.Append(n.OuterHtml);
        else
            unsectioned.Append(n.OuterHtml);
    }

    this.NonSectionedContent = unsectioned.ToString();
}
  • Obligatory link to stackoverflow.com/questions/1732348/… Commented Feb 8, 2011 at 10:36
  • Parsing HTML with regex is usually a bad idea, as HTML is not regular. If you can, take a look at an HTML parser; there are many available, and they will cause far less pain. Commented Feb 8, 2011 at 11:24

2 Answers

The following works, using the HtmlAgilityPack library:

using System;            // Console
using System.Text;       // StringBuilder
using HtmlAgilityPack;

...

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");


bool flag = false;
var sb = new StringBuilder();
foreach (var n in doc.DocumentNode.SelectNodes("//p"))
{
    switch (n.InnerText)
    {
        case "[section=quote]":
            flag = true;
            continue;
        case "[/section]":
            flag = false;
            break;
    }
    if (flag)
    {
        sb.AppendLine(n.OuterHtml);
    }
}

Console.Write(sb);
Console.ReadLine();

If you only want the text Mauris at turpis nec dolor bibendum sollicitudin ac quis neque. without the surrounding <p>...</p>, replace n.OuterHtml with n.InnerHtml.

Of course, you should check whether doc.DocumentNode.SelectNodes("//p") returns null, which it does when no node matches.
If you want to load the HTML from an online source instead of a file, you can do:

var htmlWeb = new HtmlWeb();  
var doc = htmlWeb.Load("http://..../page.html");

Edit:

If [section=quote] and [/section] could be inside any tag (not always <p>), you can replace doc.DocumentNode.SelectNodes("//p") with doc.DocumentNode.SelectNodes("//*").
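
One thing to watch with "//*" is that it also returns nested elements, so a paragraph and the <em> inside it would both be appended. A possible alternative, sketched below, walks only the top-level nodes of the fragment and compares their trimmed InnerText against the markers. It assumes the markers always sit at the top level of the HTML and that the HtmlAgilityPack NuGet package is referenced; the names TopLevelSectionReader and ReadSection are made up for this illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class TopLevelSectionReader
{
    // Returns the outer HTML of every top-level element found between
    // [section=<name>] and [/section], whatever element wraps the markers.
    public static string ReadSection(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string open = "[section=" + name + "]";
        bool inside = false;
        var sb = new StringBuilder();

        // ChildNodes holds only direct children, so nested elements are
        // never visited (and therefore never appended) a second time.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            string text = node.InnerText.Trim();
            if (text == open) { inside = true; continue; }
            if (text == "[/section]") { inside = false; continue; }

            // Skip whitespace-only text nodes between elements.
            if (inside && node.NodeType == HtmlNodeType.Element)
                sb.Append(node.OuterHtml);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string html =
            "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>\n" +
            "<div>[section=quote]</div>\n" +
            "<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>\n" +
            "<div>[/section]</div>";

        Console.WriteLine(ReadSection(html, "quote"));
    }
}
```

If the markers can appear deeper in the tree, this sketch will not find them and a different traversal would be needed.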


1 Comment

Wow thanks. I only just noticed your reply. Let me give it a whirl!

How about replacing

<p>[section=quote]</p>

with

[section=quote]

and

<p>[/section]</p>

with

[/section]

as part of your cleanup. Then you can use your existing regular expression.
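
A rough sketch of that cleanup step, assuming each marker is the only content of a single wrapping element (a p, a div, or anything else). The names SectionMarkerCleanup and UnwrapMarkers are made up for this illustration. The second pattern is the one from the question, with RegexOptions.Singleline added on the assumption that section content may span line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionMarkerCleanup
{
    // Matches a marker that is the sole content of one element, e.g.
    // <p>[section=quote]</p> or <div class="x">[/section]</div>.
    // \1 forces the closing tag to match the opening tag name.
    private static readonly Regex WrappedMarker = new Regex(
        @"<(\w+)[^>]*>\s*(\[/?section[^\]]*\])\s*</\1>",
        RegexOptions.IgnoreCase);

    // The expression from the question; Singleline lets '.' match newlines.
    private static readonly Regex Section = new Regex(
        @"\[section=(.+?)\](.+?)\[/section\]",
        RegexOptions.Singleline);

    // Replaces each wrapped marker with the bare marker text.
    public static string UnwrapMarkers(string html)
    {
        return WrappedMarker.Replace(html, "$2");
    }

    public static void Main()
    {
        string html =
            "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>\n" +
            "<p>[section=quote]</p>\n" +
            "<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>\n" +
            "<p>[/section]</p>";

        Match m = Section.Match(UnwrapMarkers(html));
        Console.WriteLine(m.Groups[1].Value);        // prints: quote
        Console.WriteLine(m.Groups[2].Value.Trim()); // prints the Mauris paragraph, <p> tags included
    }
}
```

Markers wrapped in more than one element, such as <p><strong>[section=quote]</strong></p>, would only lose the innermost wrapper in a single pass, so this is not a general solution.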

1 Comment

Since the HTML content is entirely in the hands of the user, I don't actually know what the [section] tags will be wrapped in (could be div, p, anything).
