3

I am working with some HTML content. Its format looks like this:

<li>
  <ul>
     <li>Test1</li>
     <li>Test2</li>
  </ul>
  Odd string 1
  <ul>
     <li>Test3</li>
     <li>Test4</li>
  </ul>
  Odd string 2
  <ul>
     <li>Test5</li>
     <li>Test6</li>
  </ul>
</li>

There can be multiple "odd strings" in the HTML content, and I want all of them in an array. Is there an easy way to do this? (I am using C# and HtmlAgilityPack.)

2
  • Will they always be between </ul> and <ul>? Commented Jul 5, 2013 at 12:04
  • @Jonesy Yes, they will always be between </ul> and <ul>. Commented Jul 5, 2013 at 12:07

5 Answers

3

Select the ul elements and look at each one's next sibling node, which will be your text:

HtmlDocument html = new HtmlDocument();
html.Load(html_file);
var odds = from ul in html.DocumentNode.Descendants("ul")
           let sibling = ul.NextSibling
           where sibling != null && 
                 sibling.NodeType == HtmlNodeType.Text && // check if text node
                 !String.IsNullOrWhiteSpace(sibling.InnerHtml)
           select sibling.InnerHtml.Trim();

1

Something like:

MatchCollection matches = Regex.Matches(HTMLString, "</ul>.*?<ul>", RegexOptions.Singleline);
foreach (Match match in matches)
{
    String oddstring = match.ToString().Replace("</ul>","").Replace("<ul>","");
}

1 Comment

OP probably needs a solution using HtmlAgilityPack (notice the tags and the last sentence of the question).

0

Get all the ul descendants and check whether each one's next sibling node is of type HtmlNodeType.Text and is not empty:

List<string> oddStrings = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode ul in doc.DocumentNode.Descendants("ul"))
{
    HtmlNode nextSibling = ul.NextSibling;
    if (nextSibling != null && nextSibling.NodeType == HtmlNodeType.Text)
    {
        string trimmedText = nextSibling.InnerText.Trim();
        if (!String.IsNullOrEmpty(trimmedText))
        {
            oddStrings.Add(trimmedText);
        }
    }
}

0

HtmlAgilityPack can already query those text nodes with XPath:

var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/li[1]/text()");
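
For a fuller picture, here is a minimal sketch of the XPath approach, under a few assumptions. The absolute path above only matches if the document really has html and body elements; as far as I know HtmlAgilityPack does not add them when you load a bare fragment like the one in the question, so this sketch uses the relative expression //li[ul]/text() instead. OddStrings and Extract are hypothetical names for illustration. Note that SelectNodes returns null when nothing matches, and that the whitespace between tags also shows up as text nodes, so both cases are handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class OddStrings
{
    // Hypothetical helper, not part of HtmlAgilityPack or the answers above.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Text-node children of any <li> that directly contains a <ul>.
        // Assumption: the odd strings are direct children of such an <li>.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li[ul]/text()");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
            return new List<string>();

        return nodes
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0)   // drop whitespace-only text nodes
            .ToList();
    }
}
```

If an array is needed, as in the question, call OddStrings.Extract(html).ToArray().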

0

Use this XPath:

//body/li[1]/text()
