How to get Contents from HTML string in Array

Question

I am working with some HTML content. The format of the HTML is like this:

<li>
  <ul>
     <li>Test1</li>
     <li>Test2</li>
  </ul>
  Odd string 1
  <ul>
     <li>Test3</li>
     <li>Test4</li>
  </ul>
  Odd string 2
  <ul>
     <li>Test5</li>
     <li>Test6</li>
  </ul>
</li>

There can be multiple "odd strings" in the HTML content, and I want all of them in an array. Is there an easy way to do this? (I am using C# and HtmlAgilityPack.)

Will they always be between </ul> and <ul>?

– Jonesopolis, Commented Jul 5, 2013 at 12:04
@Jonesy Yes, they will always be between </ul> and <ul>.

– Debajit Mukhopadhyay, Commented Jul 5, 2013 at 12:07

Sergey Berezovskiy · Accepted Answer · 2013-07-05 12:17:48Z

3

Select the ul elements and take each one's next sibling node; when that sibling is a non-empty text node, it is one of your odd strings:

HtmlDocument html = new HtmlDocument();
html.Load(html_file);
var odds = from ul in html.DocumentNode.Descendants("ul")
           let sibling = ul.NextSibling
           where sibling != null && 
                 sibling.NodeType == HtmlNodeType.Text && // check if text node
                 !String.IsNullOrWhiteSpace(sibling.InnerHtml)
           select sibling.InnerHtml.Trim();
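
Since the question mentions an HTML string and an array, here is a minimal sketch of the same next-sibling idea using LoadHtml (which parses a string) and ToArray. The class name, the htmlString variable, and the sample markup are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class OddStringsSketch
{
    static void Main()
    {
        // Illustrative sample modeled on the markup in the question.
        string htmlString =
            "<li><ul><li>Test1</li><li>Test2</li></ul> Odd string 1 " +
            "<ul><li>Test3</li><li>Test4</li></ul> Odd string 2 " +
            "<ul><li>Test5</li><li>Test6</li></ul></li>";

        var html = new HtmlDocument();
        html.LoadHtml(htmlString); // parse from a string rather than a file path

        string[] odds = html.DocumentNode.Descendants("ul")
            .Select(ul => ul.NextSibling)
            // Keep only siblings that are non-blank text nodes.
            .Where(s => s != null
                        && s.NodeType == HtmlNodeType.Text
                        && !string.IsNullOrWhiteSpace(s.InnerText))
            .Select(s => s.InnerText.Trim())
            .ToArray(); // materialize the lazy query into an array

        foreach (string odd in odds)
        {
            Console.WriteLine(odd);
        }
    }
}
```

If the text may contain entities such as &amp;, passing each value through HtmlEntity.DeEntitize before storing it is one option.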


Jonesopolis · Accepted Answer · 2013-07-05 12:14:04Z

1

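Something like this, which collects the text between each </ul> and the following <ul>:

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var oddStrings = new List<string>();
// Singleline (lowercase "l") lets "." match newlines between the tags.
MatchCollection matches = Regex.Matches(HTMLString, "</ul>(.*?)<ul>", RegexOptions.Singleline);
foreach (Match match in matches)
{
    // Group 1 is the text between </ul> and the next <ul>.
    oddStrings.Add(match.Groups[1].Value.Trim());
}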


OP probably needs a solution using HtmlAgilityPack (note the tags and the last sentence of the question).

– default locale, Commented over a year ago

Marek Musielak · Accepted Answer · 2013-07-05 12:26:29Z

0

Get all the ul descendants and check whether each one's next sibling is a node of type HtmlNodeType.Text and is not empty:

List<string> oddStrings = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode ul in doc.DocumentNode.Descendants("ul"))
{
    HtmlNode nextSibling = ul.NextSibling;
    if (nextSibling != null && nextSibling.NodeType == HtmlNodeType.Text)
    {
        string trimmedText = nextSibling.InnerText.Trim();
        if (!String.IsNullOrEmpty(trimmedText))
        {
            oddStrings.Add(trimmedText);
        }
    }
}


rajeemcariazo · Accepted Answer · 2013-07-05 12:48:42Z

0

Agility Pack can already query those text nodes with XPath:

var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/li[1]/text()");

answered Jul 5, 2013 at 12:31, edited Jul 5, 2013 at 12:48

DonBoitnott · Accepted Answer · 2013-07-05 12:59:52Z

0

Use this XPath:

//body/li[1]/text()
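
The XPath expressions in these answers assume the fragment sits inside html and body elements. As far as I know, HtmlAgilityPack does not add those wrappers when parsing a bare fragment, so the following is a hedged sketch that uses a relative path instead. The //li[ul]/text() expression, the class name, and the sample string are my assumptions, not taken from the answers.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class XPathSketch
{
    static void Main()
    {
        // Illustrative sample modeled on the markup in the question.
        string htmlString =
            "<li><ul><li>Test1</li><li>Test2</li></ul> Odd string 1 " +
            "<ul><li>Test3</li><li>Test4</li></ul> Odd string 2 " +
            "<ul><li>Test5</li><li>Test6</li></ul></li>";

        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // li[ul] keeps only the outer <li> (the one containing lists), so the
        // text of the inner items (Test1, Test2, ...) is not selected.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li[ul]/text()");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        string[] oddStrings = nodes == null
            ? new string[0]
            : nodes.Select(n => n.InnerText.Trim())
                   .Where(t => t.Length > 0) // drop whitespace-only text nodes
                   .ToArray();

        Console.WriteLine(string.Join(", ", oddStrings));
    }
}
```

Filtering in C# rather than inside the XPath keeps the expression simple; whitespace-only text nodes between tags are common in indented markup like the sample in the question.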

answered Jul 5, 2013 at 12:40 by JunoPatch, edited Jul 5, 2013 at 12:59 by DonBoitnott
