c# - reading HTML?

Question

I'm developing a program in C# and I need some help. I'm trying to create an array or a list of the items that are displayed on a certain website. What I want to do is read each anchor's text and its href. For example, given this HTML:

<div class="menu-1">
    <div class="items">
        <div class="minor">
            <ul>
                <li class="menu-item">
                    <a class="menu-link" title="Item-1" id="menu-item-1"
                    href="/?item=1">Item 1</a>
                </li>
                <li class="menu-item">
                    <a class="menu-link" title="Item-1" id="menu-item-2"
                    href="/?item=2">Item 2</a>
                </li>
                <li class="menu-item">
                    <a class="menu-link" title="Item-1" id="menu-item-3"
                    href="/?item=3">Item 3</a>
                </li>
                <li class="menu-item">
                    <a class="menu-link" title="Item-1" id="menu-item-4"
                    href="/?item=4">Item 4</a>
                </li>
                <li class="menu-item">
                    <a class="menu-link" title="Item-1" id="menu-item-5"
                    href="/?item=5">Item 5</a>
                </li>
            </ul>
        </div>
    </div>
</div>

So from that HTML I would like to read this:

string[,] array = {{"Item 1", "/?item=1"}, {"Item 2", "/?item=2"},
    {"Item 3", "/?item=3"}, {"Item 4", "/?item=4"}, {"Item 5", "/?item=5"}};

The HTML is an example I wrote; the actual site does not look like that.

Have you looked at the XmlTextReader stream? You'll catch all the <a> elements and, as a plus, it's quick even with a big XML file.
– Adriano Repetti, Commented May 22, 2012 at 20:11

Antonio Bakula · Accepted Answer · 2012-05-22 20:35:29Z

9

As others said, HtmlAgilityPack is the best choice for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selects. This SelectNodes call gets all anchors whose ID starts with menu-item-:

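  // Requires: using System; using HtmlAgilityPack;
  HtmlDocument doc = new HtmlDocument();
  doc.Load(htmlFile);
  var myNodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");
  // SelectNodes returns null (not an empty collection) when nothing matches
  if (myNodes != null)
  {
    foreach (HtmlNode node in myNodes)
    {
      Console.WriteLine(node.Id);
    }
  }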
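
To get the anchor text and href pairs the question asks for, the same selection could be extended roughly as follows. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; `MenuScraper`, `ExtractLinks`, and the inline sample markup are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MenuScraper
{
    // Hypothetical helper: returns (text, href) pairs for every anchor
    // whose id starts with "menu-item-".
    public static List<(string Text, string Href)> ExtractLinks(string html)
    {
        var links = new List<(string Text, string Href)>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");
        if (nodes == null)
        {
            // No matching anchors: SelectNodes yields null, so return the empty list.
            return links;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; in the visible text and trim whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            // Fall back to an empty string when the anchor has no href attribute.
            string href = node.GetAttributeValue("href", string.Empty);
            links.Add((text, href));
        }
        return links;
    }

    public static void Main()
    {
        // Assumed sample input, shortened from the markup in the question.
        string html =
            "<ul>" +
            "<li class=\"menu-item\"><a class=\"menu-link\" id=\"menu-item-1\" href=\"/?item=1\">Item 1</a></li>" +
            "<li class=\"menu-item\"><a class=\"menu-link\" id=\"menu-item-2\" href=\"/?item=2\">Item 2</a></li>" +
            "</ul>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link.Text + " -> " + link.Href);
        }
    }
}
```

A list of pairs is used here instead of `string[,]` because the number of anchors is not known up front; it can be copied into a two-dimensional array afterwards if that shape is required.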


2 Comments

paul1923 Over a year ago

The HAP Explorer? Where do you get that from?

Antonio Bakula Over a year ago

try here github.com/tomap/HtmlAgilityPack/tree/master/HAPExplorer

MiMo · Accepted Answer · 2012-05-22 20:18:05Z

2

If the HTML is valid XML, you can load it using the XmlDocument class and then access the pieces you want using XPath, or you can use an XmlReader as Adriano suggests (a bit more work).

If the HTML is not valid XML, I'd suggest using an existing HTML parser (see for example this), which worked OK for us.
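
For the well-formed case, loading the markup with XmlDocument and querying it with XPath might look like the sketch below. It uses only System.Xml from the base class library; `XmlMenuReader`, `ReadLinks`, and the sample string are hypothetical names for illustration, and the input is assumed to be well-formed XML with no default namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlMenuReader
{
    // Hypothetical helper: returns (text, href) pairs for anchors whose id
    // starts with "menu-item-". Throws XmlException if the input is not
    // well-formed XML.
    public static List<(string Text, string Href)> ReadLinks(string xhtml)
    {
        var links = new List<(string Text, string Href)>();

        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XmlNode.SelectNodes returns an empty list when nothing matches.
        XmlNodeList anchors = doc.SelectNodes("//a[starts-with(@id,'menu-item-')]");
        foreach (XmlElement anchor in anchors)
        {
            // GetAttribute returns an empty string if href is missing.
            links.Add((anchor.InnerText.Trim(), anchor.GetAttribute("href")));
        }
        return links;
    }

    public static void Main()
    {
        // Assumed sample input, shortened from the markup in the question.
        string xhtml =
            "<div class=\"menu-1\"><ul>" +
            "<li class=\"menu-item\"><a class=\"menu-link\" id=\"menu-item-1\" href=\"/?item=1\">Item 1</a></li>" +
            "<li class=\"menu-item\"><a class=\"menu-link\" id=\"menu-item-2\" href=\"/?item=2\">Item 2</a></li>" +
            "</ul></div>";

        foreach (var link in ReadLinks(xhtml))
        {
            Console.WriteLine(link.Text + " -> " + link.Href);
        }
    }
}
```

If the page declares the XHTML default namespace, the plain `//a` path will not match; in that case the query needs an XmlNamespaceManager and a prefixed path.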


1 Comment

Toni Over a year ago

XML Validator

Gregoire · Accepted Answer · 2012-05-22 20:24:01Z

1

You can also use HtmlAgilityPack.


Helstein · Accepted Answer · 2012-05-22 20:33:34Z

1

I think this case is simple enough to use a regular expression, like <a[^>]*title="([^"]*)"[^>]*href="([^"]*)" (the [^>]* parts keep the match inside a single tag and, unlike ., also match line breaks between attributes):

string strRegex = @"<a[^>]*title=""([^""]*)""[^>]*href=""([^""]*)""";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);

string strTargetString = ...;

foreach (Match myMatch in myRegex.Matches(strTargetString))
{
  if (myMatch.Success)
  {
    // Use the groups matched:
    // myMatch.Groups[1].Value is the title, myMatch.Groups[2].Value is the href
  }
}
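
For the text/href pairs the question asks for, a regex-based sketch could look like the following. It is only an illustration under narrow assumptions: attributes are double-quoted, href is preceded by whitespace, and anchors are not nested. `RegexMenuReader`, `ExtractLinks`, and the sample string are hypothetical names, and a real HTML parser remains the more robust option for arbitrary pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexMenuReader
{
    // Group 1 captures the href value, group 2 the text between <a ...> and </a>.
    // [^>] also matches newlines, so attributes may be spread over several lines;
    // Singleline lets (.*?) span line breaks inside the anchor text.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*""([^""]*)""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns (text, href) pairs for every matched anchor.
    public static List<(string Text, string Href)> ExtractLinks(string html)
    {
        var links = new List<(string Text, string Href)>();
        foreach (Match match in AnchorRegex.Matches(html))
        {
            links.Add((match.Groups[2].Value.Trim(), match.Groups[1].Value));
        }
        return links;
    }

    public static void Main()
    {
        // Assumed sample input with the href on its own line, as in the question.
        string html =
            "<li class=\"menu-item\">\n" +
            "  <a class=\"menu-link\" title=\"Item-1\" id=\"menu-item-1\"\n" +
            "  href=\"/?item=1\">Item 1</a>\n" +
            "</li>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link.Text + " -> " + link.Href);
        }
    }
}
```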


1 Comment

MiMo Over a year ago

stackoverflow.com/questions/1732348/…
