I have this program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using System.Xml;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

namespace Reviews_browser_test
{
    class Program
    {
        static void Main(string[] args)
        {

            Console.WriteLine("Enter the product that you want to find: ");
            string tovar = Console.ReadLine();
            string page = "http://www.ulmart.ru/search?string=&rootCategory=&sort=6";
            page = page.Insert(35, tovar); // inserts good's id into url


            HttpWebRequest site = (HttpWebRequest)WebRequest.Create(page);

            HttpWebResponse response = (HttpWebResponse)site.GetResponse();
            Stream dataStream = response.GetResponseStream();
            StreamReader read = new StreamReader(dataStream);
            String data = read.ReadToEnd();
            Console.WriteLine(data);

            System.IO.File.WriteAllText("ulmart.html", data);

            Console.ReadKey();


            Match m;


            string pattern = "<span[^>]*?>[0-9]{4,10}</span>";


            m = Regex.Match(data, pattern);
            while (m.Success)
            {
                Console.WriteLine("Found an id " + m.Groups[1] + " at string "+ m.Groups[1].Index);
                m = m.NextMatch();
            }

            Console.ReadKey();
        }
    }
}

I want to get all the ID numbers from the HTML file, but I don't know why this regex doesn't find anything in my program, while Notepad++ finds every ID with the same regex. Here is an example of an HTML string that the regex should match:

<span class="num">3609304</span>

Where is my mistake?

  • m.Groups[1] is empty because your regex has no capturing group. You can use <span[^>]*?>([0-9]{4,10})</span> and access the value with m.Groups[1].Value. However, you will be safer using an HTML parser with XPath to select exactly the elements you need, rather than doing it the hard way with regex. Are you trying to get the inner text of all span tags with class="num"? Commented Oct 12, 2015 at 10:08
  • HTML Agility Pack is much better than Regex in these situations. Commented Oct 12, 2015 at 10:14
  • m value is still NULL Commented Oct 12, 2015 at 10:24
  • There is no span elements with the text you are looking for on that page. Commented Oct 12, 2015 at 10:27
  • No, they exist; Notepad++ finds them with this regex. For example: <span class="num">3609304</span> Commented Oct 12, 2015 at 10:28
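
For reference, the capturing-group fix suggested in the first comment could look roughly like the sketch below. The class and method names (SpanIdRegexDemo, ExtractIds) are made up for illustration and are not from the original post. It only runs the pattern against the example string from the question, so it says nothing about what the live page actually returns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original program.
public static class SpanIdRegexDemo
{
    // Same pattern as in the question, with parentheses added so that
    // the digits are captured as group 1.
    private const string Pattern = "<span[^>]*?>([0-9]{4,10})</span>";

    public static List<string> ExtractIds(string html)
    {
        var ids = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Groups[0] is the whole match; Groups[1] is the captured digits.
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }

    public static void Main()
    {
        string sample = "<span class=\"num\">3609304</span>";
        foreach (string id in ExtractIds(sample))
        {
            Console.WriteLine("Found an id " + id); // prints "Found an id 3609304"
        }
    }
}
```

Without the parentheses the pattern can still match, but m.Groups[1] refers to a group that does not exist, so its Value is an empty string.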

1 Answer

The best way to solve the issue is to use HtmlAgilityPack. Install it as a NuGet package, and use the following method:

public List<string> HtmlAgilityPackGetNumericSpan4to10(string html)
{
        var vals = new List<string>();
        HtmlAgilityPack.HtmlDocument hap;
        Uri uriResult;
        if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) 
                            && (uriResult.Scheme == Uri.UriSchemeHttp
                                || uriResult.Scheme == Uri.UriSchemeHttps))
        { // html is an http or https URL 
            var doc = new HtmlAgilityPack.HtmlWeb();
            hap = doc.Load(uriResult.AbsoluteUri);
        }
        else
        { // html is a string
            hap = new HtmlAgilityPack.HtmlDocument();
            hap.LoadHtml(html);
        }
        var nodes = hap.DocumentNode.SelectNodes("//span[@class='num']");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                var val = node.InnerText;
                if (val.ToCharArray().All(p => Char.IsDigit(p)) 
                                 && val.Length >= 4 && val.Length <= 10)
                    vals.Add(val);
            }
        }
        return vals;
}

With "//span[@class='num']" we collect only the span tags whose class attribute value is exactly num. With if (val.ToCharArray().All(p => Char.IsDigit(p)) && val.Length >= 4 && val.Length <= 10) we check that the inner text consists only of digits and that its length is from 4 to 10 characters. The All call needs using System.Linq; which the program in the question already has.

[Screenshot: result with just your example string]

3 Comments

I get an error on HtmlAgilityPackGetNumericSpan4to10: Error CS0116: A namespace cannot directly contain members such as fields or methods
Use it inside a class.
Glad it worked for you. Please consider also upvoting if my answer proved helpful to you (see How to upvote on Stack Overflow?).
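
To illustrate the "use it inside a class" advice, here is a minimal sketch of how the method could sit next to Main in the asker's Program class. It assumes the HtmlAgilityPack NuGet package is installed. GetNumericSpanIds is a made-up, string-only variant of the answer's method (no URL handling), and making it static is my assumption so that the static Main can call it directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace Reviews_browser_test
{
    class Program
    {
        // Declared inside the class rather than directly in the namespace,
        // which is what CS0116 complains about.
        // Hypothetical string-only variant of the answer's method.
        public static List<string> GetNumericSpanIds(string html)
        {
            var vals = new List<string>();
            var hap = new HtmlAgilityPack.HtmlDocument();
            hap.LoadHtml(html);

            // SelectNodes returns null when nothing matches the XPath.
            var nodes = hap.DocumentNode.SelectNodes("//span[@class='num']");
            if (nodes == null)
                return vals;

            foreach (var node in nodes)
            {
                var val = node.InnerText;
                // Keep only all-digit values that are 4 to 10 characters long.
                if (val.All(c => Char.IsDigit(c)) && val.Length >= 4 && val.Length <= 10)
                    vals.Add(val);
            }
            return vals;
        }

        static void Main(string[] args)
        {
            // Example string from the question; prints 3609304.
            foreach (var id in GetNumericSpanIds("<span class=\"num\">3609304</span>"))
                Console.WriteLine(id);
        }
    }
}
```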
