Why doesn't my regex work?

Question

I have this program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using System.Xml;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

namespace Reviews_browser_test
{
    class Program
    {
        static void Main(string[] args)
        {

            Console.WriteLine("Enter the product you want to find: ");
            string tovar = Console.ReadLine();
            string page = "http://www.ulmart.ru/search?string=&rootCategory=&sort=6";
            page = page.Insert(35, tovar); // inserts the search term right after "string=" in the URL


            HttpWebRequest site = (HttpWebRequest)WebRequest.Create(page);

            HttpWebResponse response = (HttpWebResponse)site.GetResponse();
            Stream dataStream = response.GetResponseStream();
            StreamReader read = new StreamReader(dataStream);
            String data = read.ReadToEnd();
            Console.WriteLine(data);

            System.IO.File.WriteAllText("ulmart.html", data);

            Console.ReadKey();


            Match m;


            string pattern = "<span[^>]*?>[0-9]{4,10}</span>";


            m = Regex.Match(data, pattern);
            while (m.Success)
            {
                Console.WriteLine("Found an id " + m.Groups[1] + " at string "+ m.Groups[1].Index);
                m = m.NextMatch();
            }

            Console.ReadKey();
        }
    }
}

I want to get all the ID numbers from the HTML file. But I don't know why this regex finds nothing in my program, while Notepad++ finds every ID with it. An example of an HTML string that this regex should match:

<span class="num">3609304</span>

Where is my mistake?

4 Comments

m.Groups[1] is empty because your regex has no capturing group. You can use <span[^>]*?>([0-9]{4,10})</span> and access the value with m.Groups[1].Value. However, it is safer to use an HTML parser with XPath to select exactly the elements you need than to do it the hard way with regex. Are you trying to get the inner text of all span tags with class="num"?
– Wiktor Stribiżew, Commented Oct 12, 2015 at 10:08
HTML Agility Pack is much better than regex in these situations.
– Fᴀʀʜᴀɴ Aɴᴀᴍ, Commented Oct 12, 2015 at 10:14
There are no span elements with the text you are looking for on that page.
– Wiktor Stribiżew, Commented Oct 12, 2015 at 10:27
No, they exist; Notepad++ finds them with this regex. An example: <span class="num">3609304</span>
– Богдан Лашков, Commented Oct 12, 2015 at 10:28
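
The capturing-group fix suggested in the first comment can be sketched as follows. This is only an illustration under stated assumptions: the class name RegexIdExtractor, the method name ExtractIds, and the second span in the sample input are made up here, and the sample string stands in for the downloaded page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RegexIdExtractor
{
    // Hypothetical helper: returns the digits captured by group 1 for every match.
    public static List<string> ExtractIds(string html)
    {
        var ids = new List<string>();
        // The parentheses around [0-9]{4,10} create capturing group 1.
        // Without them, only group 0 (the whole match) exists.
        const string pattern = "<span[^>]*?>([0-9]{4,10})</span>";
        foreach (Match m in Regex.Matches(html, pattern))
        {
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }

    static void Main()
    {
        // Assumed sample input; the second span has too few digits to match.
        string html = "<span class=\"num\">3609304</span><span>12</span>";
        foreach (string id in ExtractIds(html))
        {
            Console.WriteLine("Found an id " + id);
        }
    }
}
```

With the original pattern, m.Value (or m.Groups[0].Value) would still hold the whole matched tag; only m.Groups[1] needs the parentheses.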

Wiktor Stribiżew · Accepted Answer · 2015-10-12 10:41:20Z


The best way to solve the issue is to use HtmlAgilityPack. Install it as a NuGet package, and use the following method:

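// Requires the HtmlAgilityPack NuGet package and
// using System; using System.Collections.Generic; using System.Linq;
// The method must be declared inside a class.
public List<string> HtmlAgilityPackGetNumericSpan4to10(string html)
{
    var vals = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    // Only http URLs take the first branch; any other input
    // (including an https URL) is parsed as an HTML string.
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && uriResult.Scheme == Uri.UriSchemeHttp)
    { // html is a URL: download and parse the page
        var doc = new HtmlAgilityPack.HtmlWeb();
        hap = doc.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select every span whose class attribute is exactly "num"
    var nodes = hap.DocumentNode.SelectNodes("//span[@class='num']");
    if (nodes != null) // SelectNodes returns null when nothing matches
    {
        foreach (var node in nodes)
        {
            var val = node.InnerText;
            // Keep only all-digit values that are 4 to 10 characters long
            if (val.ToCharArray().All(p => Char.IsDigit(p))
                && val.Length >= 4 && val.Length <= 10)
                vals.Add(val);
        }
    }
    return vals;
}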

With "//span[@class='num']" we collect only the span tags whose class attribute value equals num. With if (val.ToCharArray().All(p => Char.IsDigit(p)) && val.Length >= 4 && val.Length <= 10) we check that the inner text consists only of digits and that its length is between 4 and 10.
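
One detail worth knowing about the digit check: Char.IsDigit also accepts decimal digits from other Unicode scripts, so it is slightly broader than the [0-9] class in the question's regex. Below is a minimal sketch of a stricter filter, assuming only ASCII digits should count. The names NumericIdFilter and IsNumericId are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Linq;

class NumericIdFilter
{
    // Hypothetical helper: true when text is 4 to 10 ASCII digits,
    // mirroring the [0-9]{4,10} part of the question's regex.
    public static bool IsNumericId(string text)
    {
        return text.Length >= 4 && text.Length <= 10
            && text.All(c => c >= '0' && c <= '9');
    }

    static void Main()
    {
        Console.WriteLine(IsNumericId("3609304")); // prints True
        Console.WriteLine(IsNumericId("123"));     // prints False (too short)
        Console.WriteLine(IsNumericId("36o9304")); // prints False (contains a letter)
    }
}
```

If the page wraps the number in whitespace, calling Trim() on the inner text before the check may also be needed; that depends on the actual markup.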

Result with just your example string: a list containing the single value 3609304.


3 Comments

Богдан Лашков Over a year ago

I get an error on HtmlAgilityPackGetNumericSpan4to10: Error CS0116: A namespace cannot directly contain members such as fields or methods

Wiktor Stribiżew Over a year ago

Use it inside a class.

Wiktor Stribiżew Over a year ago

Glad it worked for you. Please consider also upvoting if my answer proved helpful to you (see How to upvote on Stack Overflow?).
