Extract strings using Regex

Question

I want to download an HTML source, search it for the username and other information, and then display that in my program. I'm pretty new to programming, and a complete beginner when it comes to things like Regex, so I hope you can explain it to me.

I have used Regex once before, to extract a K/D ratio from an HTML source. For that I used this pattern:

string pattern = @"<span class=""kdratio"">\d+\.\d+";

But I have no idea how to start on this one...

This is the line of the source that contains the information:

<section class="profile-header" profile="true" motto="user's motto" user="User" figure="hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64">

I only need the user="User" and figure="x" parts. I haven't tried anything yet because I don't know where to start: this HTML line looks very different from what I have experience with.

The regex user="([^"]*?)" figure="([^"]*?)" would work (screenshot: i.sstatic.net/i2Nkt.png). But it'd be better to use an HTML parser to extract the values of the user and figure attributes of this section element; class="profile-header" looks like a good unique identifier for it. See stackoverflow.com/questions/846994/how-to-use-html-agility-pack to learn how to use HtmlAgilityPack to parse the HTML, find the <section> node, and read its attributes.
– Maximilian Gerhardt, Commented Jan 24, 2016 at 1:04
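
For the regex route suggested in the comment above, a minimal sketch in JavaScript might look like the following. The helper name extractProfile is made up for illustration, and the pattern assumes double-quoted attributes with user directly followed by figure, as in the sample line. The pattern itself uses only syntax that .NET's Regex class also accepts.

```javascript
// Hypothetical helper (not from the original post): pulls the user and
// figure attribute values out of the profile-header line.
// Assumes double-quoted attributes, with user immediately before figure.
function extractProfile(html) {
  const pattern = /user="([^"]*?)" figure="([^"]*?)"/;
  const match = pattern.exec(html);
  if (match === null) {
    return null; // the line did not have the expected shape
  }
  // match[1] and match[2] are the two capture groups
  return { user: match[1], figure: match[2] };
}

// The sample line from the question
const line =
  '<section class="profile-header" profile="true" motto="user\'s motto" ' +
  'user="User" figure="hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64">';

const profile = extractProfile(line);
console.log(profile.user);
console.log(profile.figure);
```

The parentheses are what separate the wanted values from the surrounding attribute syntax, so the caller gets only the text between the quotes.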

Dai · Accepted Answer · answered 2016-01-24, last edited by wp78de 2017-11-28 02:45:08Z

3

Regular expressions are not a good idea for matching HTML unless you are matching a single, very simple tag. See: RegEx match open tags except XHTML self-contained tags

I recommend using an HTML DOM-parsing library with XPath or CSS selectors to get the information you want. For .NET, HtmlAgilityPack is the usual recommendation. For CSS selectors you'll also want Fizzler (an add-on for HtmlAgilityPack).

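In JavaScript (easily rewritten to C# and HtmlAgilityPack), selecting the element and reading its attributes would look like this:

var section = document.querySelector(
    "section[class=profile-header][profile=true]"
);
var user = section.getAttribute("user");     // "User" for the line in the question
var figure = section.getAttribute("figure"); // the figure string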

HtmlAgilityPack: http://html-agility-pack.net
Fizzler: https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/

edited Nov 28, 2017 at 2:45

wp78de


answered Jan 24, 2016 at 1:09

Dai



1 Comment

Remi Over a year ago

Yes, that's what I was afraid of... Many people suggest HtmlAgilityPack, but what it is and how to use it has always been a mystery to me. Time to find out, I guess.

Arin Ghazarian · Accepted Answer · 2016-01-24 01:08:54Z

0

Regex is generally not a good choice for parsing HTML. HTML can be complicated (attribute order, quoting, and nesting all vary), so it is hard to write a single Regex that matches every case. Use a parser such as Html Agility Pack instead.
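
To make the brittleness concrete, here is a small JavaScript sketch that runs the pattern from the comment under the question against three inputs. The three tag strings are invented for illustration (with a shortened figure value) and are not taken from the real page; only the pattern comes from the thread.

```javascript
// Pattern from the comment under the question: it expects double quotes
// and user immediately followed by figure.
const pattern = /user="([^"]*?)" figure="([^"]*?)"/;

// Hypothetical variants of the same tag (shortened figure value)
const original = '<section class="profile-header" user="User" figure="hr-3322-45">';
const reordered = '<section class="profile-header" figure="hr-3322-45" user="User">';
const singleQuoted = "<section class='profile-header' user='User' figure='hr-3322-45'>";

// true where the pattern matches, false where it does not
const results = [original, reordered, singleQuoted].map((html) => pattern.test(html));
console.log(results);
```

Only the first input matches. The reordered and single-quoted variants are equally valid HTML, and a parser would read the same attribute values from all three, which is why a parser is the more robust choice when the page markup might change.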

answered Jan 24, 2016 at 1:08

Arin Ghazarian

