29

I'm looking for a regular expression to isolate the src value of an img. (I know that this is not the best way to do this but this is what I have to do in this case)

I have a string which contains simple HTML code, some text and an image. I need to get the value of the src attribute from that string. So far I have only managed to isolate the whole tag.

string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;
2
  • Run a second regex on the img tag to get the src attribute Commented Nov 23, 2010 at 15:06
  • 3
    Obligatory link to this related answer Commented Nov 23, 2010 at 15:16
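
The two-step approach suggested in the first comment could look roughly like the sketch below. The sample string and the second pattern are assumptions made for illustration; only the first pattern comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class TwoStepSrc
{
    static void Main()
    {
        // Hypothetical input; the real original_text is not shown in the question.
        string original_text = "<p>Some text</p><img id=\"logo\" src=\"images/logo.png\" alt=\"Logo\">";

        // Step 1: isolate the whole tag (pattern taken from the question).
        string imgTag = Regex.Match(original_text, @"(<img([^>]+)>)").Value;

        // Step 2: search only inside that tag for a quoted src value.
        // \s before "src" avoids matching attributes such as data-src.
        Match src = Regex.Match(imgTag, @"\ssrc\s*=\s*[""']([^""']*)[""']", RegexOptions.IgnoreCase);

        Console.WriteLine(src.Success ? src.Groups[1].Value : "(no src found)");
        // prints "images/logo.png"
    }
}
```

Restricting the second search to the already isolated tag means the src pattern cannot run into a neighbouring element.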

8 Answers

54
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;

5 Comments

This regex will only work if src is the first attribute of the image. If src comes after id or some other attribute, it will not work.
@ShreekumarS why? There is a .+? between img and src, so there can be all kinds of characters ...
This one is fine: Regex.Match(original_text, "<img.*?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
I would make it a little greedier: <img.*src=[\"'](.+?)[\"'].*> with .* instead of .+?, certainly for the last one, otherwise you always require at least one character. It might not be there if they just close the img tag right after the src attribute.
It's not a good idea to make this greedy: what if there is more than one img element? Your expression might capture all these elements as one match. But you are right about the end of my expression, so I changed it to .*? to allow the element to end right after the src attribute. The first .+? is still right, because there has to be at least one character between img and src: the space ...
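
The greedy versus non-greedy point from these comments can be checked with a small sketch. The sample HTML is made up for illustration; the two patterns are the ones quoted in the answer and in the comments.

```csharp
using System;
using System.Text.RegularExpressions;

class LazyVersusGreedy
{
    static void Main()
    {
        // Hypothetical sample: src is not the first attribute, and there are two images.
        string html = "<img id=\"a\" src=\"one.png\"><p>text</p><img src='two.png' alt=\"x\">";

        // Non-greedy pattern from the answer: one match per img element.
        string lazy = "<img.+?src=[\"'](.+?)[\"'].*?>";
        foreach (Match m in Regex.Matches(html, lazy, RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Groups[1].Value); // one.png, then two.png
        }

        // Greedy variant from the comments: both elements collapse into a single match.
        string greedy = "<img.*src=[\"'](.+?)[\"'].*>";
        Console.WriteLine(Regex.Matches(html, greedy, RegexOptions.IgnoreCase).Count); // 1
    }
}
```

With the greedy variant the first `.*` runs to the end of the input and backtracks to the last src attribute, so only that value is captured.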
15

I know you say you have to use regex, but if possible I would really give this open source project a chance: HtmlAgilityPack

It is really easy to use. I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath expressions to get your elements.

Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.

The code for your query would look something like this: (uncompiled code)

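 List<string> imgSrcs = new List<string>();
 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(htmlText); // or doc.Load(htmlFileStream)
 // SelectNodes returns null when no node matches the XPath
 var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
 if (nodes != null)
 {
     foreach (var img in nodes)
     {
         // attributes are read through the Attributes collection
         HtmlAttribute att = img.Attributes["src"];
         imgSrcs.Add(att.Value);
     }
 }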

1 Comment

I tried this, but it looks like the HtmlAgilityPack API has changed. I have posted an alternative solution to this question.
7

I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:

        List<string> images = new List<string>();
        WebClient client = new WebClient();
        string site = "http://www.mysite.com";
        var htmlText = client.DownloadString(site);

        var htmlDoc = new HtmlDocument()
                    {
                        OptionFixNestedTags = true,
                        OptionAutoCloseOnEnd = true
                    };

        htmlDoc.LoadHtml(htmlText);

        foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
        {
            HtmlAttribute att = img.Attributes["src"];
            images.Add(att.Value);
        }

2 Comments

You should really put //img[@src] in the SelectNodes call (or check that the attribute exists before reading att.Value). And either check the result for null or tack ?? new HtmlNodeCollection(null) onto the SelectNodes call. You'll get a NullReferenceException otherwise.
Instead of adding a new answer, you could also edit the original answer to remove the errors contained in there.
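
A null-safe variant along the lines of the first comment might look like the following sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the method name GetImageSources and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

class ImageSources
{
    // Hypothetical helper: collects the src value of every img that has one.
    static List<string> GetImageSources(string htmlText)
    {
        var images = new List<string>();
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlText);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return images;
        }

        foreach (HtmlNode img in nodes)
        {
            // The XPath predicate already requires src; the default value is a safety net.
            images.Add(img.GetAttributeValue("src", string.Empty));
        }
        return images;
    }

    static void Main()
    {
        // Made-up markup: one img without src, one with.
        string html = "<p>text</p><img alt=\"no source\"><img src=\"a.png\">";
        foreach (string src in GetImageSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Filtering in the XPath and checking for null covers both failure cases named in the comment: an img without src, and a document without any img.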
3

This should capture all img tags and just the src part, no matter where it's located (before or after class etc.), and supports HTML/XHTML :D

<img.+?src="(.+?)".*?/?>

2

The regex you want should be along the lines of:

(<img.*?src="([^"]*)".*?>)

Hope this helps.

1

You can also use a lookbehind to do it without needing to pull out a group:

(?<=<img.*?src=")[^"]*

Remember to escape the quotes if needed.
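
As a rough illustration of the escaping, here is the same lookbehind pattern written as a regular and as a verbatim C# string literal. The sample HTML is an assumption. Note that the unbounded `.*?` inside the lookbehind is accepted by the .NET regex engine, but many other regex engines reject variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

class LookbehindSrc
{
    static void Main()
    {
        // Hypothetical sample input.
        string html = "<p>text</p><img class=\"thumb\" src=\"photo.jpg\" />";

        // Regular string literal: quotes escaped with a backslash.
        string escaped = "(?<=<img.*?src=\")[^\"]*";

        // Verbatim string literal: quotes escaped by doubling them.
        string verbatim = @"(?<=<img.*?src="")[^""]*";

        // The match itself is the src value, so no group is needed.
        Console.WriteLine(Regex.Match(html, escaped).Value);  // photo.jpg
        Console.WriteLine(Regex.Match(html, verbatim).Value); // photo.jpg
    }
}
```

Because `.*?` can cross tag boundaries, a src attribute on a later, non-img element could also satisfy the lookbehind; replacing `.*?` with `[^>]*?` is one way to keep it inside the img tag.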

0

This is what I use to get the tags out of strings:

</? *img[^>]*>
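
A short sketch of how this pattern might be used, both to list the tags and to strip them. The sample text and the use of RegexOptions.IgnoreCase are assumptions, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

class ImgTags
{
    static void Main()
    {
        // Hypothetical sample input.
        string text = "Before <img src=\"a.png\"> middle <IMG SRC='b.png' /> after";
        string pattern = "</? *img[^>]*>";

        // List every img tag found in the string.
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Value);
        }

        // Or remove the tags and keep only the surrounding text.
        string stripped = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);
        Console.WriteLine(stripped);
    }
}
```

This pattern returns whole tags only; getting the src value still needs a second step.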

-1

Here is the one I use:

<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>

The good part is that it matches any of the below:

<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">

And it can also match some less common cases, such as whitespace around the equals sign and extra attributes, e.g.:

<img src = "test.jpg" width="300">
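
To read the value out of this pattern in C#, the named group src can be queried directly, as in the sketch below. It reuses the four sample tags listed above; the class name is arbitrary. In .NET, unnamed groups are numbered before named ones, so `\1` refers to the quote character.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupSrc
{
    static void Main()
    {
        // Verbatim string: the double quote inside the character class is doubled.
        string pattern = @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>";

        string[] samples =
        {
            "<img src='test.jpg'>",
            "<img src=test.jpg>",
            "<img src=\"test.jpg\">",
            "<img src = \"test.jpg\" width=\"300\">"
        };

        foreach (string tag in samples)
        {
            Match m = Regex.Match(tag, pattern, RegexOptions.IgnoreCase);
            // Whichever alternative matched, the value ends up in the "src" group.
            Console.WriteLine(m.Groups["src"].Value); // test.jpg for each sample
        }
    }
}
```

Reusing the group name src in both alternatives relies on .NET allowing duplicate group names, so the caller does not need to know whether the value was quoted.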
