I want to write a regex that matches links in HTML code, capturing each href together with the src of the image inside that link. An example explains it better:

<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  <a href="I DONT NEED THIS" title="something">  </a>   <a href="I NEED THIS3" title="blah"> <figure> <img src="I NEED THIS4" alt="">   </figure>  </a>

I tried the following, but it captures I DONT NEED THIS instead of I NEED THIS3:

<a href="([^"]*)"\s*.*?<img src="(.*?)".*?\s*<\/a>

I tried to add a negative lookahead, (?!</a>), but no matter where I put it, it behaves as if I had not added it at all. I am not sure I understand negative lookahead correctly.
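A quick way to see why a single (?!</a>) changes nothing: a lookahead is zero-width and is tested only at the one position where it appears, so the lazy .*? that follows is still free to run across a closing tag. The sketch below uses Python's re module purely as a convenient test bench (an assumption, since the question targets VBA; the constructs used here are also supported by VBScript.RegExp), with the sample HTML joined onto one line. The name with_lookahead is a made-up variant that places the lookahead right after the href attribute.

```python
import re

# Sample from the question, joined onto a single line.
html = (
    '<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  '
    '<a href="I DONT NEED THIS" title="something">  </a>   '
    '<a href="I NEED THIS3" title="blah"> '
    '<figure> <img src="I NEED THIS4" alt="">   </figure>  </a>'
)

# The first attempt from the question.
original = r'<a href="([^"]*)"\s*.*?<img src="(.*?)".*?\s*<\/a>'

# Illustrative variant: one negative lookahead inserted after the href attribute.
# It is checked at that single position only; .*? then carries on as before.
with_lookahead = r'<a href="([^"]*)"\s*(?!</a>).*?<img src="(.*?)".*?\s*<\/a>'

original_pairs = re.findall(original, html)
lookahead_pairs = re.findall(with_lookahead, html)

print(original_pairs)
print(lookahead_pairs)
print(original_pairs == lookahead_pairs)
```

Both patterns return the same list, pairing I DONT NEED THIS with I NEED THIS4. For the lookahead to help, it has to be re-tested before every character that the pattern consumes between the link and the image.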

I also used a regex that finds words near each other. It works, but it is not a very elegant solution :) It finds an href and an img src when they are between 0 and 7 words apart:

<a href="([^"]*)"\W+(?:\w+\W+){0,7}?<img src="(.*?)".*?\s*<\/a>
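To illustrate why the word-distance approach is fragile, here is a small check sketched with Python's re module as a stand-in for the VBA engine (an assumption). The input tight_html is made up for this example: an empty link immediately followed by a link that contains an image.

```python
import re

# The word-distance pattern from the question.
pattern = re.compile(
    r'<a href="([^"]*)"\W+(?:\w+\W+){0,7}?<img src="(.*?)".*?\s*<\/a>'
)

# Sample from the question, joined onto a single line.
html = (
    '<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  '
    '<a href="I DONT NEED THIS" title="something">  </a>   '
    '<a href="I NEED THIS3" title="blah"> '
    '<figure> <img src="I NEED THIS4" alt="">   </figure>  </a>'
)
sample_pairs = pattern.findall(html)

# Hypothetical input: the image of the second link is fewer than
# 7 words away from the href of the first (empty) link.
tight_html = '<a href="X"></a><a href="Y"><img src="Z"></a>'
tight_pairs = pattern.findall(tight_html)

print(sample_pairs)
print(tight_pairs)
```

On the question's sample the pattern finds the two wanted pairs, but on tight_html it pairs X with Z, because the word count says nothing about which link the image belongs to.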

This will be used in Excel VBA; I have been testing it on online regex tester websites.
Any suggestions would be helpful.

  • If you are reading the HTML from the Web, you can use the InternetExplorer.Application object and then parse the DOM, which may be easier than using a regex. Commented May 26, 2016 at 12:09
  • I need this done with a regex, and it has to be a single expression. Two passes would probably be easier, but unfortunately I am not allowed to do that. Commented May 26, 2016 at 12:25
  • OK, try <a\b[^<]*\bhref="([^"]*)"[^<]*>(?:(?!</?a\b[^<]*>)[\s\S])*<img\b[^<]*\bsrc="([^"]*)". Commented May 26, 2016 at 12:36
  • Thank you, Wiktor. This looks like the correct regex; it is working well for me. I will test it some more. Could you please explain this part of the expression: (?:(?!<\/?a\b[^<]*>)[\s\S])* Commented May 26, 2016 at 12:43
  • Yeah, Wiktor, this definitely works. Thanks so much, you rock! :) Commented May 26, 2016 at 13:33
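On the part asked about in the comments: (?:(?!</?a\b[^<]*>)[\s\S])* consumes any character, newlines included, zero or more times, but before each character it asserts that no opening or closing a tag starts at that position. The match therefore cannot leave the current link before it reaches <img. The following is a sketch that exercises the suggested pattern with Python's re module (an assumption, used only as a test bench; the pattern string is the one from the comment), split into three commented pieces.

```python
import re

pattern = re.compile(
    # Opening <a ...> tag; group 1 captures the href value.
    r'<a\b[^<]*\bhref="([^"]*)"[^<]*>'
    # Any characters, re-checking before each one that no <a ...> or </a> tag starts here.
    r'(?:(?!</?a\b[^<]*>)[\s\S])*'
    # The <img ...> tag; group 2 captures the src value.
    r'<img\b[^<]*\bsrc="([^"]*)"'
)

# Sample from the question, joined onto a single line.
html = (
    '<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  '
    '<a href="I DONT NEED THIS" title="something">  </a>   '
    '<a href="I NEED THIS3" title="blah"> '
    '<figure> <img src="I NEED THIS4" alt="">   </figure>  </a>'
)

pairs = pattern.findall(html)
print(pairs)
```

For the link without an image, the repeated group stops at </a>, <img cannot match there, and the attempt fails, so I DONT NEED THIS is never captured.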

2 Answers

Use the MSHTML parser:

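' Late-bound MSHTML document; htmlstr is a String holding the HTML source
Dim odoc As Object: Set odoc = CreateObject("htmlfile")
Dim element As Object
odoc.Open
odoc.Write htmlstr
odoc.Close

' Show the src of every <img> in the document
For Each element In odoc.images
    MsgBox element.src
Next

' Show the href of every <a> in the document
For Each element In odoc.getElementsByTagName("a")
    MsgBox element.href
Next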

The returned href and src values may start with an "about:" prefix, which you may need to remove.
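The loops above list every image and every link separately. If a parser is acceptable, pairing each link with the image inside it needs only a little state. The sketch below shows that idea with Python's standard html.parser rather than MSHTML (an assumption, chosen because it is easy to run); the class name LinkedImageCollector and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class LinkedImageCollector(HTMLParser):
    """Collect (href, src) pairs for every <img> nested inside an <a>."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the link we are inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.current_href = attributes.get("href")
        elif tag == "img" and self.current_href is not None:
            self.pairs.append((self.current_href, attributes.get("src")))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None  # leaving the link


html = (
    '<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  '
    '<a href="I DONT NEED THIS" title="something">  </a>   '
    '<a href="I NEED THIS3" title="blah"> '
    '<figure> <img src="I NEED THIS4" alt="">   </figure>  </a>'
)

collector = LinkedImageCollector()
collector.feed(html)
print(collector.pairs)
```

The same bookkeeping could be done in VBA by walking each link element's child images instead of the whole document's image list.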

1 Comment

This is correct according to the famous advice at stackoverflow.com/a/1732454/122139, which has helped thousands.

Here's another solution.

(href="([^"]+).*(?=img src))|(img src="([^"]*))
  1. match href="
  2. capture everything before the next " -> capture group 2, the first value you're interested in
  3. but only if img src follows later on (positive lookahead)
  4. match img src="
  5. capture everything before the next " -> capture group 4, the second value you're interested in

Demo: https://regex101.com/r/yS9bB4/1
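How this pattern behaves depends on how the input is split into lines, because the greedy .* cannot cross a line break but will happily cross link boundaries within a line. The check below is a sketch using Python's re module as a stand-in for the VBA engine (an assumption); extract is a helper name made up here, and it collects groups 2 and 4 from every match.

```python
import re

pattern = re.compile(r'(href="([^"]+).*(?=img src))|(img src="([^"]*))')


def extract(html):
    """Return (href values from group 2, src values from group 4)."""
    hrefs, srcs = [], []
    for match in pattern.finditer(html):
        if match.group(2) is not None:
            hrefs.append(match.group(2))
        if match.group(4) is not None:
            srcs.append(match.group(4))
    return hrefs, srcs


# The three links from the question.
anchors = [
    '<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>',
    '<a href="I DONT NEED THIS" title="something">  </a>',
    '<a href="I NEED THIS3" title="blah"> <figure> '
    '<img src="I NEED THIS4" alt="">   </figure>  </a>',
]

one_per_line = extract("\n".join(anchors))  # each link on its own line
single_line = extract("  ".join(anchors))   # all links on one line

print(one_per_line)
print(single_line)
```

With one link per line the four wanted values are found. With everything on one line, .* runs from the first href up to the last img src, so only I NEED THIS1 and I NEED THIS4 are returned.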
