0

I need a way to find only rendered IMG tags in an HTML snippet. I can't just run a regex over the snippet to find all IMG tags, because I'd also match IMG tags that are displayed as text in the HTML (not rendered).

I'm using Python on AppEngine.

Any ideas?

Thanks, Ivan

4
  • "I'd also get IMG tags that are shown as text in the HTML" - can you explain this / give an example? I'm not sure what you mean by that. Commented Apr 7, 2009 at 13:49
  • Are you saying you want the images which aren't 404ing? The ones which aren't in hidden divs? Commented Apr 7, 2009 at 14:28
  • Oh and I see it's another html regex question. sigh Commented Apr 7, 2009 at 14:28
  • On some web pages there are code snippets shown, and those code snippets have IMG tags in them. Those IMG tags don't render as images; they're just shown as text. Broken URLs and hidden images are not an issue. Commented Apr 7, 2009 at 15:02

4 Answers

2

Sounds like a job for BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = """
... <html>
... <body>
... <img src="test.jpg">
... &lt;img src="yay.jpg"&gt;
... <!-- <img src="ohnoes.jpg"> -->
... <img src="hurrah.jpg">
... </body>
... </html>
... """
>>> soup = BeautifulSoup(doc)
>>> soup.findAll('img')
[<img src="test.jpg" />, <img src="hurrah.jpg" />]

As you can see, BeautifulSoup is smart enough to ignore comments and HTML that is only displayed as text (escaped).

EDIT: I'm not sure what you mean by the RSS feed escaping ALL images, though. I wouldn't expect BeautifulSoup to figure out which are meant to be shown if they are all escaped. Can you clarify?
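One possible reading of the RSS case, sketched below with only the standard library: if the feed escapes its whole content exactly once, then undoing one level of escaping should leave the real img tags as markup, while img tags meant to be shown as text would still be escaped. That "escaped exactly once" premise is an assumption about the feed, and `ImgCollector`, `rendered_img_sources` and `feed_content` are made-up names for illustration.

```python
from html import unescape
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every real <img> element it parses."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Comments and escaped text never reach this method; only real tags do.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def rendered_img_sources(escaped_snippet):
    # Assumption: the feed escaped the content exactly once, so one
    # unescape pass restores the markup a browser would render.
    markup = unescape(escaped_snippet)
    collector = ImgCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


# Hypothetical feed content: one real image, one image shown as text
# (the latter ends up doubly escaped in the feed).
feed_content = (
    '&lt;p&gt;&lt;img src="test.jpg"&gt;&lt;/p&gt;'
    '&lt;pre&gt;&amp;lt;img src="yay.jpg"&amp;gt;&lt;/pre&gt;'
)
print(rendered_img_sources(feed_content))  # ['test.jpg']
```

If the feed does not escape consistently, this will not be enough on its own, and the clarification asked for above still matters.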

1 Comment

Thanks! I'll give it a go. The scenario is actually a bit more complex: I'm parsing RSS content snippets which have all '<' and '>' escaped. So I'm wondering how the parser distinguishes between rendered img tags and non-rendered img tags, since both are escaped... hm?
2

The source code for a rendered img tag looks something like this:

<img src="img.jpg"></img>

If the img tag is displayed as text (not rendered), the HTML code would look like this:

 &lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;

&lt; is "<" character, &gt; is ">" character

To match rendered img tag only,you can use regex to match img tag formed by < and >, not &lt; and &gt;

Img tags in comments also need to be ignored by ingnoring characters between "<!--" and "-->"
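A rough sketch of that idea in Python, using only the `re` module. The names `find_rendered_imgs` and `snippet` are made up for the example. It assumes no ">" appears inside an attribute value, and it does not handle img tags inside containers such as <xmp> or <script>, so treat it as an illustration rather than a robust solution.

```python
import re

# Anything between "<!--" and "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# An img tag that starts with a literal "<" (so "&lt;img" does not match).
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def find_rendered_imgs(html):
    # Drop comments first so img tags inside them are never seen.
    without_comments = COMMENT_RE.sub("", html)
    return IMG_RE.findall(without_comments)


snippet = """
<img src="test.jpg">
&lt;img src="yay.jpg"&gt;
<!-- <img src="ohnoes.jpg"> -->
<img src="hurrah.jpg">
"""
print(find_rendered_imgs(snippet))
# ['<img src="test.jpg">', '<img src="hurrah.jpg">']
```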

1 Comment

Yeah, you are right. I think for comments, you can use a regex to ignore any characters between "<!--" and "-->".
2

Use BeautifulSoup. It is an HTML/XML parser for Python that provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It is unlikely to be fooled by fake img tags (ones that are only displayed as text).

0

As image tags might sit inside a <pre> or <xmp> tag, you probably have to walk through the DOM (that is, convert the HTML to an XML/DOM tree and search through it) and find all the <img> nodes. There is an xml.dom package in the Python standard library: docs.python.org

You could do that on the client as well and report the result back via Ajax (this would mean more load on the server, though).
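Here is a minimal server-side sketch of the tree walk. It swaps in the standard library's `html.parser` for `xml.dom`, because `xml.dom.minidom` expects well-formed XML and real-world HTML often isn't. `RenderedImgFinder` and its `text_only_tags` parameter are hypothetical names. Defaulting that parameter to only <xmp> is an assumption, since browsers generally still render an <img> placed inside <pre>; pass other tag names if your content calls for it.

```python
from html.parser import HTMLParser


class RenderedImgFinder(HTMLParser):
    """Collects img sources, skipping any img nested in a text-only container."""

    def __init__(self, text_only_tags=("xmp",)):
        super().__init__()
        self.text_only_tags = set(text_only_tags)
        self.depth = 0      # how many text-only containers are currently open
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in self.text_only_tags:
            self.depth += 1
        elif tag == "img" and self.depth == 0:
            self.sources.append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag in self.text_only_tags and self.depth > 0:
            self.depth -= 1


finder = RenderedImgFinder()
finder.feed(
    '<p><img src="shown.jpg"></p>'
    '<xmp><img src="as-text.jpg"></xmp>'
    '<img src="also-shown.jpg">'
)
finder.close()
print(finder.sources)  # ['shown.jpg', 'also-shown.jpg']
```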
