Finding all rendered images in a HTML file

Question

I need a way to find only the rendered IMG tags in an HTML snippet. I can't just regex the HTML snippet to find all IMG tags, because I'd also get IMG tags that are shown as text in the HTML (not rendered).

I'm using Python on AppEngine.

Any ideas?

Thanks, Ivan

"I'd also get IMG tags that are shown as text in the HTML" - can you explain this / give an example? I'm not sure what you mean by that.
– AdamKG, Commented Apr 7, 2009 at 13:49
Are you saying you want the images which aren't 404ing? The ones which aren't in hidden divs?
– annakata, Commented Apr 7, 2009 at 14:28
On some webpages there are code snippets shown, and those code snippets have IMG tags in them. Those IMG tags don't render as images, they're just shown as text. Broken URLs and hidden images are not an issue.
– user88104, Commented Apr 7, 2009 at 15:02

Paolo Bergantino · Accepted Answer · 2009-04-08 19:02:37Z

2

Sounds like a job for BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = """
... <html>
... <body>
... <img src="test.jpg">
... &lt;img src="yay.jpg"&gt;
... <!-- <img src="ohnoes.jpg"> -->
... <img src="hurrah.jpg">
... </body>
... </html>
... """
>>> soup = BeautifulSoup(doc)
>>> soup.findAll('img')
[<img src="test.jpg" />, <img src="hurrah.jpg" />]

As you can see, BeautifulSoup is smart enough to ignore comments and escaped HTML that is displayed as text.
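The session above uses the BeautifulSoup 3 import that was current in 2009; with the current bs4 package the equivalent would be `from bs4 import BeautifulSoup` and `soup.find_all('img')`. If BeautifulSoup isn't available at all, here is a minimal sketch of the same check using only the standard library's html.parser. The class name ImgCollector is made up for illustration, and this is just one way to do it, not what BeautifulSoup does internally.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collect the src of every real <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Only actual tags reach this method. Escaped markup arrives as
        # plain text (handle_data) and comments via handle_comment.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


doc = """
<html>
<body>
<img src="test.jpg">
&lt;img src="yay.jpg"&gt;
<!-- <img src="ohnoes.jpg"> -->
<img src="hurrah.jpg">
</body>
</html>
"""

collector = ImgCollector()
collector.feed(doc)
collector.close()
print(collector.sources)  # ['test.jpg', 'hurrah.jpg']
```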

EDIT: I'm not sure what you mean by the RSS feed escaping ALL images, though. I wouldn't expect BeautifulSoup to figure out which are meant to be shown if they are all escaped. Can you clarify?

edited Apr 8, 2009 at 19:02

answered Apr 7, 2009 at 18:51

Paolo Bergantino

1 Comment

user88104 Over a year ago

thanks! i'll give it a go. the scenario is actually a bit more complex - i'm parsing RSS content snippets which have all '<' and '>' escaped. so i'm wondering how the parser distinguishes between rendered img tags and non-rendered img tags, since both are escaped...hm?
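One possible reading of this scenario, sketched under a loud assumption: the feed escapes the HTML exactly once. If that holds, tags meant to render show up as `&lt;img ...&gt;`, while tags meant to be shown as text were already escaped in the page and show up double-escaped as `&amp;lt;img ...&amp;gt;`. Undoing one level of escaping and then parsing would separate the two. The names rendered_img_sources and ImgCollector, and the sample snippet, are invented for illustration; if the feed escapes differently, this won't work as-is.

```python
import html
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collect the src of every real <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def rendered_img_sources(escaped_snippet):
    # ASSUMPTION: the snippet was escaped exactly once by the feed.
    # html.unescape makes a single pass, so "&amp;lt;" becomes "&lt;"
    # (still text), while "&lt;" becomes "<" (a real tag).
    markup = html.unescape(escaped_snippet)
    collector = ImgCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


# Made-up feed content: one image meant to render, one shown as code.
snippet = (
    '&lt;p&gt;Logo: &lt;img src="logo.png"&gt;&lt;/p&gt;'
    '&lt;pre&gt;&amp;lt;img src="example.png"&amp;gt;&lt;/pre&gt;'
)
print(rendered_img_sources(snippet))  # ['logo.png']
```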

MSalters · Accepted Answer · 2009-04-07 14:32:16Z

2

The source code for a rendered img tag looks something like this:

<img src="img.jpg"></img>

If the img tag is displayed as text (not rendered), the HTML code would look like this:

 &lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;

&lt; is the "<" character, &gt; is the ">" character.

To match rendered img tags only, you can use a regex that matches img tags formed by < and >, not by &lt; and &gt;.

Img tags in comments also need to be ignored, by ignoring the characters between "<!--" and "-->".
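A rough sketch of that regex approach, for what it's worth: strip the comments first, then match img tags written with literal angle brackets. The pattern and function names are made up here, and the patterns are deliberately simple; they will be fooled by things like a ">" inside an attribute value or markup inside script blocks, so treat this as an illustration rather than a robust solution.

```python
import re

# Anything between "<!--" and "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# An img tag written with a literal "<" and ">" (not &lt; / &gt;).
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def find_rendered_img_tags(html_text):
    # Remove comments first so img tags inside them are never matched.
    without_comments = COMMENT_RE.sub("", html_text)
    return IMG_RE.findall(without_comments)


sample = """
<img src="img.jpg"></img>
&lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;
<!-- <img src="hidden.jpg"> -->
"""
print(find_rendered_img_tags(sample))  # ['<img src="img.jpg">']
```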

edited Apr 7, 2009 at 14:32

MSalters

answered Apr 7, 2009 at 14:11

wschenkai

1 Comment

wschenkai Over a year ago

Yeah, you are right. I think for comments, you can use a regex to ignore any characters between "<!--" and "-->".

nosklo · Accepted Answer · 2009-04-07 18:31:30Z

2

Use BeautifulSoup. It is an HTML/XML parser for Python that provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It is unlikely to be fooled by fake img tags.

answered Apr 7, 2009 at 18:31

nosklo

FloH · Accepted Answer · 2009-04-07 18:28:30Z

0

As image tags might sit inside a <pre> or <xmp> tag, you probably have to walk through the DOM (that is, convert the HTML to an XML/DOM tree and search through it) and find all the <img> nodes. There is an xml.dom package in the Python standard library: docs.python.org
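A small sketch of the DOM idea using xml.dom.minidom from the standard library. Big caveat: minidom only accepts well-formed XML, so this assumes the input is XHTML-style markup with every tag closed. Typical real-world HTML would need a more forgiving HTML parser in front of it. The sample document is made up.

```python
from xml.dom import minidom

# ASSUMPTION: well-formed XHTML (every tag closed), otherwise
# minidom.parseString raises a parse error.
xhtml = """<html>
<body>
<img src="test.jpg" />
<pre>&lt;img src="yay.jpg" /&gt;</pre>
<!-- <img src="ohnoes.jpg" /> -->
<img src="hurrah.jpg" />
</body>
</html>"""

dom = minidom.parseString(xhtml)

# Only element nodes named "img" are returned. The escaped tag is a
# text node and the commented-out tag is a comment node.
sources = [node.getAttribute("src") for node in dom.getElementsByTagName("img")]
print(sources)  # ['test.jpg', 'hurrah.jpg']
```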

You could do that on the client as well and report it back via AJAX (this would mean more load on the server though).

answered Apr 7, 2009 at 18:28

FloH
