How to parse array of HTML tags with python beautifulsoup

Question

How can I parse the following markup with Python and BeautifulSoup? I need to get each image together with its corresponding width and height properties (if they exist).

The code below "means there are 3 images on this page, the first image is 300x300, the middle one has unspecified dimensions, and the last one is 1000px tall" (as explained here)

<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />

So far I have the following code, but it only returns the first set of dimensions:

images = []
img_list = soup.findAll('meta', {"property":'og:image'})
for og_image in img_list:
    if not og_image.get('content'):
        continue

    image = {'url': og_image['content']}

    width = soup.find('meta', {"property":'og:image:width'})
    if width:
        image['width'] = width['content']
    height = soup.find('meta', {"property":'og:image:height'})
    if height:
        image['height'] = height['content']

    images.append(image)

Thanks!

As always: what have you tried? There's a ton of examples out there...
– Fredrik Pihl, Commented Aug 15, 2012 at 9:42
I have updated the question with the code I have tried. I am having trouble matching each image with its dimensions.
– Alex, Commented Aug 15, 2012 at 9:48

PaulMcG · Accepted Answer · 2012-08-15 09:59:15Z

It's not BeautifulSoup, but a pyparsing approach is pretty quick to knock together:

html = """
<meta property="og:image" content="http://example.com/rock.jpg" /> 
<meta property="og:image:width" content="300" /> 
<meta property="og:image:height" content="300" /> 
<meta property="og:image" content="http://example.com/rock2.jpg" /> 
<meta property="og:image" content="http://example.com/rock3.jpg" /> 
<meta property="og:image:height" content="1000" /> 
"""

from pyparsing import makeHTMLTags, withAttribute, Optional, Group

# use makeHTMLTags to define tag expressions (allows attributes, whitespace, 
# closing '/', etc., and sets up results names for matched attributes so they
# are easy to get at later)
meta,metaEnd = makeHTMLTags("meta")

# define a copy of the opening tag, filtering on the specific attribute to select for
img_meta = meta.copy().setParseAction(withAttribute(('property','og:image')))
wid_meta = meta.copy().setParseAction(withAttribute(('property','og:image:width')))
hgt_meta = meta.copy().setParseAction(withAttribute(('property','og:image:height')))

# now define the overall expression to look for, and assign names for subexpressions
# for width and height
img_ref = img_meta + Optional(Group(wid_meta)("width")) + Optional(Group(hgt_meta)("height"))

# use searchString to scan through the given text looking for matches
for img in img_ref.searchString(html):
    print("IMAGE:", img.content)
    if img.height:
        print("H:", img.height.content)
    if img.width:
        print("W:", img.width.content)
    print()

Prints:

IMAGE: http://example.com/rock.jpg
H: 300
W: 300

IMAGE: http://example.com/rock2.jpg

IMAGE: http://example.com/rock3.jpg
H: 1000

Thanks a lot, it's pretty straightforward with pyparsing! The problem is that I am already using BeautifulSoup for the rest of the code, and loading a second parser and parsing the page again would take too much time.

Alex · Accepted Answer · 2012-10-01 14:11:36Z

I wanted something fast that uses the BeautifulSoup tree structure. Here is the solution I found suitable, in case other people are looking for something similar:

from bs4 import BeautifulSoup, Tag

# html is the page source containing the og:image <meta> tags from the question
soup = BeautifulSoup(html, "html.parser")
images = []

img_list = soup.find_all('meta', {"property": 'og:image'})
for og_image in img_list:
    if not og_image.get('content'):
        continue

    image = {'url': og_image['content']}
    # the immediate sibling is the newline text node '\n', so step twice to reach the next tag
    # (this assumes exactly one text node between consecutive tags)
    sibling = og_image.next_sibling.next_sibling

    if sibling and isinstance(sibling, Tag) and sibling.get('property', '').startswith('og:image:'):
        dimension = sibling['content']
        # e.g. 'og:image:width' -> 'width'
        prop = sibling.get('property').rsplit(':')[-1]
        image[prop] = dimension

        # check one more tag, so an image can carry both width and height
        sibling = sibling.next_sibling.next_sibling
        if sibling and isinstance(sibling, Tag) and sibling.get('property', '').startswith('og:image:'):
            dimension = sibling['content']
            prop = sibling.get('property').rsplit(':')[-1]
            image[prop] = dimension

    images.append(image)
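
The sibling walk above can also be written as a loop, which avoids relying on exactly one whitespace node between tags and picks up any number of og:image:* properties. What follows is only a sketch under a few assumptions: it uses the bs4 package with the built-in "html.parser", the sample markup is copied from the question, and collect_og_images is a made-up helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
"""


def collect_og_images(soup):
    """Hypothetical helper: one dict per og:image, plus its og:image:* properties."""
    images = []
    for og_image in soup.find_all("meta", attrs={"property": "og:image"}):
        url = og_image.get("content")
        if not url:
            continue
        image = {"url": url}
        # find_next_siblings("meta") yields only following <meta> tags,
        # so whitespace text nodes between tags are skipped automatically
        for sibling in og_image.find_next_siblings("meta"):
            prop = sibling.get("property", "")
            if not prop.startswith("og:image:"):
                break  # reached the next image or an unrelated <meta> tag
            # 'og:image:width' -> 'width'
            image[prop.rsplit(":", 1)[-1]] = sibling.get("content")
        images.append(image)
    return images


soup = BeautifulSoup(html, "html.parser")
images = collect_og_images(soup)
print(images)
```

Like the code above, this treats every og:image:* tag that directly follows an image as a property of that image, so other keys in that namespace would be collected as well, not just width and height.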

lbolla · Accepted Answer · 2012-08-15 10:09:52Z

Yours is not a parsing problem, but a list processing one. You want to "group" a list like this:

[u'http://example.com/rock.jpg', u'300', u'300', u'http://example.com/rock2.jpg', u'http://example.com/rock3.jpg', u'1000']

Into something like this:

[[u'http://example.com/rock.jpg', u'300', u'300'], [u'http://example.com/rock2.jpg'], [u'http://example.com/rock3.jpg', u'1000']]

This is my solution:

from bs4 import BeautifulSoup


content = '''<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />'''


soup = BeautifulSoup(content, "html.parser")
# Flatten every <meta> tag into its content value, in document order
data = [m['content'] for m in soup.find_all('meta')]

# Grouping: a value starting with 'http' opens a new image;
# any other value is attached to the image currently being built
images = []
current_image = None
for d in data:
    if d.startswith('http'):
        if current_image:
            images.append(current_image)
        current_image = [d]
    else:
        if current_image:
            current_image.append(d)
        else:
            raise ValueError('dimension found before any image URL')
# Flush the last image, if there was one
if current_image:
    images.append(current_image)

print(data)
print(images)
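
The same grouping idea can keep the property names, so each image ends up as a dict instead of a bare list. This is a minimal sketch, assuming the tags have already been reduced to (property, content) pairs, for example with `[(m.get('property'), m.get('content')) for m in soup.findAll('meta')]`. The name group_og_images is hypothetical, and unlike the loop above it silently skips a dimension that appears before any image rather than raising.

```python
def group_og_images(pairs):
    """Hypothetical helper: group (property, content) pairs into one dict per og:image."""
    images = []
    for prop, content in pairs:
        prop = prop or ""  # <meta> tags without a property attribute yield None
        if prop == "og:image":
            # every og:image tag starts a new image
            images.append({"url": content})
        elif prop.startswith("og:image:") and images:
            # 'og:image:width' -> 'width', attached to the most recent image
            images[-1][prop.rsplit(":", 1)[-1]] = content
    return images


pairs = [
    ("og:image", "http://example.com/rock.jpg"),
    ("og:image:width", "300"),
    ("og:image:height", "300"),
    ("og:image", "http://example.com/rock2.jpg"),
    ("og:image", "http://example.com/rock3.jpg"),
    ("og:image:height", "1000"),
]
images = group_og_images(pairs)
print(images)
```

Keying on the property name rather than on whether the value starts with 'http' also keeps width and height distinguishable when only one of them is present.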

You are right, it was more of a grouping issue than a parsing one. But I wanted something that uses the BeautifulSoup tree structure and takes advantage of it. Thanks for the code!
