How can I parse the following HTML with Python's BeautifulSoup? I need to get each image together with its corresponding width and height properties (if they exist).

The code below "means there are 3 images on this page, the first image is 300x300, the middle one has unspecified dimensions, and the last one is 1000px tall" (as explained here)

<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />

So far I have the following code, but it only returns the first set of dimensions:

images = []
img_list = soup.findAll('meta', {"property":'og:image'})
for og_image in img_list:
    if not og_image.get('content'):
        continue

    image = {'url': og_image['content']}

    width = soup.find('meta', {"property":'og:image:width'})
    if width:
        image['width'] = width['content']
    height = soup.find('meta', {"property":'og:image:height'})
    if height:
        image['height'] = height['content']

    images.append(image)

Thanks!

  • as always; what have you tried? There's a ton of examples out there... Commented Aug 15, 2012 at 9:42
  • I have updated the question with the code I have tried. I am having trouble matching each image with its dimensions. Commented Aug 15, 2012 at 9:48

3 Answers

It's not BeautifulSoup, but a pyparsing approach is pretty quick to knock together:

html = """
<meta property="og:image" content="http://example.com/rock.jpg" /> 
<meta property="og:image:width" content="300" /> 
<meta property="og:image:height" content="300" /> 
<meta property="og:image" content="http://example.com/rock2.jpg" /> 
<meta property="og:image" content="http://example.com/rock3.jpg" /> 
<meta property="og:image:height" content="1000" /> 
"""

from pyparsing import makeHTMLTags, withAttribute, Optional, Group

# use makeHTMLTags to define tag expressions (allows attributes, whitespace, 
# closing '/', etc., and sets up results names for matched attributes so they
# are easy to get at later)
meta,metaEnd = makeHTMLTags("meta")

# define a copy of the opening tag, filtering on the specific attribute to select for
img_meta = meta.copy().setParseAction(withAttribute(('property','og:image')))
wid_meta = meta.copy().setParseAction(withAttribute(('property','og:image:width')))
hgt_meta = meta.copy().setParseAction(withAttribute(('property','og:image:height')))

# now define the overall expression to look for, and assign names for subexpressions
# for width and height
img_ref = img_meta + Optional(Group(wid_meta)("width")) + Optional(Group(hgt_meta)("height"))

# use searchString to scan through the given text looking for matches
for img in img_ref.searchString(html):
    # %-formatting inside print(...) behaves the same on Python 2 and 3
    print("IMAGE: %s" % img.content)
    if img.height:
        print("H: %s" % img.height.content)
    if img.width:
        print("W: %s" % img.width.content)
    print("")

Prints:

IMAGE: http://example.com/rock.jpg
H: 300
W: 300

IMAGE: http://example.com/rock2.jpg

IMAGE: http://example.com/rock3.jpg
H: 1000

1 Comment

Thanks a lot, it's pretty straightforward with pyparsing! The problem is that I am already using BeautifulSoup for the rest of the code, and loading and running a second parser would take too much time.

I wanted something fast that uses the BeautifulSoup tree structure. Here is the solution I found suitable, in case others are looking for something similar:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
images = []

for og_image in soup.findAll('meta', {"property": 'og:image'}):
    if not og_image.get('content'):
        continue

    image = {'url': og_image['content']}

    # findNextSibling('meta') skips the '\n' text nodes between tags and
    # returns None at the end, so there is no chained .nextSibling to fail
    sibling = og_image.findNextSibling('meta')

    # collect the og:image:* tags that follow this image
    while sibling and sibling.get('property', '').startswith('og:image:'):
        prop = sibling['property'].rsplit(':', 1)[-1]  # 'width' or 'height'
        image[prop] = sibling.get('content')
        sibling = sibling.findNextSibling('meta')

    images.append(image)
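
For readers on the newer bs4 package, the same sibling-walking idea might look roughly like the sketch below. It assumes Python 3, bs4 with the built-in "html.parser" backend, and the sample markup from the question; the function name collect_og_images is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
"""


def collect_og_images(markup):
    """Hypothetical helper: pair each og:image with the og:image:* tags after it."""
    soup = BeautifulSoup(markup, "html.parser")
    images = []
    for og_image in soup.find_all("meta", attrs={"property": "og:image"}):
        url = og_image.get("content")
        if not url:
            continue
        image = {"url": url}
        # find_next_siblings("meta") yields only the following <meta> tags,
        # so the whitespace text nodes between them are skipped.
        for sibling in og_image.find_next_siblings("meta"):
            prop = sibling.get("property", "")
            if not prop.startswith("og:image:"):
                break  # reached the next image or an unrelated tag
            image[prop.rsplit(":", 1)[-1]] = sibling.get("content")
        images.append(image)
    return images


images = collect_og_images(html)
print(images)
```

Stopping at the first sibling that is not an og:image:* property keeps one image's dimensions from leaking into the next.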

Yours is not a parsing problem, but a list processing one. You want to "group" a list like this:

[u'http://example.com/rock.jpg', u'300', u'300', u'http://example.com/rock2.jpg', u'http://example.com/rock3.jpg', u'1000']

Into something like this:

[[u'http://example.com/rock.jpg', u'300', u'300'], [u'http://example.com/rock2.jpg'], [u'http://example.com/rock3.jpg', u'1000']]

This is my solution:

import BeautifulSoup as BS


content = '''<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />'''


soup = BS.BeautifulSoup(content)
# content values of all <meta> tags, in document order
data = [m['content'] for m in soup.findAll('meta')]

# Grouping: a value starting with 'http' opens a new image,
# any other value is a dimension of the current image
images = []
current_image = None
for d in data:
    if d.startswith('http'):
        if current_image:
            images.append(current_image)
        current_image = [d]
    else:
        if current_image:
            current_image.append(d)
        else:
            raise ValueError('dimension found before any image URL')
if current_image:
    images.append(current_image)

print(data)
print(images)
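
One limitation of grouping bare content values is that the property name is lost: for rock3.jpg the list holds '1000' but no longer says that it is a height. A possible variation, sketched here in plain Python 3, groups (property, content) pairs instead. The pairs are assumed to have been extracted beforehand, for example from m.get('property') and m.get('content') on each meta tag; the name group_og_images is hypothetical.

```python
# (property, content) pairs in document order; assumed to come from the
# <meta> tags of the page, hard-coded here to keep the sketch self-contained.
pairs = [
    ("og:image", "http://example.com/rock.jpg"),
    ("og:image:width", "300"),
    ("og:image:height", "300"),
    ("og:image", "http://example.com/rock2.jpg"),
    ("og:image", "http://example.com/rock3.jpg"),
    ("og:image:height", "1000"),
]


def group_og_images(pairs):
    """Hypothetical helper: attach og:image:* values to the preceding og:image."""
    images = []
    for prop, content in pairs:
        if prop == "og:image":
            # a new image starts a new group
            images.append({"url": content})
        elif prop.startswith("og:image:"):
            if not images:
                raise ValueError("%s found before any og:image" % prop)
            # 'og:image:height' -> 'height'
            images[-1][prop.rsplit(":", 1)[-1]] = content
        # any other property is ignored
    return images


images = group_og_images(pairs)
print(images)
```

Because each group is a dict keyed by the property suffix, a missing width no longer shifts the height into the wrong position.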

1 Comment

You are right, it was more of a grouping issue than a parsing one. But I wanted something that uses the BeautifulSoup tree structure and takes advantage of it. Thanks for the code!
