Writing a Python RegEx to select a sub-set of list items in HTML

Question

I have a web index view of a folder...

<ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li> 
<li><a href="n150850_.jpg"> n150850_.jpg</a></li> 
<li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li> 
<li><a href="n150850q.jpg"> n150850q.jpg</a></li> 
<li><a href="n150858_.jpg"> n150858_.jpg</a></li> 
<li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li> 
<li><a href="n150858q.jpg"> n150858q.jpg</a></li> 
<li><a href="n150906_.jpg"> n150906_.jpg</a></li> 
<li><a href="n150906_ss.jpg"> n150906_ss.jpg</a></li>
...

The list goes on and on and on. My goal is to grab only the list items ending in _ss.jpg so that I can render out my results and display them nicely on a page for presentation.

I can grab the page with BeautifulSoup, but from there I'm not sure how to filter out only the list items matching a particular pattern. The page is behind Basic Auth, which I solved in a previous question regarding BeautifulSoup. I'm happy not to use it, either.

Any ideas?

I guess another way of approaching this problem is to somehow grab each filename without the differing suffix and then apply each suffix to generate a list of each type(?)...
– Ben Keating, Commented Nov 24, 2010 at 20:29

Brent Newey · Accepted Answer · 2010-11-24 20:36:18Z

6

You can do a findAll() using a regex, for example soup_object.findAll('a', {'href': re.compile('.*_ss\.jpg')}).


5 Comments

Ben Keating Over a year ago

Wow. That was fast. Thank you, Brent!

Ben Keating Over a year ago

I'm trying to write a pattern for the other files... I noticed that some end in a letter, but I wish to exclude _ss from the resulting set. Could you help me figure out this pattern? soup.findAll('a', {'href': re.compile('.*[a-z]\.jpg')}) is what I came up with, but it's including the _ss files. I guess I need to find a way to "exclude" any files with _ in them?

Brent Newey Over a year ago

You can do this with a negative lookbehind: soup_object.findAll('a', {'href': re.compile('.*(?<!_ss)\.jpg')}). This essentially says "Don't give me anything with _ss right before the .jpg."

Ben Keating Over a year ago

Nice. This is all going in my diary file. I dug up a few regex cheat sheets and none of them mentioned this assertion. Thank you very much for teaching me this.

PaulMcG Over a year ago

Raw string literals are always recommended for re's: r'.*_ss\.jpg'
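
To see how the patterns from this thread behave, here is a small sketch that uses only the re module on a few hrefs copied from the question. The helper name matching and the third pattern (the letter class combined with the lookbehind) are assumptions added for illustration, not something posted in the thread; the use of search rather than match is also an assumption about how a filter would apply the pattern.

```python
import re

# Sample hrefs copied from the directory listing in the question.
hrefs = [
    "/sustainabilitymedia/pics/s5/",
    "n150850_.jpg",
    "n150850_ss.jpg",
    "n150850q.jpg",
]

# Raw strings keep the backslash in \. from being read as a string escape.
letter_only = re.compile(r".*[a-z]\.jpg")            # first attempt from the comment
not_ss = re.compile(r".*(?<!_ss)\.jpg")              # negative lookbehind
letter_not_ss = re.compile(r".*[a-z](?<!_ss)\.jpg")  # assumption: both combined


def matching(pattern, names):
    """Return the names in which the pattern is found (hypothetical helper)."""
    return [name for name in names if pattern.search(name)]


print(matching(letter_only, hrefs))    # ['n150850_ss.jpg', 'n150850q.jpg']
print(matching(not_ss, hrefs))         # ['n150850_.jpg', 'n150850q.jpg']
print(matching(letter_not_ss, hrefs))  # ['n150850q.jpg']
```

Note that the lookbehind alone still lets n150850_.jpg through, because only _ss directly before .jpg is rejected. If the goal is just the files ending in a letter, the combined pattern may be closer to what was asked for.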

mechanical_meat · Accepted Answer · 2010-11-24 20:43:50Z

1

Brent's exactly right; +1 to him for being so fast.

I had already worked out an example so I figured I'd just post anyway (no need to vote on this):

>>> from BeautifulSoup import BeautifulSoup as bs
>>> from pprint import pprint
>>> import re
>>> markup = '''
... <ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li>
... <li><a href="n150850_.jpg"> n150850_.jpg</a></li>
... <li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li>
... <li><a href="n150850q.jpg"> n150850q.jpg</a></li>
... <li><a href="n150858_.jpg"> n150858_.jpg</a></li>
... <li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li>
... <li><a href="n150858q.jpg"> n150858q.jpg</a></li>
... <li><a href="n150906_.jpg"> n150906_.jpg</a></li>
... <li><a href="n150906_ss.jpg"> n150906_ss.jpg</a></li>'''
>>> soup = bs(markup)
>>> pprint(soup.findAll(href=re.compile('_ss[.]jpg$')))
[<a href="n150850_ss.jpg"> n150850_ss.jpg</a>,
 <a href="n150858_ss.jpg"> n150858_ss.jpg</a>,
 <a href="n150906_ss.jpg"> n150906_ss.jpg</a>]

Happy Thanksgiving to those who celebrate it.
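
Since the question says BeautifulSoup is optional, the same filter can probably be done with the standard library's html.parser as well. This is only a sketch under that assumption: the class name HrefCollector is made up for illustration, and the markup is a shortened copy of the listing above.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that match a compiled pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.pattern.search(value):
                self.hrefs.append(value)


markup = '''
<ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li>
<li><a href="n150850_.jpg"> n150850_.jpg</a></li>
<li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li>
<li><a href="n150850q.jpg"> n150850q.jpg</a></li>
<li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li>'''

collector = HrefCollector(re.compile(r"_ss[.]jpg$"))
collector.feed(markup)
collector.close()
print(collector.hrefs)  # ['n150850_ss.jpg', 'n150858_ss.jpg']
```

This returns plain href strings rather than tag objects, which may be enough when the goal is only to build a list of image names for a page.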


1 Comment

Ben Keating Over a year ago

Thanks for the additional example Adam. I will likely use this as well when constructing my view. I better go answer a few questions today to pay it forward.

kasten · Accepted Answer · 2010-11-24 20:47:42Z

0

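I would use something like:

import re
data = data.split("\n")
data = filter(lambda x: x.find("_ss.jpg") >= 0, data)
# re.search rather than re.match: the href is not at the start of the line
data = [m.group(0) for m in (re.search(r'(?<=href=")[^"]*_ss\.jpg(?=")', x) for x in data) if m]

This should produce a list of the names ending with _ss.jpg, assuming data holds the page text.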
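
If the line-by-line split is not needed, a single re.findall over the whole page text is a possible shortcut for the same idea. This is a sketch only; the data string below is a hypothetical stand-in for the downloaded page, and the pattern assumes every href is written with double quotes as in the question's sample.

```python
import re

# Hypothetical stand-in for the downloaded page text.
data = '''<li><a href="n150850_.jpg"> n150850_.jpg</a></li>
<li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li>
<li><a href="n150850q.jpg"> n150850q.jpg</a></li>
<li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li>'''

# Capture the href value only when it ends in _ss.jpg.
# [^"]* cannot cross the closing quote, so each match stays inside one attribute.
ss_names = re.findall(r'href="([^"]*_ss\.jpg)"', data)
print(ss_names)  # ['n150850_ss.jpg', 'n150858_ss.jpg']
```

Matching raw HTML with a regex like this is fragile if the markup varies, so it is best treated as a quick check rather than a general parser.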
