1

I have a web index view of a folder...

<ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li> 
<li><a href="n150850_.jpg"> n150850_.jpg</a></li> 
<li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li> 
<li><a href="n150850q.jpg"> n150850q.jpg</a></li> 
<li><a href="n150858_.jpg"> n150858_.jpg</a></li> 
<li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li> 
<li><a href="n150858q.jpg"> n150858q.jpg</a></li> 
<li><a href="n150906_.jpg"> n150906_.jpg</a></li> 
<li><a href="n150906_ss.jpg"> n150906_ss.jpg</a></li>
...

The list goes on and on and on. My goal is to grab only the list items ending in _ss.jpg so that I can render out my results and display them nicely on a page for presentation.

I can grab the page with BeautifulSoup, but from there I'm not sure how to filter out only the list items matching a particular pattern. The page is behind Basic Auth, which I solved in a previous question regarding BeautifulSoup. I'm happy to not use it either.

Any ideas?

1
  • I guess another way of approaching this problem is to grab each filename without its distinguishing suffix and then apply each suffix to generate a list of each type(?)... Commented Nov 24, 2010 at 20:29
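The idea in this comment can be sketched with the standard `re` module alone. This is only an illustration under one assumption not stated in the question: that every file name is `n` followed by digits, then a variant suffix, then `.jpg`. The names `NAME_RE` and `group_by_variant` are made up for the example.

```python
import re
from collections import defaultdict

# Sample names copied from the listing in the question.
names = [
    "n150850_.jpg", "n150850_ss.jpg", "n150850q.jpg",
    "n150858_.jpg", "n150858_ss.jpg", "n150858q.jpg",
    "n150906_.jpg", "n150906_ss.jpg",
]

# Assumption: a name is "n" + digits, then a variant suffix, then ".jpg".
NAME_RE = re.compile(r"^(n\d+)(.*)\.jpg$")


def group_by_variant(filenames):
    """Map each variant suffix (e.g. "_ss") to the base names that have it."""
    groups = defaultdict(list)
    for name in filenames:
        match = NAME_RE.match(name)
        if match:  # skip anything that does not fit the assumed pattern
            base, variant = match.groups()
            groups[variant].append(base)
    return dict(groups)


by_variant = group_by_variant(names)
print(by_variant["_ss"])
```

With the lists grouped this way, a file name of any type can be rebuilt as base + suffix + ".jpg".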

3 Answers

6

You can do a findAll() using a regex, for example soup_object.findAll('a', {'href': re.compile('.*_ss\.jpg')}).


5 Comments

Wow. That was fast, Thank you, Brent!
I'm trying to write a pattern for the other files... I noticed that some end in a letter, but I wish to exclude _ss from the resulting set. Could you help me figure out this pattern? soup.findAll('a', {'href': re.compile('.*[a-z]\.jpg')}) is what I came up with, but it's including the _ss files. I guess I need to find a way to "exclude" any files with _ in them?
You can do this with a negative lookbehind: soup_object.findAll('a', {'href': re.compile('.*(?<!_ss)\.jpg')}). This essentially says "Don't give me anything with _ss right before the .jpg."
Nice. This is all going in my diary file. I dug up a few regex cheat sheets and none of them mentioned this assertion. Thank you very much for teaching me this.
Raw string literals are always recommended for re's: r'.*_ss\.jpg'
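To see the two patterns from this thread side by side, here is a small sketch that applies them to plain file names with the standard `re` module, written as raw strings as suggested above. The `$` anchor and the variable names are additions for the example; the sample names come from the question.

```python
import re

# Sample names copied from the listing in the question.
names = [
    "n150850_.jpg", "n150850_ss.jpg", "n150850q.jpg",
    "n150858_.jpg", "n150858_ss.jpg", "n150858q.jpg",
]

# Matches names that end in "_ss.jpg".
ss_pattern = re.compile(r"_ss\.jpg$")

# Negative lookbehind: ".jpg" at the end, but NOT preceded by "_ss".
not_ss_pattern = re.compile(r"(?<!_ss)\.jpg$")

ss_names = [n for n in names if ss_pattern.search(n)]
other_names = [n for n in names if not_ss_pattern.search(n)]

print(ss_names)
print(other_names)
```

The same compiled patterns can be passed as the `href` filter in the findAll() calls shown above.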
1

Brent's exactly right; +1 to him for being so fast.

I had already worked out an example so I figured I'd just post anyway (no need to vote on this):

>>> from BeautifulSoup import BeautifulSoup as bs
>>> from pprint import pprint
>>> import re
>>> markup = '''
... <ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li>
... <li><a href="n150850_.jpg"> n150850_.jpg</a></li>
... <li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li>
... <li><a href="n150850q.jpg"> n150850q.jpg</a></li>
... <li><a href="n150858_.jpg"> n150858_.jpg</a></li>
... <li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li>
... <li><a href="n150858q.jpg"> n150858q.jpg</a></li>
... <li><a href="n150906_.jpg"> n150906_.jpg</a></li>
... <li><a href="n150906_ss.jpg"> n150906_ss.jpg</a></li>'''
>>> soup = bs(markup)
>>> pprint(soup.findAll(href=re.compile('_ss[.]jpg$')))
[<a href="n150850_ss.jpg"> n150850_ss.jpg</a>,
 <a href="n150858_ss.jpg"> n150858_ss.jpg</a>,
 <a href="n150906_ss.jpg"> n150906_ss.jpg</a>]
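Since the question mentions being open to not using BeautifulSoup, the same filter can be sketched with only the Python 3 standard library. This is a minimal illustration, not a replacement for the session above: the `LinkCollector` class is a made-up name, and the markup is a shortened copy of the listing from the question.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a regular expression."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.hrefs.append(href)


markup = '''
<ul><li><a href="/sustainabilitymedia/pics/s5/"> Parent Directory</a></li>
<li><a href="n150850_.jpg"> n150850_.jpg</a></li>
<li><a href="n150850_ss.jpg"> n150850_ss.jpg</a></li>
<li><a href="n150850q.jpg"> n150850q.jpg</a></li>
<li><a href="n150858_ss.jpg"> n150858_ss.jpg</a></li>
'''

collector = LinkCollector(r"_ss\.jpg$")
collector.feed(markup)
collector.close()
print(collector.hrefs)
```

Unlike the BeautifulSoup version, this yields the href strings rather than tag objects, which may be all a template needs to render the images.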

Happy Thanksgiving to those who celebrate it.

1 Comment

Thanks for the additional example Adam. I will likely use this as well when constructing my view. I better go answer a few questions today to pay it forward.
0

I would use something like:

import re
# data holds the page's HTML as a single string
pattern = re.compile(r'(?<=href=")[^"]*_ss\.jpg(?=")')
matches = (pattern.search(line) for line in data.split("\n"))
data = [m.group(0) for m in matches if m]

This should produce a list of the names ending with _ss.jpg.
