Finding urls containing a specific string

Question

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.

I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want a simple solution that gets the links I need, but even though the webpages are somewhat similar, the links that I want are not.

The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.

I figure it would be possible using RegEx to go through the webpage, find all urls that have 'uge' or 'Uge' in them, and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?

For example, here are two of the urls I want to grab in different webpages:

http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx

http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx

Welcome to Stack Overflow! We encourage you to research your questions. If you've tried something already, please add it to the question. If not, research and attempt your question first, and then come back.
– user647772, Commented Oct 30, 2012 at 13:36
Thanks, Ticho. I've done research on RegEx but never actually used it. I'm asking the question because I couldn't find a way to solve the problem using BeautifulSoup.
– kabp, Commented Oct 30, 2012 at 13:55

CoffeeRain · Accepted Answer · 2012-10-30 13:47:07Z

2

This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.

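import re

# listofurls is assumed to be a list of URL strings collected elsewhere
for item in listofurls:
    # findall returns every "uge" + one or two digits match, ignoring case
    matches = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if matches:
        print(item)  # just do whatever you want to do when it finds it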
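As a quick check of the pattern, here is a minimal sketch that runs it against the two URLs from the question. The third URL and the names `urls`, `week_pattern`, `matches` and `found` are made up for illustration.

```python
import re

urls = [
    # The two example URLs from the question
    "http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx",
    "http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx",
    # Hypothetical URL without a week number, added only as a non-match
    "http://www.domstol.dk/esbjerg/retslister/Pages/default.aspx",
]

# "uge" followed by one or two digits, matched case-insensitively
week_pattern = re.compile(r"uge\d\d?", re.IGNORECASE)

# Keep only the URLs in which the pattern occurs somewhere
matches = [url for url in urls if week_pattern.search(url)]

# Pull out the matched text itself, e.g. to read off the week number
found = [week_pattern.search(url).group(0) for url in matches]
print(found)  # ['Uge45', 'uge32']
```

Compiling the pattern once is just a convenience here; calling `re.search` with `re.IGNORECASE` on each URL should behave the same way.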

answered Oct 30, 2012 at 13:47

CoffeeRain


1 Comment

kabp Over a year ago

Thanks for the reply, Coffee. It looks right, but I've been fiddling around with no luck. Will give it another shot later.

Zero Piraeus · Accepted Answer · 2012-10-30 14:45:09Z

1

Yes, you can do this with BeautifulSoup.

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 the import is: from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string)  # html_string holds the page's HTML
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
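If installing BeautifulSoup is not an option, roughly the same filter can be sketched with Python 3's standard-library `html.parser`. This is a swapped-in technique, not the BeautifulSoup call above. The class name `WeekLinkCollector` and the HTML fragment are hypothetical; only the first two hrefs come from the question.

```python
import re
from html.parser import HTMLParser


class WeekLinkCollector(HTMLParser):
    """Collects href values of <a> tags whose URL contains 'uge' plus digits."""

    def __init__(self):
        super().__init__()
        self.pattern = re.compile(r"uge\d+", re.IGNORECASE)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only <a> tags are of interest
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.urls.append(href)


# Hypothetical page fragment used only for this sketch
html_string = """
<a href="http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx">Auctions</a>
<a href="http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx">Criminal cases</a>
<a href="http://www.domstol.dk/esbjerg/Pages/default.aspx">Home</a>
"""

collector = WeekLinkCollector()
collector.feed(html_string)
print(collector.urls)  # the two week links, in document order
```

BeautifulSoup is likely the more forgiving choice for messy real-world HTML; the sketch is only meant to show that the filtering step is the same regular expression applied to each `href`.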

edited Oct 30, 2012 at 14:45

answered Oct 30, 2012 at 14:37

Zero Piraeus


3 Comments

kabp Over a year ago

Wow, that works wonderfully. I get a list of the urls that I want. Thanks a bunch! Any idea why all the urls have a u' in front of them? Example: u'domstol.dk/KobenhavnsByret/retslister/Pages/…'

kabp Over a year ago

I 'fixed' it with some simple string formatting, but I'm still not sure why the 'u' is added.

Willy Over a year ago

@kabp the u in front of the string means that it is a unicode string ;)

Willy · Accepted Answer · 2012-10-30 13:45:08Z

1

Or just use a simple for loop:

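list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    # Case-insensitive substring check; it does not require a week number
    if 'uge' in url.lower():
        print(url)  # replace with the code to execute

The regular expression would look something like: uge\d\d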
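To make the difference between the two approaches concrete, here is a small sketch comparing the substring check with the regular expression. The second URL is hypothetical, chosen because the Danish word "Bruger" happens to contain "uge". The variable names are also made up for illustration.

```python
import re

urls = [
    # From the question
    "http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx",
    # Hypothetical URL: contains "uge" inside "Bruger" but no week number
    "http://www.domstol.dk/esbjerg/Pages/Brugervejledning.aspx",
]

# Plain substring check: matches any URL containing "uge", in any case
substring_hits = [u for u in urls if "uge" in u.lower()]

# Regular expression: additionally requires one or two digits after "uge"
regex_hits = [u for u in urls if re.search(r"uge\d\d?", u, re.IGNORECASE)]

print(len(substring_hits))  # 2
print(len(regex_hits))      # 1
```

Whether the stricter match matters depends on the pages being scraped; if no other link contains "uge", the simple loop is probably enough.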

edited Oct 30, 2012 at 13:45

answered Oct 30, 2012 at 13:36

Willy


2 Comments

Willy Over a year ago

Note that the for loop does not look for a number after "uge".

kabp Over a year ago

Thanks a bunch, Willy. Much appreciated even though I went with another method.
