
I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.

I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.

The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.

I figure it would be possible using RegEx to go through the webpage and find all URLs that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?

For example, here are two of the urls I want to grab in different webpages:

http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx

http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx

  • Welcome to Stack Overflow! We encourage you to research your questions. If you've tried something already, please add it to the question - if not, research and attempt your question first, and then come back. Commented Oct 30, 2012 at 13:36
  • Thanks, Ticho. I've done research on RegEx but never actually used it. I'm asking the question because I couldn't find a way to solve the problem using BeautifulSoup. Commented Oct 30, 2012 at 13:55

3 Answers


This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.

import re
for item in listofurls:
  # Match "uge" in any letter case, followed by one or two digits
  l = re.findall(r"uge\d\d?", item, re.IGNORECASE)
  if l:
    print(item)  # just do whatever you want to do when it finds it
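
To see the pattern in action, here is a minimal sketch that runs it against the two URLs from the question plus one made-up URL that should not match. The names WEEK_PATTERN and week_number are just illustrative, and the capture group for pulling out the week number is an addition of mine, not something the question asked for.

```python
import re

# "uge" in any letter case, followed by one or two digits (captured).
WEEK_PATTERN = re.compile(r"uge(\d\d?)", re.IGNORECASE)

urls = [
    "http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx",
    "http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx",
    # Hypothetical URL without a week marker, added for contrast.
    "http://www.domstol.dk/esbjerg/retslister/Pages/default.aspx",
]


def week_number(url):
    """Return the week number found in url as an int, or None if absent."""
    match = WEEK_PATTERN.search(url)
    return int(match.group(1)) if match else None


weeks = [week_number(url) for url in urls]
print(weeks)  # [45, 32, None]
```

Using search instead of findall is enough here, since only the first occurrence matters for deciding whether to open the URL.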

1 Comment

Thanks for the reply, Coffee. It looks right but I've been fiddling around to no luck. Will give it another shot later.

Yes, you can do this with BeautifulSoup.

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4, use: from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
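
If BeautifulSoup isn't installed, roughly the same filtering can be sketched with Python 3's standard-library html.parser instead. This swaps in a different parser, so treat it as an approximation of the idea rather than an equivalent of the snippet above. The class name WeekLinkParser and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser


class WeekLinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL contains 'uge' plus digits."""

    def __init__(self):
        super().__init__()
        self.pattern = re.compile(r"uge\d+", re.IGNORECASE)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.links.append(href)


# Hypothetical page fragment, not taken from the real site.
sample_html = """
<a href="/esbjerg/retslister/Pages/Straffesageruge32.aspx">Straffesager</a>
<a href="/esbjerg/retslister/Pages/default.aspx">Forside</a>
<a href="/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx">Tvangsauktioner</a>
"""

parser = WeekLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Only the href attribute is tested, so link text that happens to contain "uge" is ignored.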

3 Comments

Wow, that works wonderfully. I get a list of the URLs that I want. Thanks a bunch! Any idea why all the URLs have a u' in front of them? Example: u'domstol.dk/KobenhavnsByret/retslister/Pages/…'
I 'fixed' it with some simple string formatting, but I'm still not sure why the 'u' is added.
@kabp the u in front of the string means that it is a unicode string ;)
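
A small sketch of what that means in practice: the u is part of how the string is written in source code and how Python 2 displays it in a list, not part of the text itself, so no string formatting is needed to strip it. The URL below is one of the full URLs from the question; the variable names are illustrative.

```python
# A URL from the question, written as a unicode literal.
url = u"http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx"

# The prefix belongs to the literal syntax; the text itself starts with "h".
first_char = url[0]

# Comparing against the same text without the prefix gives True.
same_text = (url == "http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx")

# Printing the string itself (rather than the list that holds it) shows no prefix.
print(url)
```

In Python 3 every str is already unicode, so the prefix no longer shows up when a list of strings is printed.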

Or just use a simple for loop:

list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        print(url)  # replace with the code to execute

The regular expression would look something like uge\d\d (that is, "uge" followed by exactly two digits).

2 Comments

Note that the for loop does not look for a number after "uge".
Thanks a bunch, Willy. Much appreciated even though I went with another method.
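
To illustrate the difference raised in the comment above, here is a rough sketch comparing the plain substring check with the regex. Both URLs are hypothetical (example.com is a placeholder); the second one is included because the Danish word "Brugere" happens to contain the letters "uge" without any week number.

```python
import re

# Hypothetical URLs, made up for this comparison.
urls = [
    "http://example.com/Pages/Straffesageruge32.aspx",
    "http://example.com/Pages/Brugere.aspx",
]

# Substring check: matches any URL containing "uge", number or not.
substring_hits = [u for u in urls if "uge" in u.lower()]

# Regex check: requires one or two digits right after "uge".
regex_hits = [u for u in urls if re.search(r"uge\d\d?", u, re.IGNORECASE)]

print(len(substring_hits))  # 2
print(len(regex_hits))      # 1
```

Whether the extra strictness matters depends on the pages being scraped; if no other link ever contains "uge", the simple loop is enough.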
