0

I'd like to retrieve both the href link and the text content of an HTML tag in Python.

I'm a beginner with regex and can retrieve the href value this way:

urls = re.findall('<a class="title" href="(.*?)" title', page)

When I try to extract the tag's content as well, I get nothing:

urls = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

What is the right way to do this?

Thanks in advance.

  • Doing this 'the right way' means using an HTML parser. Commented Dec 29, 2015 at 22:27
  • Did you try using BeautifulSoup? pypi.python.org/pypi/BeautifulSoup Commented Dec 29, 2015 at 22:29
  • @KamyarGhasemlou It's not, because there it doesn't care about the tag's content. Commented Dec 29, 2015 at 22:31
  • Is using an HTML parser feasible for a small snippet like this one? Commented Dec 29, 2015 at 22:31
  • Do you mean the URL together with the tag's content? (Normally, <a ...> is the tag, so I got a bit confused by your answer.) Commented Dec 29, 2015 at 22:35

2 Answers

4

The right way to do this is to use a parser like Beautiful Soup. Trying to parse HTML with regexes is hell, and you won't get very far before you hit a wall.
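
To illustrate the parser approach, here is a minimal sketch. It uses the standard library's html.parser module instead of Beautiful Soup, so it runs without installing anything. The class name TitleLinkParser, the helper extract_title_links, and the sample markup are all made up for illustration; the real page from the question is not shown, so the class="title" filter is an assumption taken from the question's regex.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._inside = False   # True while inside a matching <a>
        self._href = None      # href of the <a> currently open
        self._text = []        # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Split the class attribute so class="title big" also matches.
        if tag == "a" and "title" in (attrs.get("class") or "").split():
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Text of nested tags such as <b> arrives here as well.
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


def extract_title_links(page):
    parser = TitleLinkParser()
    parser.feed(page)
    parser.close()
    return parser.links


# Hypothetical sample markup, not taken from the question.
sample = (
    '<a class="title" href="/news/1" title="First">First <b>story</b></a>\n'
    '<a class="other" href="/skip">Skip me</a>\n'
    '<a href="/news/2" class="title">Second story</a>'
)
print(extract_title_links(sample))
# [('/news/1', 'First story'), ('/news/2', 'Second story')]
```

Note that the attribute order and the nested <b> tag do not matter here, which is exactly where a hand-written regex tends to break. With Beautiful Soup installed, the same idea is usually written with soup.find_all("a", class_="title"), reading a["href"] and a.get_text() from each result.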

2

This worked for me to get the URLs from heise.de:

urls = re.findall('<a .*?href="(.*?)".*?>', page)

Perhaps this can be expressed more simply.

To also retrieve the tag's content:

urls = re.findall('<a .*?href="(.*?)".*?>(.*?)</a>', page)

I don't really know what the second title is doing in your regex. Perhaps you could also post an example text that does not match; then I can give a better answer as to why your regex does not work.
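
One possible explanation, as a sketch only: if the real tag carries an attribute such as title="...", the literal title> in the original pattern can never match, because > does not directly follow title. The sample markup below is hypothetical, since the question does not show the actual page.

```python
import re

# Hypothetical sample markup; the question does not show the real page.
page = '<a class="title" href="/news/1" title="First story">First story</a>'

# Pattern from the question: it expects `>` immediately after `title`,
# but here the attribute continues with `="First story"`.
strict = re.findall(r'<a class="title" href="(.*?)" title>(.*?)</a>', page)

# Looser pattern from this answer: `.*?>` skips the rest of the opening tag.
loose = re.findall(r'<a .*?href="(.*?)".*?>(.*?)</a>', page)

print(strict)  # []
print(loose)   # [('/news/1', 'First story')]
```

Keep in mind that . does not match a newline by default, so link text that spans several lines would additionally need the re.DOTALL flag.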
