Python url data extraction with regex [duplicate]

Question

I'd like to retrieve the content and href link from an HTML tag in Python.

I'm new to regex, and I can retrieve the href value this way:

urls = re.findall('<a class="title" href="(.*?)" title', page)

When I try to extract the tag's content as well, I get nothing:

urls = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

What is the right way to do this?

Thanks in advance.

Did you try using BeautifulSoup? pypi.python.org/pypi/BeautifulSoup
– Iron Fist, Commented Dec 29, 2015 at 22:29
@KamyarGhasemlou It's not, because there it doesn't care about the tag's content.
– aajjbb, Commented Dec 29, 2015 at 22:31
Is using an HTML parser feasible for a small snippet like this one?
– aajjbb, Commented Dec 29, 2015 at 22:31
Do you mean the URL with the tag's content? (Normally, <a...> is the tag, so I got a bit confused by your answer.)
– Kamyar Ghasemlou, Commented Dec 29, 2015 at 22:35

Turn · Accepted Answer · 2015-12-29 22:29:10Z

4

The right way to do this is to use a parser like Beautiful Soup. Parsing HTML with regexes is painful, and you won't get very far before you hit a wall.
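
As a rough illustration of the parser approach, here is a minimal sketch that uses the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything. The class name TitleLinkParser and the sample page string are hypothetical and not taken from the question; the sketch assumes the links of interest are <a> tags with class="title".

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a class="title"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching <a> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample input; the question did not include the real page.
page = '<p><a class="title" href="/news/1" title="First">First <b>story</b></a></p>'

parser = TitleLinkParser()
parser.feed(page)
parser.close()
print(parser.links)  # [('/news/1', 'First story')]
```

Unlike a regex, this does not depend on attribute order or on the exact spacing inside the tag, and nested markup inside the link is handled by collecting only its text.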

Klaus · Accepted Answer · 2015-12-29 22:49:01Z

2

This worked for me to get the URLs from heise.de:

urls = re.findall('<a .*?href="(.*?)".*?>', page)

There may be a simpler way to express that.

To retrieve the tag content as well:

urls = re.findall('<a .*?href="(.*?)".*?>(.*?)</a>', page)

I really do not know what the second title does in your regex. Perhaps you can also give an example text that does not match; then I can give you a better answer as to why your regex does not work.
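
A small sketch of one likely reason the two-group pattern from the question returns nothing, under the assumption that the real tag carries a title attribute with a value (the question's first pattern stops at the word title, which suggests this). The sample page string is hypothetical, since the question did not include the actual HTML.

```python
import re

# Hypothetical sample; assumes the title attribute has a value.
page = '<a class="title" href="/news/1" title="First">First story</a>'

# The question's second pattern requires the literal text ' title>' right
# after the href value, so title="First"> cannot match it.
strict = re.findall(r'<a class="title" href="(.*?)" title>(.*?)</a>', page)

# The looser pattern skips any other attributes before and after href.
loose = re.findall(r'<a .*?href="(.*?)".*?>(.*?)</a>', page)

print(strict)  # []
print(loose)   # [('/news/1', 'First story')]
```

Note that `.` does not match newlines by default, so a link whose text spans several lines would also need the re.DOTALL flag.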

edited Dec 29, 2015 at 22:49

answered Dec 29, 2015 at 22:35
