Python Web Scraping: Within href read only those values that have "http" in it

Question

I am trying to scrape a webpage as a learning exercise. The page contains multiple "a" tags. Consider the following HTML:

<a href='\abc\def\jkl'> Something </a>
<a href ='http://www.google.com'> Something</a>

Now I want to read only those href attributes that contain "http". My current code is:

for link in soup.find_all("a"):
    print(link.get("href"))

I would like to change it to read only "http" links.

Mohammad Yusuf · Accepted Answer · 2017-01-14 04:04:37Z

2

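You can do it with a regular expression like this:

import re
from bs4 import BeautifulSoup

# Raw string so "\a" in the first href is not read as an escape sequence.
res = r"""<a href="\abc\def\jkl">Something</a>
<a href="http://www.google.com">something</a>"""

# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(res, "html.parser")
# "^https?:" matches hrefs that start with http: or https:.
print(soup.find_all('a', {'href': re.compile(r'^https?:')}))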

Output:

[<a href="http://www.google.com">something</a>]
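As an alternative to a compiled regex, find_all also accepts a function as the href filter; Beautiful Soup calls it with each tag's href value, or None when the attribute is missing. The following is only a sketch: is_http_link is a hypothetical helper name, and the extra https and href-less tags are assumptions added to the question's sample HTML.

```python
from bs4 import BeautifulSoup

# Raw string so the backslashes in the first href stay literal.
html = r"""<a href="\abc\def\jkl">Something</a>
<a href="http://www.google.com">something</a>
<a href="https://example.com">secure</a>
<a>no href</a>"""


def is_http_link(href):
    """Hypothetical helper: True for absolute http:// or https:// URLs."""
    return href is not None and href.startswith(("http://", "https://"))


soup = BeautifulSoup(html, "html.parser")
# The function receives the href value (or None) for every <a> tag.
links = [a["href"] for a in soup.find_all("a", href=is_http_link)]
print(links)
```

Because the helper handles None itself, tags without an href are skipped rather than raising an error.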

answered Jan 14, 2017 at 4:04

alecxe · Accepted Answer · 2017-01-14 04:06:59Z

2

You can also use the "starts with" CSS selector:

print([a["href"] for a in soup.select('a[href^=http]')])

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: res = """
   ...: <a href="\abc\def\jkl">Something</a>
   ...: <a href="http://www.google.com">something</a>
   ...: """

In [3]: soup = BeautifulSoup(res, "html.parser")

In [4]: print([a["href"] for a in soup.select('a[href^=http]')])
[u'http://www.google.com']
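Note that a[href^=http] matches any href that merely begins with the letters "http". If only real http:// and https:// URLs are wanted, one possible refinement is to quote the prefix and list both schemes. This is a sketch in Python 3; the sample HTML, including the httpfoo.html lookalike and example.com, is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<a href="/relative/path">relative</a>
<a href="http://www.google.com">plain</a>
<a href="https://example.com">secure</a>
<a href="httpfoo.html">lookalike</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Quoting the value allows ":" and "/" in it; the comma means "either selector".
selector = 'a[href^="http://"], a[href^="https://"]'
hrefs = [a["href"] for a in soup.select(selector)]
print(hrefs)

# For comparison: the unquoted prefix also picks up the lookalike link.
loose = [a["href"] for a in soup.select("a[href^=http]")]
print(loose)
```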

answered Jan 14, 2017 at 4:06

nipy · Accepted Answer · 2017-01-14 05:42:18Z

1

Just test whether each link's href contains the string "http". This needs only one extra line in your code:

for link in soup.find_all('a'):
    if 'http' in link.get('href', ''):
        print(link.get('href'))
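A substring test accepts any href that contains "http" anywhere, including relative links that carry a URL in their query string. If that matters, one option is to check the URL scheme instead. The sketch below uses Python 3's standard urllib.parse; the href values and the has_http_scheme helper are hypothetical, standing in for whatever link.get("href") returns on a real page.

```python
from urllib.parse import urlparse

# Hypothetical href values, as they might come back from link.get("href").
hrefs = [
    "http://www.google.com",
    "https://example.com/page",
    "/redirect?to=http://elsewhere.example",  # relative, but contains "http"
    "mailto:someone@example.com",
    None,  # an <a> tag with no href attribute
]


def has_http_scheme(href):
    """Hypothetical helper: True only when the URL scheme is http or https."""
    return bool(href) and urlparse(href).scheme in ("http", "https")


# Substring test: keeps anything with "http" somewhere in it.
substring_hits = [h for h in hrefs if h and "http" in h]
# Scheme test: keeps only absolute http/https URLs.
scheme_hits = [h for h in hrefs if has_http_scheme(h)]

print(substring_hits)
print(scheme_hits)
```

The substring test keeps the redirect link while the scheme test drops it; which behaviour is right depends on what "http links" should mean for the page being scraped.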

edited Jan 14, 2017 at 5:42

answered Jan 14, 2017 at 5:23

Shashank · Accepted Answer · 2017-01-14 05:48:50Z

0

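Another way to do this:

for link in soup.find_all("a", href=True):
    if 'http' in link['href']:
        print(link['href'])

Here link['href'] returns the value of the tag's href attribute. Passing href=True to find_all skips tags that have no href, which would otherwise raise a KeyError.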

answered Jan 14, 2017 at 5:48
