I am trying to scrape a webpage, just for learning. The page contains multiple "a" tags. Consider the markup below:

<a href='\abc\def\jkl'> Something </a>
<a href ='http://www.google.com'> Something</a>

Now I want to read only those href attributes that contain http. My current code is:

for link in soup.find_all("a"):
    print(link.get("href"))

I would like to change it to read only "http" links.

4 Answers

2

You can do it with a regex like this:

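import re
from bs4 import BeautifulSoup

# Raw string so the backslashes in the first href are kept literally.
res = r"""<a href="\abc\def\jkl">Something</a>
<a href="http://www.google.com">something</a>"""

soup = BeautifulSoup(res, "html.parser")
# Match only href values that start with http:// or https://.
print(soup.find_all('a', {'href': re.compile(r'^https?://')}))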

Output:

[<a href="http://www.google.com">something</a>]
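The pattern given to re.compile decides which schemes get through, so it is worth checking it against a few values first. Below is a minimal sketch using only the re module; the sample hrefs and the variable names are hypothetical, made up for illustration. It compares a pattern anchored on "http:" with one that also accepts "https":

```python
import re

# Hypothetical sample hrefs, not taken from any real page.
hrefs = [r"\abc\def\jkl", "http://www.google.com", "https://example.com", "httpfoo"]

http_only = re.compile(r"^http:")          # plain http scheme only
http_or_https = re.compile(r"^https?://")  # http:// or https://

# Keep the hrefs each pattern matches at the start of the string.
only_http = [h for h in hrefs if http_only.search(h)]
both = [h for h in hrefs if http_or_https.search(h)]

print(only_http)
print(both)
```

Either compiled pattern can be passed as the href filter to find_all in the same way as above.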

2

You can also use the "starts with" CSS selector:

print([a["href"] for a in soup.select('a[href^=http]')])

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: res = """
   ...: <a href="\abc\def\jkl">Something</a>
   ...: <a href="http://www.google.com">something</a>
   ...: """

In [3]: soup = BeautifulSoup(res, "html.parser")

In [4]: print([a["href"] for a in soup.select('a[href^=http]')])
[u'http://www.google.com']

1

Just run this simple test to see if the link contains the string http. One extra line is required in your code to do this:

for link in soup.find_all('a'):
    href = link.get('href')
    # link.get returns None when the tag has no href attribute.
    if href and 'http' in href:
        print(href)
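Note that a substring test also accepts relative links that merely contain "http" somewhere in the path. If that matters, one option is to check the URL scheme instead. The following is a rough sketch using the standard library's urllib.parse; the helper name is_http_link is hypothetical, not part of Beautiful Soup:

```python
from urllib.parse import urlparse


def is_http_link(href):
    """Return True if href is an absolute http or https URL."""
    if not href:
        # Covers None (tag without href) and empty strings.
        return False
    return urlparse(href).scheme in ("http", "https")


# Hypothetical sample values for a quick check.
for sample in ["http://www.google.com", "/docs/http-guide", r"\abc\def\jkl", None]:
    print(repr(sample), is_http_link(sample))
```

In the loop above, the condition would then become if is_http_link(link.get('href')).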

0

Another way to do this:

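for link in soup.find_all("a", href=True):
    if 'http' in link['href']:
        print(link['href'])

Here link['href'] returns the value of the tag's href attribute. Passing href=True skips tags that have no href, for which link['href'] would raise a KeyError.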
