Add http://xxx.xx/ to a string using Python

Question

How can I add "http://test.url/" to the result of link.get('href') below, but only if it doesn't contain "http"?

import urllib2
from bs4 import BeautifulSoup

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
  print link.get('href')

falsetru · Accepted Answer · 2013-12-07 09:17:26Z

6

Use urlparse.urljoin:

>>> import urlparse
>>> urlparse.urljoin('http://example.com/', '/a/b')
'http://example.com/a/b'
>>> urlparse.urljoin('http://example.com/', 'http://www.example.com/a/b')
'http://www.example.com/a/b'

In Python 3.x, use urllib.parse.urljoin:

>>> import urllib.parse
>>> urllib.parse.urljoin('http://example.com/', '/a/b')
'http://example.com/a/b'
>>> urllib.parse.urljoin('http://example.com/', 'http://www.example.com/a/b')
'http://www.example.com/a/b'
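
urljoin also handles the other href forms a scraper usually meets. The following is a small Python 3 sketch; only the base URL comes from the question, and the sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin

base = "http://www.salatomatic.com/c/Sydney+168"

# Hypothetical hrefs covering the common cases.
samples = [
    "/a/b",                     # root-relative
    "d/page",                   # relative to the base's directory
    "../x",                     # parent directory
    "//cdn.example.com/x.js",   # scheme-relative
    "https://other.example/p",  # already absolute
    "#top",                     # fragment only
]

resolved = {href: urljoin(base, href) for href in samples}
for href, url in resolved.items():
    print(f"{href!r:28} -> {url}")
```

Absolute URLs come back unchanged, so there is no need to test for "http" before calling urljoin.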

Complete example

import urllib2
from bs4 import BeautifulSoup
import urlparse

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
    print urlparse.urljoin(url1, link.get('href'))
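
For Python 3, the same loop might look like the sketch below. To keep it runnable without a network connection or third-party packages, it swaps in the standard library's HTMLParser for BeautifulSoup and an inline HTML string for the downloaded page; LinkCollector, absolute_links and sample_html are hypothetical names, not part of the original code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)


def absolute_links(base_url, html):
    """Return every link in html resolved against base_url."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Made-up page content standing in for the downloaded HTML.
sample_html = """
<a href="/c/Melbourne">Melbourne</a>
<a href="http://example.com/about">About</a>
<a name="no-href">skipped</a>
"""

links = absolute_links("http://www.salatomatic.com/c/Sydney+168", sample_html)
for link in links:
    print(link)
```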

edited Dec 7, 2013 at 9:17

answered Dec 7, 2013 at 9:11


moliware · Accepted Answer · 2013-12-07 09:17:18Z

4

I would use urljoin:

>>> from urlparse import urljoin
>>> urljoin('http://test.url/', '/relative/path')
'http://test.url/relative/path'

In your example, you only need to do this when the href is a relative URL. urljoin leaves an absolute URL as it is, so calling it on every link is also safe.
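
One way to tell a relative URL from an absolute one is to check whether it has a scheme. A Python 3 sketch, assuming "relative" means "no scheme"; is_relative and to_absolute are hypothetical helper names:

```python
from urllib.parse import urljoin, urlparse


def is_relative(href):
    """True when href carries no scheme such as http: or mailto:."""
    return not urlparse(href).scheme


def to_absolute(base, href):
    """Join href onto base only when href is relative."""
    return urljoin(base, href) if is_relative(href) else href


print(to_absolute("http://test.url/", "/relative/path"))
print(to_absolute("http://test.url/", "https://example.com/a"))
```

Checking the scheme rather than the text "http" also leaves links such as mailto: addresses alone.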

edited Dec 7, 2013 at 9:17

answered Dec 7, 2013 at 9:11


Matt · Accepted Answer · 2013-12-07 09:33:02Z

1

import urllib2
from bs4 import BeautifulSoup

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
    href = link.get('href')
    # href is None for <a> tags without an href attribute
    if href and href.startswith('http'):
        print href

This keeps your original BeautifulSoup code and prints only the links that already start with http.

If what you want is to prefix the non-http links with http://test.url/, then you need to do this:

for link in soup.findAll('a'):
    href = link.get('href')
    if href and not href.startswith('http'):
        print 'http://test.url/' + href

You're set either way.

edited Dec 7, 2013 at 9:33

answered Dec 7, 2013 at 9:15


thefourtheye · Accepted Answer · 2013-12-07 09:10:33Z

0

for link in soup.findAll('a'):
    currenturl = link.get('href')
    if currenturl is None:  # <a> tag without an href attribute
        continue
    if not currenturl.startswith("http"):
        currenturl = "http://test.url/" + currenturl
    print currenturl
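
Plain concatenation produces a double slash when the href itself begins with "/". A Python 3 sketch of the same check that trims the leading slash first; prefix_if_needed is a hypothetical helper name, and the prefix is the one from the question:

```python
def prefix_if_needed(href, prefix="http://test.url/"):
    """Prepend prefix unless href already starts with http:// or https://."""
    if href.startswith(("http://", "https://")):
        return href
    # Drop leading slashes so the result has exactly one after the host.
    return prefix + href.lstrip("/")


naive = "http://test.url/" + "/about"  # plain concatenation, for comparison
print(naive)
print(prefix_if_needed("/about"))
print(prefix_if_needed("http://example.com/"))
```

This only glues strings together; it does not resolve "../" segments the way urljoin does.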

answered Dec 7, 2013 at 9:10


Totem · Accepted Answer · 2013-12-07 09:33:51Z

0

This works:

    url = "http://test.url/"

    # href=True skips <a> tags that have no href attribute
    link_list = [link['href'] for link in soup.findAll('a', href=True)]

    result_list = [url + i if 'http' not in i else i for i in link_list]

    for link in result_list:
        print link
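
Note that testing whether 'http' appears anywhere in the href is not the same as testing whether the href starts with it. A Python 3 sketch of the difference, using made-up hrefs:

```python
prefix = "http://test.url/"

# Hypothetical hrefs; the first is relative but has "http" in its query string.
hrefs = ["redirect?to=http://example.com/", "c/Melbourne", "http://example.com/"]

by_substring = [h if "http" in h else prefix + h for h in hrefs]
by_prefix = [
    h if h.startswith(("http://", "https://")) else prefix + h for h in hrefs
]

print(by_substring[0])  # left relative by the substring test
print(by_prefix[0])     # prefixed by the startswith test
```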

edited Dec 7, 2013 at 9:33

answered Dec 7, 2013 at 9:27
