Removing duplicate URLs in Python including URLs that contain a forward slash

Question

The following program is giving me output that includes URLs with and without the forward slash (e.g. ask.census.gov and ask.census.gov/). I need to eliminate one or the other. Thank you in advance for your help!

from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest

my_url = "https://www.census.gov/programs-surveys/popest.html"

# fetch the page and parse its HTML
html_page = myRequest(my_url)
raw_html = html_page.read()
html_page.close()
page_soup = mySoup(raw_html, "html.parser")

f = open("censusTest.csv", "w")

hyperlinks = page_soup.findAll('a')

set_urls = set()

for checked in hyperlinks:
    found_link = checked.get("href")
    result_set = myJoin(my_url, found_link)
    if result_set and result_set not in set_urls:
        set_urls.add(result_set)
        f.write(str(result_set) + "\n")

f.close()

alecxe · Accepted Answer · 2017-12-12 19:41:39Z

1

You can always right-strip the slash. It is removed if present, and nothing happens if it is not:

result_set = myJoin(my_url, found_link).rstrip("/")
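To see the effect without hitting the network, here is a minimal sketch that applies the same `rstrip("/")` normalization to a hand-written list of hrefs. The helper name `dedupe_links` and the sample hrefs are assumptions made up for illustration; they are not part of the original program.

```python
from urllib.parse import urljoin


def dedupe_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates that differ
    only by trailing slashes. (Hypothetical helper for illustration.)"""
    seen = set()
    unique = []
    for href in hrefs:
        if not href:  # skip <a> tags that have no href attribute
            continue
        # Normalize before the membership test so both spellings collide.
        normalized = urljoin(base_url, href).rstrip("/")
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique


# Sample input is assumed, modeled on the URLs mentioned in the question.
links = dedupe_links(
    "https://www.census.gov/programs-surveys/popest.html",
    ["https://ask.census.gov", "https://ask.census.gov/", None, "/data.html"],
)
print(links)
```

Note that `rstrip("/")` removes every trailing slash, not just one, and that a server is free to treat a path with and without the trailing slash as different resources. For de-duplicating scraped links that is usually an acceptable trade-off.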


Zac · Accepted Answer · 2017-12-12 19:44:00Z

0

my_url = "https://www.census.gov/programs-surveys/popest.html/"
if my_url[-1:] == '/':
    my_url = my_url[:-1]

This snippet checks whether the last character of the string is a '/' and, if it is, removes it.
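The same check can be wrapped in a small function and compared with `rstrip("/")`. This is only a sketch: the function name `strip_one_trailing_slash` is a made-up helper, and the sample URLs are assumptions based on the ones in the question.

```python
def strip_one_trailing_slash(url):
    """Remove a single trailing '/' if present. (Hypothetical helper.)"""
    if url.endswith("/"):
        return url[:-1]
    return url


print(strip_one_trailing_slash("https://ask.census.gov/"))   # one slash removed
print(strip_one_trailing_slash("https://ask.census.gov"))    # unchanged
print(strip_one_trailing_slash("https://ask.census.gov//"))  # only the last slash removed
print("https://ask.census.gov//".rstrip("/"))                # rstrip removes both
```

The practical difference is that slicing removes at most one slash, while `rstrip("/")` removes all trailing slashes. For URLs that end in a single slash the two approaches give the same result.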

Good examples of python string manipulation: http://www.pythonforbeginners.com/basics/string-manipulation-in-python

edited Dec 12, 2017 at 19:44

answered Dec 12, 2017 at 19:39


2 Comments

alecxe Over a year ago

Would not my_url become equal to / after executing this code?

Zac Over a year ago

alecxe is right, I have fixed my mistake. Thank you
