
The following program produces output that includes the same URL both with and without a trailing forward slash (e.g. ask.census.gov and ask.census.gov/). I need to eliminate one or the other so that each URL appears only once. Thank you in advance for your help!

from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest

my_url = "https://www.census.gov/programs-surveys/popest.html"

# fetch the page and parse it
html_page = myRequest(my_url)
raw_html = html_page.read()
html_page.close()
page_soup = mySoup(raw_html, "html.parser")

f = open("censusTest.csv", "w")

hyperlinks = page_soup.findAll('a')

set_urls = set()

for checked in hyperlinks:
    found_link = checked.get("href")
    result_set = myJoin(my_url, found_link)
    if result_set and result_set not in set_urls:
        set_urls.add(result_set)
        f.write(str(result_set) + "\n")

f.close()

2 Answers

1

You can always right-strip the slash. It is removed if it is present, and nothing happens if it is not:

result_set = myJoin(my_url, found_link).rstrip("/")
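To show how this fits into the deduplication loop from the question, here is a minimal sketch. The helper name unique_links and the sample hrefs are made up for illustration and are not part of the original program. Note that rstrip("/") removes every trailing slash, not just one.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Return absolute URLs in first-seen order, treating 'x' and 'x/' as equal.

    unique_links is a hypothetical helper, not part of the original program.
    """
    seen = set()
    ordered = []
    for href in hrefs:
        if not href:  # skip <a> tags that have no href attribute
            continue
        # Resolve relative links, then drop any trailing slashes.
        url = urljoin(base_url, href).rstrip("/")
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


base = "https://www.census.gov/programs-surveys/popest.html"
# Sample hrefs chosen for illustration only.
sample_hrefs = ["https://ask.census.gov", "https://ask.census.gov/", None, "/data.html"]
links = unique_links(base, sample_hrefs)
print(links)
```

In the original loop, the same effect comes from normalizing result_set before the membership test, so that both spellings map to a single entry in set_urls.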

0
my_url = "https://www.census.gov/programs-surveys/popest.html/"
if my_url[-1:] == '/':
    my_url = my_url[:-1]

This snippet checks whether the last character of the string is a '/' and, if it is, removes it.
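The same check can be wrapped in a small function and compared with rstrip. This is only a sketch, and strip_one_slash is a hypothetical name. The difference shows up when a URL ends in more than one slash: the slice removes exactly one, while rstrip("/") removes all of them.

```python
def strip_one_slash(url):
    """Remove a single trailing '/' if there is one (hypothetical helper)."""
    return url[:-1] if url.endswith("/") else url


with_slash = strip_one_slash("https://ask.census.gov/")
without_slash = strip_one_slash("https://ask.census.gov")
double_slash = strip_one_slash("https://ask.census.gov//")
all_stripped = "https://ask.census.gov//".rstrip("/")

print(with_slash)     # trailing slash removed
print(without_slash)  # unchanged
print(double_slash)   # one slash still left
print(all_stripped)   # both slashes removed
```

On Python 3.9 or later, url.removesuffix("/") behaves like strip_one_slash here.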

Good examples of Python string manipulation: http://www.pythonforbeginners.com/basics/string-manipulation-in-python

2 Comments

Would not my_url become equal to / after executing this code?
alecxe is right; I have fixed my mistake. Thank you.
