
I need help removing duplicate URLs from my output. If possible, I'd like to do it without putting everything in a list. I feel like it can be achieved with some logical statement, I'm just not sure how to make it happen. Using Python 3.6.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.parse import urljoin as join

my_url = 'https://www.census.gov/programs-surveys/popest.html'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

filename = "LinkScraping.csv"
f = open(filename, "w")
headers = "Web_Links\n"
f.write(headers)

links = page_soup.findAll('a')

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url:
        f.write(str(ab_url) + "\n")

f.close()

1 Answer

You can't achieve this without using any data structure of some sort unless you want to write to the file and re-read it over and over again (which is far less preferable than using an in-memory data structure).

Use a set:

...

urls_set = set()

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url and ab_url not in urls_set:
        f.write(str(ab_url) + "\n")
        urls_set.add(ab_url)
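The same dedup logic can be sketched without fetching the live page, using a hypothetical list of `href` values in place of the scraped anchor tags (the sample paths here are made up for illustration; only `urljoin` and the base URL come from the question):

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs standing in for link.get("href") results;
# None mimics an <a> tag with no href attribute.
hrefs = ['/data/tables.html', 'popest.html', '/data/tables.html', None]

urls_set = set()
deduped = []
for href in hrefs:
    if href is None:
        continue
    ab_url = urljoin(base, href)      # resolve relative paths against the base
    if ab_url not in urls_set:        # skip URLs we have already seen
        urls_set.add(ab_url)
        deduped.append(ab_url)
# deduped keeps one copy of each URL, in first-seen order
```

Membership tests on a set are O(1) on average, so this stays fast even for pages with thousands of links.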

3 Comments

With the same idea, a comprehension seems cleaner IMHO: urls_set = set(join(my_url, link.get("href")) for link in links) and then you can iterate directly. The order is lost, though.
@MariusSiuram True, but then you lose the order when writing the set's content to file.
@DeepSpace Perfect solution. Not sure why I didn't want to use a list/set, but it is exactly what needed to be done. Thank you!
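If first-seen order matters, the comprehension idea from the comments can be combined with `dict.fromkeys`, which deduplicates while keeping insertion order (an implementation detail in CPython 3.6, guaranteed by the language from 3.7 on). A minimal sketch with made-up hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs for illustration
hrefs = ['/a.html', '/b.html', '/a.html']

# dict.fromkeys drops repeated keys but preserves first-seen order,
# so this dedupes without losing the original link order.
unique = list(dict.fromkeys(urljoin(base, h) for h in hrefs))
```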
