
I need help removing duplicate URLs from my output. If possible, I'd like to do it without putting everything in a list. I feel like it can be achieved with some logical statement, I'm just not sure how to make it happen. Using Python 3.6.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.parse import urljoin as join

my_url = 'https://www.census.gov/programs-surveys/popest.html'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

filename = "LinkScraping.csv"
f = open(filename, "w")
headers = "Web_Links\n"
f.write(headers)

links = page_soup.findAll('a')

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url:
        f.write(str(ab_url) + "\n")

f.close()

1 Answer

You can't achieve this without using any data structure of some sort unless you want to write to the file and re-read it over and over again (which is far less preferable than using an in-memory data structure).

Use a set:

...

urls_set = set()

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url and ab_url not in urls_set:
        f.write(str(ab_url) + "\n")
        urls_set.add(ab_url)
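The same dedup logic can be sketched without fetching the live page, using a hypothetical list of `href` values in place of the scraped anchor tags (the sample paths here are made up for illustration; only `urljoin` and the base URL come from the question):

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs standing in for link.get("href") results;
# None mimics an <a> tag with no href attribute.
hrefs = ['/data/tables.html', 'popest.html', '/data/tables.html', None]

urls_set = set()
deduped = []
for href in hrefs:
    if href is None:
        continue
    ab_url = urljoin(base, href)      # resolve relative paths against the base
    if ab_url not in urls_set:        # skip URLs we have already seen
        urls_set.add(ab_url)
        deduped.append(ab_url)
# deduped keeps one copy of each URL, in first-seen order
```

Membership tests on a set are O(1) on average, so this stays fast even for pages with thousands of links.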

3 Comments

With the same idea, a comprehension seems cleaner IMHO: urls_set = set(join(my_url, link.get("href")) for link in links) and then you can iterate directly. The order is lost, though.
@MariusSiuram True, but then you lose the order when writing the set's content to file.
@DeepSpace Perfect solution. Not sure why I didn't want to use a list/set, but it is exactly what needed to be done. Thank you!
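If first-seen order matters, the comprehension idea from the comments can be combined with `dict.fromkeys`, which deduplicates while keeping insertion order (an implementation detail in CPython 3.6, guaranteed by the language from 3.7 on). A minimal sketch with made-up hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs for illustration
hrefs = ['/a.html', '/b.html', '/a.html']

# dict.fromkeys drops repeated keys but preserves first-seen order,
# so this dedupes without losing the original link order.
unique = list(dict.fromkeys(urljoin(base, h) for h in hrefs))
```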
