
This code is giving me duplicate URLs; how do I filter them out?

import re
from bs4 import BeautifulSoup

# soup is assumed to be a parsed page, e.g. BeautifulSoup(html, 'html.parser')
sg = []
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.append(url['href'])
print(sg)
  • Thanks, it did work. Commented May 8, 2019 at 13:27

3 Answers


You can check whether the URL is already in the list before appending it:

sg = []
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    href = url['href']
    print(href)
    if href not in sg:
        sg.append(href)
print(sg)
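Note that `href not in sg` rescans the whole list on every iteration. For pages with many links, keeping a separate set gives constant-time membership tests while the list preserves the order the links appeared in. A minimal sketch, with a hypothetical hrefs list standing in for the find_all results:

```python
# Hypothetical hrefs, standing in for the values of url['href'] that
# soup.find_all(...) would yield in the original code.
hrefs = [
    "https://www.somewebsite/a",
    "https://www.somewebsite/b",
    "https://www.somewebsite/a",  # duplicate
]

sg = []
seen = set()  # O(1) membership tests, unlike `href not in sg` on a list
for href in hrefs:
    if href not in seen:
        seen.add(href)
        sg.append(href)

print(sg)  # ['https://www.somewebsite/a', 'https://www.somewebsite/b']
```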



You can use a set instead of a list; a set silently ignores duplicate additions:

sg = set()
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.add(url['href'])
print(sg)



Instead of a list, using a set would solve the issue.

sg = set()
for url in soup.find_all('a', attrs={'href': re.compile("^https://www.somewebsite")}):
    print(url['href'])
    sg.add(url['href'])
print(sg)


  • I think the webpage has two separate navs, one for desktop and another for mobile, and BeautifulSoup is grabbing both, I guess.
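If the duplicates really do come from two nav blocks, another option is to scope the search to just one of them instead of deduplicating afterwards. A minimal sketch with made-up markup; the desktop-nav / mobile-nav class names are assumptions, not the real site's markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with the same links repeated in two navs.
html = """
<nav class="desktop-nav">
  <a href="https://www.somewebsite/a">A</a>
  <a href="https://www.somewebsite/b">B</a>
</nav>
<nav class="mobile-nav">
  <a href="https://www.somewebsite/a">A</a>
  <a href="https://www.somewebsite/b">B</a>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")

# Restrict find_all to one nav so each link is seen only once.
desktop = soup.find("nav", class_="desktop-nav")
sg = [a["href"] for a in desktop.find_all(
    "a", attrs={"href": re.compile("^https://www.somewebsite")})]
print(sg)  # ['https://www.somewebsite/a', 'https://www.somewebsite/b']
```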
