
I am completely new to Python and am just trying out my coding skills by developing a few programs.

I have written the following program in Python 2.7 to fetch the profile URLs from this directory: http://www.uschirodirectory.com/entire-directory/list/alpha/a.html

However, I am noticing a lot of duplicate entries in the list of URLs fetched. Could someone please review the code and tell me if there's something I am doing wrong here, or whether there is a way this code could be optimized further?

Many thanks

import requests
from bs4 import BeautifulSoup

def web_crawler(max_pages):
    p = '?site='
    page = 1
    alpha = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
             'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    while page <= max_pages:
        for i in alpha:
            # Build the listing URL for this letter and page number
            url = 'http://www.uschirodirectory.com/entire-directory/list/alpha/' + str(i) + '.html' + p + str(page)
            code = requests.get(url)
            text = code.text
            soup = BeautifulSoup(text)
            # Profile links are the anchors with class "btn"
            for link in soup.findAll('a', {'class': 'btn'}):
                href = 'http://www.uschirodirectory.com' + link.get('href')
                print(href)
        page += 1

# Run the crawler
web_crawler(1)

2 Answers


Basically your code is OK. You are probably getting lots of duplicate links because the directory is designed to return results not just for the first letter of the doctor's name, but also for the first letter of the company title and other important database fields.
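If you only want each profile printed once, one option (a minimal sketch of your crawler with a set of already-seen URLs added, not something your current code does) is:

import requests
from bs4 import BeautifulSoup

def web_crawler(max_pages):
    seen = set()  # profile URLs already printed
    for page in range(1, max_pages + 1):
        for letter in 'abcdefghijklmnopqrstuvwxyz':
            url = ('http://www.uschirodirectory.com/entire-directory/list/alpha/'
                   + letter + '.html?site=' + str(page))
            soup = BeautifulSoup(requests.get(url).text, 'html.parser')
            for link in soup.findAll('a', {'class': 'btn'}):
                href = 'http://www.uschirodirectory.com' + link.get('href')
                if href not in seen:  # skip URLs already seen under another letter or page
                    seen.add(href)
                    print(href)

web_crawler(1)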


1 Comment

Thanks for reviewing this for me. I am a novice Python learner, and feedback from experienced Python programmers like you is immensely valuable to me.

You can store the data in a list and skip duplicate URLs with a check like this (here href is the profile URL built inside your loop):

parsedData = []

data = {'url': href}
if not any(d['url'] == data['url'] for d in parsedData):
    parsedData.append(data)
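A quick self-contained way to see the check in action (the URLs below are made up purely for illustration):

parsedData = []
for href in ['http://www.uschirodirectory.com/doc1.html',
             'http://www.uschirodirectory.com/doc2.html',
             'http://www.uschirodirectory.com/doc1.html']:
    data = {'url': href}
    # append only if no existing entry has the same URL
    if not any(d['url'] == data['url'] for d in parsedData):
        parsedData.append(data)

print(parsedData)  # the duplicate URL appears only once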

