I have written a function that scrapes a website and extracts the URLs found in its source.

    print links  # scrape()

    http://www.web1.to/something
    http://www.web2.gov.uk/something
    http://www.web3.com/something
    http://www.web4.com/something
    http://www.web5.com/something
    http://www.web6.com/something

While fetching, it also retrieves links to sites I don't need, as well as links containing the string .rdf, and I want to remove both.

    def scrape():
        .
        .
        links = re.findall('href="(http.*?)"', sourceCode)
        for link in set(links):
            if 'web1.to' in link:
                pass
            elif 'web2.gov.' in link:
                pass
            elif '.rdf' in link:
                pass
            else:
                return link
                #print link; #it seems to work(*)

    # this section calls the scrape function and prints the links
    for web in scrape():
        print web
        time.sleep(1)

This function seems to work if I use print inside scrape() (see the commented-out line #print link). But when I call it from outside, it returns only one URL:

    http://www.web6.com/something
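
One possible reading of this behaviour, sketched under assumptions: `return` ends the function the first time it is reached, so the loop never gets to the remaining links, and because `set(links)` has no fixed order, which link comes back is arbitrary. The sketch below contrasts that with a generator that uses `yield`. `SAMPLE_SOURCE`, `UNWANTED`, `scrape_with_return` and `scrape_with_yield` are hypothetical names, and the hard-coded HTML only stands in for the real `sourceCode`, which is not shown in the post. The single-argument `print(...)` calls run on both Python 2 and Python 3.

```python
import re

# Hypothetical page source standing in for the real sourceCode,
# which is fetched in the elided part of scrape().
SAMPLE_SOURCE = """
<a href="http://www.web1.to/something">1</a>
<a href="http://www.web2.gov.uk/something">2</a>
<a href="http://www.web3.com/something">3</a>
<a href="http://www.web4.com/something">4</a>
<a href="http://www.web4.com/feed.rdf">feed</a>
"""

# Substrings that mark a link as unwanted (taken from the post).
UNWANTED = ('web1.to', 'web2.gov.', '.rdf')


def scrape_with_return(source_code):
    """Mirrors the original loop: return ends the function at once."""
    links = re.findall('href="(http.*?)"', source_code)
    for link in set(links):
        if any(marker in link for marker in UNWANTED):
            continue
        return link  # only the first wanted link ever gets out


def scrape_with_yield(source_code):
    """Generator variant: hands back every wanted link, one at a time."""
    links = re.findall('href="(http.*?)"', source_code)
    for link in set(links):
        if any(marker in link for marker in UNWANTED):
            continue
        yield link


# A single URL string; which one depends on the set's iteration order.
print(scrape_with_return(SAMPLE_SOURCE))

# Every wanted URL, in arbitrary order.
for web in scrape_with_yield(SAMPLE_SOURCE):
    print(web)
```

With the generator variant, the calling loop `for web in scrape(): ...` from the post would receive one link per iteration. Building a list inside the function and returning it after the loop would be an equally reasonable choice.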
I then removed the for loop:

    if 'web1.to' in link:
        pass
    elif 'web2.gov.' in link:
        pass
    elif 'web3.com' in link:
        pass
    else:
        return link

and used this modified function to print from outside. The conditions I gave here don't work, and it prints all the websites.
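
A possible explanation for the loop-free version, assuming that `link` there referred to the whole list returned by `re.findall` (the post does not show this): against a list, `in` tests whether some element equals the string, not whether the string occurs inside an element, so every condition is false and the `else` branch returns everything. The list `links` and the helper `keep_wanted` below are hypothetical, introduced only for illustration.

```python
# Hypothetical list standing in for the result of re.findall().
links = [
    'http://www.web1.to/something',
    'http://www.web2.gov.uk/something',
    'http://www.web3.com/something',
    'http://www.web4.com/something',
]

# Against a list, `in` looks for an element equal to the whole string.
print('web1.to' in links)      # False: no element is exactly 'web1.to'

# Against a single string, `in` is a substring test.
print('web1.to' in links[0])   # True

# Substrings to filter out, as in the loop-free version of the post.
UNWANTED = ('web1.to', 'web2.gov.', 'web3.com')


def keep_wanted(all_links):
    """Return only the links that contain none of the unwanted substrings."""
    return [link for link in all_links
            if not any(marker in link for marker in UNWANTED)]


print(keep_wanted(links))      # ['http://www.web4.com/something']
```

The list comprehension still tests each link individually, so the per-link substring checks from the original loop are preserved.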
I know I have made a logical error in my code, but I can't see it. Can you help me?