
I have made a function to scrape websites. The function scrapes a website and fetches the URLs found inside it.

print links      #scrape()
http://www.web1.to/something
http://www.web2.gov.uk/something
http://www.web3.com/something
http://www.web4.com/something
http://www.web5.com/something
http://www.web6.com/something

While fetching, it also retrieves links to unwanted sites, or links containing strings such as .rdf, which I want to remove.

  def scrape():
    .
    .
            links = re.findall('href="(http.*?)"', sourceCode)

            for link in set(links):                         
                if 'web1.to' in link:
                    pass
                elif 'web2.gov.' in link:
                    pass
                elif '.rdf' in link:
                    pass
                else:                       
                    return link
                    #print link  # it seems to work (*)

# this section calls the scrape function and prints the links
for web in scrape():
    print web
    time.sleep(1)

I have created this function, which seems to work if I use print inside the scrape function (see the commented line #print link). But when I call it from outside, it only returns one URL:

http://www.web6.com/something

I then removed the for loop:

            if 'web1.to' in link:
                pass
            elif 'web2.gov.' in link:
                pass
            elif 'web3.com' in link:
                pass
            else:                       
                return link

and used this modified function to print from outside. The conditions I have given here don't work, and it prints all the websites.

I know I have made some logical error in the code, but I don't see it. Can you help me?

2 Answers


Your function returns the first valid link it finds, because return exits the function immediately. Try adding a new list at the top of your scrape function:

valid = []

Every time you find a valid link, append it to your valid links list:

valid.append(link)

When you have finished checking all links, then return your whole list:

return valid

Try something like this:

valid = []
for link in set(links):
    if 'web1.to' in link:
        pass
    elif 'web2.gov.' in link:
        pass
    elif '.rdf' in link:
        pass
    else:                       
        valid.append(link)

return valid
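An alternative sketch that keeps the question's original call style (`for web in scrape(): ...`) is to make scrape a generator with yield. This is illustrative only, in Python 3 syntax, with made-up HTML and the skip-strings from the question standing in for the real page:

```python
import re

def scrape(source_code):
    # find every href that starts with http, as in the original function
    links = re.findall(r'href="(http.*?)"', source_code)
    skip = ('web1.to', 'web2.gov.', '.rdf')
    for link in set(links):
        if not any(s in link for s in skip):
            # yield hands back one link and resumes the loop on the next
            # iteration, unlike return, which ends the function at once
            yield link

# hypothetical page source standing in for the downloaded HTML
html = ('<a href="http://www.web1.to/a">x</a>'
        '<a href="http://www.web3.com/b">y</a>'
        '<a href="http://www.web5.com/c">z</a>')

for web in scrape(html):
    print(web)
```

With yield, the caller's loop receives every surviving link, not just the first one.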

3 Comments

Not working; it does the same thing, only prints one link.
@Eka Are you sure you got your indentation correct? Make sure that you return after your for loop has finished, not right after you append a valid link. I'll add an example now.
You got it. I made an indentation error; I put the return under the else statement. Thank you :)
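The indentation mistake discussed in these comments can be shown side by side. This is a sketch (Python 3, hypothetical helper names and data), not the asker's exact code:

```python
def keep_all_wrong(links):
    valid = []
    for link in links:
        if '.rdf' in link:
            pass
        else:
            valid.append(link)
            return valid        # BUG: returns after the first good link

def keep_all_right(links):
    valid = []
    for link in links:
        if '.rdf' in link:
            pass
        else:
            valid.append(link)
    return valid                # runs once the loop has finished

links = ['a.html', 'b.rdf', 'c.html']
print(keep_all_wrong(links))   # ['a.html']
print(keep_all_right(links))   # ['a.html', 'c.html']
```

The only difference is how far the return is indented: inside the else block it fires on the first kept link, while aligned with the for loop it fires only after every link has been checked.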

Do this:

def scrape():
    .
    .
            links = re.findall('href="(http.*?)"', sourceCode)
            return links

links = scrape()
for link in links:
    if 'web1.to' in link:
        pass
    elif 'web2.gov.' in link:
        pass
    elif 'web3.com' in link:
        pass
    else:                       
        print link
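The same call-site filtering can be written more compactly with a list comprehension. A sketch in Python 3 syntax, where the HTML string and the blocked substrings are made-up stand-ins:

```python
import re

def scrape(source_code):
    # return every href; filtering happens where the links are used
    return re.findall(r'href="(http.*?)"', source_code)

blocked = ('web1.to', 'web2.gov.', 'web3.com')

# hypothetical page source standing in for the downloaded HTML
html = ('<a href="http://www.web2.gov.uk/a">x</a>'
        '<a href="http://www.web4.com/b">y</a>')

wanted = [link for link in scrape(html)
          if not any(b in link for b in blocked)]
print(wanted)
```

Keeping the exclusion strings in one tuple means adding a new unwanted site is a one-line change instead of another elif branch.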

Case 2:

You have removed the for loop from inside the function and are now trying to access link to check the various conditions, but link is not defined there, so you get an error.

