
I have made a function to scrape websites. The function scrapes a website and fetches the URLs found inside it.

print links      #scrape()
http://www.web1.to/something
http://www.web2.gov.uk/something
http://www.web3.com/something
http://www.web4.com/something
http://www.web5.com/something
http://www.web6.com/something

While fetching, it also retrieves links to unwanted sites, or links containing strings such as .rdf, which I want to remove.

  def scrape():
    .
    .
            links = re.findall('href="(http.*?)"', sourceCode)

            for link in set(links):                         
                if 'web1.to' in link:
                    pass
                elif 'web2.gov.' in link:
                    pass
                elif '.rdf' in link:
                    pass
                else:                       
                    return link
                    #print link  # it seems to work (*)

# this section calls the scrape function and prints the links
for web in scrape():
    print web
    time.sleep(1)

I have created this function, which seems to work if I use print inside the scrape function (see the commented line #print link). But when I call it from outside, it only returns one URL:

http://www.web6.com/something

I then removed the for loop:

            if 'web1.to' in link:
                pass
            elif 'web2.gov.' in link:
                pass
            elif 'web3.com' in link:
                pass
            else:                       
                return link

and used this modified function to print from outside. The conditions I have given here don't work, and it prints all the websites.

I know I have made some logical error in the code, but I don't see it. Can you help me?

2 Answers


Your function returns the first valid link it finds, because return exits the function immediately. Try adding a new list at the top of your scrape function:

valid = []

Every time you find a valid link, append it to your valid links list:

valid.append(link)

When you have finished checking all links, then return your whole list:

return valid

Try something like this:

valid = []
for link in set(links):
    if 'web1.to' in link:
        pass
    elif 'web2.gov.' in link:
        pass
    elif '.rdf' in link:
        pass
    else:                       
        valid.append(link)

return valid
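An alternative sketch that keeps the question's original call style (`for web in scrape(): ...`) is to make scrape a generator with yield. This is illustrative only, in Python 3 syntax, with made-up HTML and the skip-strings from the question standing in for the real page:

```python
import re

def scrape(source_code):
    # find every href that starts with http, as in the original function
    links = re.findall(r'href="(http.*?)"', source_code)
    skip = ('web1.to', 'web2.gov.', '.rdf')
    for link in set(links):
        if not any(s in link for s in skip):
            # yield hands back one link and resumes the loop on the next
            # iteration, unlike return, which ends the function at once
            yield link

# hypothetical page source standing in for the downloaded HTML
html = ('<a href="http://www.web1.to/a">x</a>'
        '<a href="http://www.web3.com/b">y</a>'
        '<a href="http://www.web5.com/c">z</a>')

for web in scrape(html):
    print(web)
```

With yield, the caller's loop receives every surviving link, not just the first one.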

3 Comments

Not working; it does the same thing, only prints one link.
@Eka Are you sure you got your indentation correct? Make sure that you return after your for loop has finished, not right after you append a valid link. I'll add an example now.
You got it. I made an indentation error; I put the return under the else statement. Thank you :)
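The indentation mistake discussed in these comments can be shown side by side. This is a sketch (Python 3, hypothetical helper names and data), not the asker's exact code:

```python
def keep_all_wrong(links):
    valid = []
    for link in links:
        if '.rdf' in link:
            pass
        else:
            valid.append(link)
            return valid        # BUG: returns after the first good link

def keep_all_right(links):
    valid = []
    for link in links:
        if '.rdf' in link:
            pass
        else:
            valid.append(link)
    return valid                # runs once the loop has finished

links = ['a.html', 'b.rdf', 'c.html']
print(keep_all_wrong(links))   # ['a.html']
print(keep_all_right(links))   # ['a.html', 'c.html']
```

The only difference is how far the return is indented: inside the else block it fires on the first kept link, while aligned with the for loop it fires only after every link has been checked.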

Do this:

def scrape():
    .
    .
            links = re.findall('href="(http.*?)"', sourceCode)
            return links

links = scrape()
for link in links:
    if 'web1.to' in link:
        pass
    elif 'web2.gov.' in link:
        pass
    elif 'web3.com' in link:
        pass
    else:                       
        print link
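The same call-site filtering can be written more compactly with a list comprehension. A sketch in Python 3 syntax, where the HTML string and the blocked substrings are made-up stand-ins:

```python
import re

def scrape(source_code):
    # return every href; filtering happens where the links are used
    return re.findall(r'href="(http.*?)"', source_code)

blocked = ('web1.to', 'web2.gov.', 'web3.com')

# hypothetical page source standing in for the downloaded HTML
html = ('<a href="http://www.web2.gov.uk/a">x</a>'
        '<a href="http://www.web4.com/b">y</a>')

wanted = [link for link in scrape(html)
          if not any(b in link for b in blocked)]
print(wanted)
```

Keeping the exclusion strings in one tuple means adding a new unwanted site is a one-line change instead of another elif branch.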

Case 2:

You have removed the for loop from inside the function and are now trying to access link to check the various conditions, but link is not defined there, so you get an error.

