I can extract the content from a list of URLs and store it in text files. The problem is that my Python code reads only the last URL from the text file and stores only that page's contents. I am using the Goose extraction tool to pull text from the URLs.

Can anyone help me out with this? Is there a problem with the for loop?

import os
from threading import Thread
# Goose and Configuration are assumed to be imported from the Goose package


class FetchUrl(Thread):
    def __init__(self, url, name):
        Thread.__init__(self)
        self.name = name
        self.url = url

    def run(self):
        config = Configuration()
        config.browser_user_agent = 'Mozilla 5.0'
        config.http_timeout = 20
        g = Goose(config)
        fname = os.path.basename(self.name)
        with open(fname + ".txt", "w+") as f_handler:
            for tmp in url:
                article = g.extract(url=tmp)
                contents = article.cleaned_text
                f_handler.write(contents)
        msg = "%s was finished downloaded with this link %s!" % (self.name,
                                                                 self.url)
        print(msg)


def main(url):
    for item, url in enumerate(url):
        name = "Thread %s" % (item + 1)
        fetch = FetchUrl(url, name)
        fetch.start()


if __name__ == "__main__":
    u_path = 'url_list/url.txt'
    url = []
    for line in open(u_path):
        line = line.strip()
        url.append(line)
        print(line)
    main(url)
  • This code is heavily misindented, so if you're working with this exact code, it won't even run. Please post the actual code and format it properly. Commented Aug 3, 2019 at 9:11
  • The edited code in my question is the actual code, @ForceBru. Commented Aug 3, 2019 at 9:27
  • Why are you not using self.url instead of url in the for loop of the run method? Commented Aug 3, 2019 at 9:45

1 Answer

Your variable contents is being overwritten, so when execution exits the for tmp in url: loop, only the contents of the last URL are left in the contents variable. Try something like:

# open file in write mode
    # loop over urls
        # extract url contents
        # clean it
        # write to file
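
The outline above could be filled in roughly as follows. This is a minimal sketch rather than the asker's code: `save_all` and its `extract` parameter are hypothetical names, and `extract` stands in for whatever returns the cleaned text of a URL (with Goose, presumably something like `lambda u: g.extract(url=u).cleaned_text`, using the `g` object from the question).

```python
def save_all(urls, out_path, extract):
    """Write the extracted text of every URL into one file.

    `extract` is a hypothetical callable: it takes a URL and returns its text.
    """
    # Open the file once, outside the loop, so each write is added after
    # the previous one instead of replacing it.
    with open(out_path, "w") as f_handler:
        for link in urls:
            contents = extract(link)
            # A newline keeps the articles from running together.
            f_handler.write(contents + "\n")
```

The point of the sketch is that the write happens inside the loop, once per URL, while the file is opened only once.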

5 Comments

Can you look at the edited code in my question, which is what I am following?
Isn't write access to a file locked while it is being accessed from one thread?
The code now follows your suggestion, and what it does now is store only the content of the first URL.
Take a look at the accepted answer.
My suggestion was for the normal case; you hadn't mentioned threads yet.
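
For the threaded version in the question, one possible shape is sketched below. It assumes each thread should handle only its own URL (`self.url`) and write to its own file, so no two threads ever share a file handle. `extract`, `out_dir` and `fetch_all` are hypothetical names introduced for illustration; `extract` stands in for the Goose call.

```python
import os
from threading import Thread


class FetchUrl(Thread):
    def __init__(self, url, name, extract, out_dir="."):
        Thread.__init__(self)
        self.name = name
        self.url = url
        self.extract = extract  # hypothetical: callable(url) -> text
        self.out_dir = out_dir  # hypothetical: directory for the output files

    def run(self):
        # Use this thread's own URL, not the global list of URLs.
        contents = self.extract(self.url)
        fname = os.path.join(self.out_dir, self.name + ".txt")
        with open(fname, "w") as f_handler:
            f_handler.write(contents)


def fetch_all(urls, extract, out_dir="."):
    # One thread (and therefore one output file) per URL.
    threads = [FetchUrl(u, "Thread %s" % (i + 1), extract, out_dir)
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every file has been written
```

Because every thread writes a different file here, no lock is needed; if all threads had to append to one shared file, the writes would need to be guarded, for example with a `threading.Lock`.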
