Web scraping loop with Haskell

Question

I want to learn Haskell and I have another small project (currently in Elixir) that I'd like to port as an exercise. It is a simple web scraper that scrapes a list of urls.

Imagine having a list of zip codes, around 2500 items. For each entry, a web page should be scraped, in the form of http://www.acme.org/zip-info?zip={ZIP}. I managed to write the code to crawl a single web page using Scalpel.

But how would I go about scraping the 2500 items? In Elixir I map over the list of postal codes and after each page request there is a short sleep of 1 second, just to ease off pressure on the targeted website. It is not important to me to scrape the website as fast as possible.

How would I do this in Haskell? I read about threadSleep, but how do I use it in combination with traversing the list and with the main function, since the sleep is a side effect?

Thanks for the insights!

Noughtmare · Accepted Answer · 2021-11-09 10:59:58Z


Presumably you already have a function like:

scrapeZip :: Zip -> IO ZipResult

Then you can write a function with traverse to get an IO action that returns a list of zip results:

scrapeZips :: [Zip] -> IO [ZipResult]
scrapeZips zipCodes = traverse scrapeZip zipCodes
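To see what traverse does with an IO action, here is a minimal sketch that does not touch the network. The names visit, visitAll and demo are hypothetical and only stand in for scrapeZip and scrapeZips; the log just makes the order of the side effects visible.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical stand-in for scrapeZip: record the input in a log
-- (the side effect) and return a value computed from it.
visit :: IORef [String] -> String -> IO Int
visit logRef s = do
  modifyIORef logRef (++ [s])
  return (length s)

-- Same shape as scrapeZips: one IO action that runs visit on every
-- element from left to right and collects the results in a list.
visitAll :: IORef [String] -> [String] -> IO [Int]
visitAll logRef = traverse (visit logRef)

-- Run the traversal and return both the results and the log.
demo :: IO ([Int], [String])
demo = do
  logRef <- newIORef []
  lengths <- visitAll logRef ["a", "bb", "ccc"]
  visited <- readIORef logRef
  return (lengths, visited)
```

Running demo in GHCi should show that the results come back in the order of the input list and that the effects happened in that same order.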

But you want to add a delay, which can be done using threadDelay (you can import it from Control.Concurrent):

import Control.Concurrent (threadDelay)

scrapeZipDelay :: Zip -> IO ZipResult
scrapeZipDelay zip = do
  x <- scrapeZip zip
  threadDelay 1000000 -- one second in microseconds
  return x

And then you can use this scrapeZipDelay with traverse:

scrapeZipsDelay :: [Zip] -> IO [ZipResult]
scrapeZipsDelay zipCodes = traverse scrapeZipDelay zipCodes

Instead of defining a whole new scrapeZipDelay function you can also write a pretty small version with the <* operator:

scrapeZipsDelay :: [Zip] -> IO [ZipResult]
scrapeZipsDelay zipCodes =
  traverse (\zip -> scrapeZip zip <* threadDelay 1000000) zipCodes
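Putting the pieces together, a self-contained sketch might look like the following. The Zip and ZipResult type synonyms and the body of scrapeZip are placeholders I made up so the file compiles on its own; your Scalpel-based function would take their place. Passing the delay as a parameter (scrapeZipsWithDelay is a hypothetical name) is just one possible design that makes the pause easy to change.

```haskell
import Control.Concurrent (threadDelay)

-- Hypothetical stand-ins; the real types come from your scraper.
type Zip = String
type ZipResult = String

-- Placeholder for the real Scalpel-based scraper (an assumption,
-- not the actual implementation).
scrapeZip :: Zip -> IO ZipResult
scrapeZip z = do
  putStrLn ("scraping " ++ z)
  return ("result for " ++ z)

-- Scrape every zip code in order, pausing for the given number of
-- microseconds after each request (including the last one).
scrapeZipsWithDelay :: Int -> [Zip] -> IO [ZipResult]
scrapeZipsWithDelay micros =
  traverse (\z -> scrapeZip z <* threadDelay micros)
```

From main you would then bind the results as usual, for example results <- scrapeZipsWithDelay 1000000 zipCodes inside a do block.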

edited Nov 9, 2021 at 10:59

answered Nov 9, 2021 at 10:01


2 Comments

Jeroen Bourgois Over a year ago

Thank you for this answer, it makes sense. I'll have to read up on the <* operator, but I will keep that for later. Enough to think and learn about already :). Thank you again, I'll give it a go and accept the answer if it works out.

Daniel Wagner Over a year ago

@JeroenBourgois Not much reading needed, really. If you think of it as being defined as a <* b = do { x <- a; b; return x } -- notice the parallels to scrapeZipDelay -- you will be wrong only in unimportant ways.
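A small sketch of that reading of <*, using a log to make the order of effects observable. The helper step and the names viaOperator, viaDo and demo are hypothetical and only serve the illustration.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical helper: append a label to the log, then return a value.
step :: IORef [String] -> String -> a -> IO a
step logRef label x = modifyIORef logRef (++ [label]) >> return x

-- With the operator: both actions run in order, and the result of
-- the first one is kept.
viaOperator :: IORef [String] -> IO Int
viaOperator logRef = step logRef "first" 1 <* step logRef "second" ()

-- The do-block reading from the comment above.
viaDo :: IORef [String] -> IO Int
viaDo logRef = do
  x <- step logRef "first" 1
  _ <- step logRef "second" ()
  return x

-- Run both versions with fresh logs and return what each produced.
demo :: IO ((Int, [String]), (Int, [String]))
demo = do
  refA <- newIORef []
  a <- viaOperator refA
  logA <- readIORef refA
  refB <- newIORef []
  b <- viaDo refB
  logB <- readIORef refB
  return ((a, logA), (b, logB))
```

In IO, both versions should return the first action's value and leave the same log behind, which is why scrapeZip zip <* threadDelay 1000000 behaves like scrapeZipDelay.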
