
I have the following program in Haskell:

processDate :: String -> IO ()
processDate date = do
    ...
    let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
    geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let addresses = fmap location properties
    let batchAddresses = chunksOf 100 addresses
    batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
    mapQuestKey <- getEnv "MAP_QUEST_KEY"
    geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
    return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
    latLangs
    where
        decodedResponse :: Maybe GeocodingResponse
        decodedResponse = decodeGeocodingResponse inputResponse

        latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse  

It reads a list of properties (homes and apartments) from HTML files, parses them, geocodes the addresses, and saves the results into an SQLite db.
Everything works fine except for very high memory usage (around 800M).
By commenting out code I have pinpointed the problem to the geocoding step.
I send 100 addresses at a time to the MapQuest API (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite large, so it might be one of the culprits, but 800M? I feel like it holds on to all of the results until the end, which drives the memory usage so high.

After commenting out the geocoding part of the program, memory usage is around 30M, which is fine.

You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so


I'm quite a newbie in Haskell, so I'm not sure how to optimize it.
Any ideas?

Cheers!

  • I suspect that GC just does not kick in, because you have plenty of RAM available, and not running GC is faster than running it needlessly. It's a pretty common pattern in GC-ed languages. Try limiting the heap available and see if it still fits. Commented Jan 28, 2017 at 0:54
  • @9000 That is unlikely to help. The OP is probably right about intermediate results being held for too long. The mapM in propertiesWithGeocoding is a likely culprit (and, if that's indeed the case, the answers here will likely involve streaming libraries such as pipes and conduit, which are commonly used to provide alternatives to mapM when dealing with large amounts of data). Commented Jan 28, 2017 at 1:47
  • I wish I had a runnable fragment. The solution may partly be to use a streaming library like streaming or conduit or pipes. But there are other peculiarities. Keep in mind first that aeson does tend to accumulate the whole input in memory for parsing. (This is the only way to do it that would work for all JSON.) There is a json-stream library that can get around this in some cases, depending on what you are looking for in the JSON. Commented Jan 28, 2017 at 3:20
  • Another peculiarity is the way you use properties twice over in propertiesWithGeocoding, if I'm following. First you run along it to get locations, then you chunk it, then you use mapM and sequence (this is where a streaming library like streaming helps, and it would be nice to have a runnable fragment to show how the refactoring goes). All of this, so far, can be made to 'stream' beautifully without mapM, which tends to accumulate the whole list. Commented Jan 28, 2017 at 4:11
  • Finally, though, you zip the result with properties again, if I follow. This is a problem for a properly streaming (= low memory use) program, since it entails that everything from the first pass through properties has to be saved, because the compiler sees it will be needed again. Commented Jan 28, 2017 at 4:16
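The retention pattern described in the last two comments can be sketched in isolation (the names here are hypothetical, not from the repo): traversing a list with mapM and then zipping against the same list keeps the whole list and every result alive, whereas pairing each element with its result inside the stream lets everything be consumed piecewise.

```haskell
import Streaming
import qualified Streaming.Prelude as S

-- Accumulating version: xs is traversed by mapM, then needed again
-- by zip, so its whole spine (and every result) stays live.
pairUpList :: (a -> IO b) -> [a] -> IO [(a, b)]
pairUpList f xs = do
    ys <- mapM f xs
    return (zip xs ys)

-- Streaming version: each element is paired with its result as soon
-- as the effect runs, so nothing forces retention of the whole list.
pairUpStream :: (a -> IO b) -> [a] -> Stream (Of (a, b)) IO ()
pairUpStream f xs = S.mapM (\x -> (,) x <$> f x) (S.each xs)

main :: IO ()
main = do
    ps <- S.toList_ (pairUpStream (pure . (* 2)) [1, 2, 3 :: Int])
    print ps  -- [(1,2),(2,4),(3,6)]
```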

1 Answer


It might be worth recording that this turned out to be a simple streaming problem arising from the use of mapM and sequence, which (together with replicateM, traverse, and other combinators that make you "extract a list from IO") always raise accumulation worries. A little detour through a streaming library was needed. In the repo it was necessary just to replace

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date
    allProperties <- mapM fileToProperties allFiles
    let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
    geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
    print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
    let batchProperties = chunksOf 100 properties
    batchGeocodedLocations <- mapM geocodeAddresses batchProperties
    let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
    return geocodedLocations

with something like this

import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
    allFiles <- listFiles date   -- we accept an unstreamed list
    S.print $ propertiesWithGeocoding -- this was the main pain point; see below
            $ S.filter hasPrice 
            $ S.concat 
            $ S.mapM fileToProperties -- this mapM doesn't accumulate
            $ S.each allFiles    -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
     -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =  
    S.concat $ S.concat 
             $ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
             $ S.mapped S.toList       -- convert segments to haskell lists
             $ chunksOf 100 properties -- this is the streaming `chunksOf`
    -- each S.concat strips one layer from the elements: the inner one
    -- drops the Maybe wrapper (like catMaybes), the outer one flattens
    -- the lists into a stream of single elements
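As a small self-contained check of what the two S.concat layers do (this uses only the streaming library, nothing from the repo):

```haskell
import Streaming
import qualified Streaming.Prelude as S

-- S.concat strips one Foldable layer from each stream element:
-- applied to Maybe it behaves like catMaybes, applied to lists it
-- concatenates, so two applications undo a Maybe-of-list nesting.
main :: IO ()
main = do
    xs <- S.toList_ $ S.concat $ S.concat
        $ S.each [Just [1, 2], Nothing, Just [3 :: Int]]
    print xs  -- [1,2,3]
```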

Then the memory use looks like so, each peak corresponding to a trip to MapQuest, promptly followed by a little processing and a print, whereupon ghc forgets all about it and moves on:

Of course this could be done with pipes or conduit, but here we just need a little simple mapM / sequence / traverse / replicateM avoidance, and streaming is perhaps the simplest choice for this sort of quick local refactoring. Note that this list is quite short, so the thought "but short lists are cool with mapM/traverse/etc.!" can be quite spectacularly false. Why not just get rid of them?

Whenever you are about to write mapM f over a list, it is a good idea to consider S.mapM f . S.each (or the conduit or pipes equivalent). You will then have a stream and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list, and can instead use some streaming process like printing to a file or stdout, or writing things to a database, after making whatever list-like manipulations are needed (here we use e.g. the streaming filter, and also concat both to flatten streamed lists and as a sort of catMaybes).
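To make that rule of thumb concrete, here is a minimal sketch (the function names are hypothetical) of the two shapes side by side:

```haskell
import qualified Streaming.Prelude as S

-- mapM builds the entire result list before anything downstream runs
fetchAll :: (a -> IO b) -> [a] -> IO [b]
fetchAll = mapM

-- S.mapM f . S.each produces one result at a time; with a streaming
-- consumer such as S.mapM_, each result can be GC'd immediately
processEach :: (a -> IO b) -> (b -> IO ()) -> [a] -> IO ()
processEach f consume xs = S.mapM_ consume $ S.mapM f $ S.each xs

main :: IO ()
main = processEach (pure . (* 2)) print [1, 2, 3 :: Int]
-- prints 2, 4, 6, one line at a time
```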
