
I'm trying to write a Haskell program to parse a huge text file (about 14 GB), but I can't understand how to make it free unused data from memory, or how to avoid a stack overflow during the foldr. Here is the program source:

import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.ByteString.Lex.Lazy.Double as BD
import System.Environment


data Vertex = 
    Vertex{
     vertexX :: Double,
     vertexY :: Double,
     vertexZ :: Double}
    deriving (Eq, Show, Read)

data Extent = 
    Extent{
     extentMax :: Vertex,
     extentMin :: Vertex}
    deriving (Eq, Show, Read)

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = Extent vertMax vertMin where
                        (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
                            makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                                                        (f (vertexY v1) (vertexY v2))
                                                        (f (vertexZ v1) (vertexZ v2))

readCoord :: LBS.ByteString -> Double
readCoord l = case BD.readDouble l of
                Nothing -> 0
                Just (value, _) -> value

readCoords :: LBS.ByteString -> [Double]
readCoords l | LBS.length l == 0 = []
             | otherwise = let coordWords = LBS.split ' ' l 
                            in map readCoord coordWords

parseLine :: LBS.ByteString -> Vertex
parseLine line = Vertex (head coords) (coords!!1) (coords!!2) where
    coords = readCoords line 

processLines :: [LBS.ByteString] -> Extent -> Extent
processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

processFile :: String -> IO()
processFile name = do
    putStrLn name
    content <- LBS.readFile name
    let (countLine:recordsLines) = LBS.lines content
    case LBS.readInt countLine of
        Nothing -> putStrLn "Can't read records count"
        Just (recordsCount, _) -> do
                                    print recordsCount
                                    let vert = parseLine (head recordsLines)
                                    let ext = Extent vert vert
                                    print $ processLines recordsLines ext

main :: IO()
main = do
        args <- getArgs
        case args of
            [] -> do
                putStrLn "Missing file path"                    
            xs -> do
                    processFile (head xs)
                    return()

The text file contains lines with three floating-point numbers delimited by space characters. This program always tries to occupy all the free memory on the computer and crashes with an out-of-memory error.

  • Note: I think you have a mistake in addToExtent, see added note in my answer. Commented Apr 11, 2013 at 15:54
  • Thanks, yes it's a mistake. I will fix it. Commented Apr 11, 2013 at 16:03
  • what version of GHC are you using, and how are you compiling? Commented Apr 11, 2013 at 19:30

2 Answers


You are being too lazy. Vertex and Extent have non-strict fields, and all your functions returning a Vertex return

Vertex thunk1 thunk2

without forcing the components to be evaluated. Also addToExtent directly returns an

Extent thunk1 thunk2

without evaluating the components.

Thus none of the ByteStrings is actually released early to be garbage-collected, since the Doubles have not yet been parsed from them.
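A minimal illustration of the fix on the parsing side (a sketch only: it uses String and read instead of the ByteString machinery from the question, so the names and types here are stand-ins):

```haskell
-- Strict fields make the Vertex constructor evaluate its components as
-- soon as the Vertex itself is forced, so the text of a line can be
-- garbage-collected right after that line has been parsed.
data Vertex = Vertex !Double !Double !Double deriving (Eq, Show)

parseLine :: String -> Vertex
parseLine l = case map read (words l) of
  (x:y:z:_) -> Vertex x y z   -- x, y, z forced when the Vertex is
  _         -> Vertex 0 0 0   -- fallback for malformed lines
```

With lazy fields, the same constructor application would store three unevaluated `read` thunks, each holding a reference to the input line.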

Once that is fixed (by making the fields of Vertex and Extent strict, or by having the functions returning a Vertex resp. Extent force all parts of their result), you have the problem that

processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

can't start assembling the result before the end of the list of lines is reached because then

(\x y -> addToExtent y (parseLine x))

is strict in its second argument.

However, barring NaNs and undefined values, if I didn't miss something, the result would be the same if you use a (strict!) left fold, so

processLines strs ext = foldl' (\x y -> addToExtent x (parseLine y)) ext strs

should produce the desired result without holding on to the data if Vertex and Extent get strict fields.
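Putting the two fixes together, this is a compilable sketch of the shape the program could take (the sample vertices and the processVerts name are invented for the demo; line parsing is omitted):

```haskell
import Data.List (foldl')

-- Strict fields: forcing an Extent to WHNF forces the vertices,
-- and forcing a Vertex forces its coordinates.
data Vertex = Vertex !Double !Double !Double deriving (Eq, Show)
data Extent = Extent !Vertex !Vertex deriving (Eq, Show)

addToExtent :: Extent -> Vertex -> Extent
addToExtent (Extent hi lo) v = Extent (cmp max hi v) (cmp min lo v)
  where
    cmp f (Vertex x1 y1 z1) (Vertex x2 y2 z2) =
      Vertex (f x1 x2) (f y1 y2) (f z1 z2)

-- foldl' forces the accumulator at every step, so no chain of
-- addToExtent thunks can build up across the 14 GB of input.
processVerts :: [Vertex] -> Extent -> Extent
processVerts vs ext = foldl' addToExtent ext vs

main :: IO ()
main = print (processVerts [Vertex 1 2 3, Vertex (-1) 5 0]
                           (Extent (Vertex 0 0 0) (Vertex 0 0 0)))
```

Note that foldl' lives in Data.List, so the import is needed; the plain foldl from the Prelude would build up thunks just like foldr does here.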


Ah, I did miss something:

addToExtent ext vert = Extent vertMax vertMin
  where
    (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert)

If that isn't a typo (which I expect it is), fixing it would be somewhat difficult.

I think it should be

    (vertMax, vertMin) = ...

3 Comments

Thanks for the answer; it really solved my problem once I made the data fields strict and used the strict foldl' (I had tried each option separately, but that alone didn't help). But how can I tell when laziness will become a problem? Can you recommend some material to read?
I think Real World Haskell treats laziness vs. strictness to some extent. But it's mostly experience. You learn when laziness is beneficial and when not by experience. And how to fix space leaks (after determining whether they're caused by too much laziness or too much strictness).
I've read that book, but I still don't understand how to use laziness correctly. It seems I need more practice, as you say.

addToExtent is too lazy. A possible alternative definition is

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = vertMax `seq` vertMin `seq` Extent vertMax vertMin where
  (vertMax, vertMin) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
    makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                      (f (vertexY v1) (vertexY v2))
                      (f (vertexZ v1) (vertexZ v2))

data Vertex = 
    Vertex{
     vertexX :: {-# UNPACK #-} !Double,
     vertexY :: {-# UNPACK #-} !Double,
     vertexZ :: {-# UNPACK #-} !Double}
    deriving (Eq, Show, Read)

The problem is that vertMin and vertMax are never evaluated until the entire file is processed, resulting in two huge thunks in the Extent.

I also recommend changing the definition of Extent to

data Extent = 
    Extent{
     extentMax :: !Vertex,
     extentMin :: !Vertex}
    deriving (Eq, Show, Read)

(though with these changes, the seq calls in addToExtent become redundant).
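A small stand-alone demonstration of why the explicit seqs become redundant once the fields are strict (the LazyPair/StrictPair types and the throws helper are invented for this demo):

```haskell
import Control.Exception (SomeException, evaluate, try)

data LazyPair   = LazyPair   Double Double
data StrictPair = StrictPair !Double !Double

-- True if forcing the value to weak head normal form raises an
-- exception, i.e. if a field got evaluated along the way.
throws :: a -> IO Bool
throws x = either hit (const False)
       <$> (try (evaluate (x `seq` ())) :: IO (Either SomeException ()))
  where
    hit :: SomeException -> Bool
    hit _ = True

main :: IO ()
main = do
  lazyThrew   <- throws (LazyPair   undefined 1)  -- field stays a thunk
  strictThrew <- throws (StrictPair undefined 1)  -- field forced at construction
  print (lazyThrew, strictThrew)  -- prints (False,True)
```

With lazy fields the undefined is never touched; with strict fields, merely constructing the value forces it, which is exactly what the hand-written seqs in addToExtent were doing.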

