
I'm trying to write a Haskell program to parse a huge text file (about 14 GB), but I can't understand how to make it free unused data from memory, or how to avoid a stack overflow during the foldr. Here is the program source:

import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.ByteString.Lex.Lazy.Double as BD
import System.Environment


data Vertex = 
    Vertex{
     vertexX :: Double,
     vertexY :: Double,
     vertexZ :: Double}
    deriving (Eq, Show, Read)

data Extent = 
    Extent{
     extentMax :: Vertex,
     extentMin :: Vertex}
    deriving (Eq, Show, Read)

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = Extent vertMax vertMin where
                        (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
                            makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                                                        (f (vertexY v1) (vertexY v2))
                                                        (f (vertexZ v1) (vertexZ v2))

readCoord :: LBS.ByteString -> Double
readCoord l = case BD.readDouble l of
                Nothing -> 0
                Just (value, _) -> value

readCoords :: LBS.ByteString -> [Double]
readCoords l | LBS.length l == 0 = []
             | otherwise = let coordWords = LBS.split ' ' l 
                            in map readCoord coordWords

parseLine :: LBS.ByteString -> Vertex
parseLine line = Vertex (head coords) (coords!!1) (coords!!2) where
    coords = readCoords line 

processLines :: [LBS.ByteString] -> Extent -> Extent
processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

processFile :: String -> IO()
processFile name = do
    putStrLn name
    content <- LBS.readFile name
    let (countLine:recordsLines) = LBS.lines content
    case LBS.readInt countLine of
        Nothing -> putStrLn "Can't read records count"
        Just (recordsCount, _) -> do
                                    print recordsCount
                                    let vert = parseLine (head recordsLines)
                                    let ext = Extent vert vert
                                    print $ processLines recordsLines ext

main :: IO()
main = do
        args <- getArgs
        case args of
            [] -> do
                putStrLn "Missing file path"                    
            xs -> do
                    processFile (head xs)
                    return()

The text file contains lines with three floating-point numbers delimited by space characters. This program always tries to occupy all the free memory on the computer and crashes with an out-of-memory error.

  • Note: I think you have a mistake in addToExtent, see added note in my answer. Commented Apr 11, 2013 at 15:54
  • Thanks, yes it's a mistake. I will fix it. Commented Apr 11, 2013 at 16:03
  • what version of GHC are you using, and how are you compiling? Commented Apr 11, 2013 at 19:30

2 Answers


You are being too lazy. Vertex and Extent have non-strict fields, and all your functions returning a Vertex return

Vertex thunk1 thunk2

without forcing the components to be evaluated. Also addToExtent directly returns an

Extent thunk1 thunk2

without evaluating the components.

Thus none of the ByteStrings is actually released early to be garbage-collected, since the Doubles have not yet been parsed from them.
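A minimal illustration of the fix on the parsing side (a sketch only: it uses String and read instead of the ByteString machinery from the question, so the names and types here are stand-ins):

```haskell
-- Strict fields make the Vertex constructor evaluate its components as
-- soon as the Vertex itself is forced, so the text of a line can be
-- garbage-collected right after that line has been parsed.
data Vertex = Vertex !Double !Double !Double deriving (Eq, Show)

parseLine :: String -> Vertex
parseLine l = case map read (words l) of
  (x:y:z:_) -> Vertex x y z   -- x, y, z forced when the Vertex is
  _         -> Vertex 0 0 0   -- fallback for malformed lines
```

With lazy fields, the same constructor application would store three unevaluated `read` thunks, each holding a reference to the input line.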

Once that is fixed (by making the fields of Vertex and Extent strict, or by having the functions returning a Vertex resp. Extent force all parts of their result), you have the problem that

processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs

can't start assembling the result before the end of the list of lines is reached because then

(\x y -> addToExtent y (parseLine x))

is strict in its second argument.

However, barring NaNs and undefined values, if I didn't miss something, the result would be the same if you use a (strict!) left fold, so

processLines strs ext = foldl' (\x y -> addToExtent x (parseLine y)) ext strs

should produce the desired result without holding on to the data if Vertex and Extent get strict fields.
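Putting the two fixes together, this is a compilable sketch of the shape the program could take (the sample vertices and the processVerts name are invented for the demo; line parsing is omitted):

```haskell
import Data.List (foldl')

-- Strict fields: forcing an Extent to WHNF forces the vertices,
-- and forcing a Vertex forces its coordinates.
data Vertex = Vertex !Double !Double !Double deriving (Eq, Show)
data Extent = Extent !Vertex !Vertex deriving (Eq, Show)

addToExtent :: Extent -> Vertex -> Extent
addToExtent (Extent hi lo) v = Extent (cmp max hi v) (cmp min lo v)
  where
    cmp f (Vertex x1 y1 z1) (Vertex x2 y2 z2) =
      Vertex (f x1 x2) (f y1 y2) (f z1 z2)

-- foldl' forces the accumulator at every step, so no chain of
-- addToExtent thunks can build up across the 14 GB of input.
processVerts :: [Vertex] -> Extent -> Extent
processVerts vs ext = foldl' addToExtent ext vs

main :: IO ()
main = print (processVerts [Vertex 1 2 3, Vertex (-1) 5 0]
                           (Extent (Vertex 0 0 0) (Vertex 0 0 0)))
```

Note that foldl' lives in Data.List, so the import is needed; the plain foldl from the Prelude would build up thunks just like foldr does here.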


Ah, I did miss something:

addToExtent ext vert = Extent vertMax vertMin
  where
    (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert)

If that isn't a typo (which I expect it is), fixing it would be somewhat difficult.

I think it should be

    (vertMax, vertMin) = ...

3 Comments

Thanks for the answer; it really solved my problem once I made the data fields strict and used the strict foldl' (I had tried each option separately, but that alone didn't help). But how can I tell when laziness will become a problem? Can you recommend some material to read?
I think Real World Haskell treats laziness vs. strictness to some extent. But it's mostly experience. You learn when laziness is beneficial and when not by experience. And how to fix space leaks (after determining whether they're caused by too much laziness or too much strictness).
I've read that book, but I still don't understand how to use laziness correctly. It seems I need more practice, as you say.

addToExtent is too lazy. A possible alternative definition is

addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = vertMax `seq` vertMin `seq` Extent vertMax vertMin where
  (vertMax, vertMin) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
    makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                      (f (vertexY v1) (vertexY v2))
                      (f (vertexZ v1) (vertexZ v2))

data Vertex = 
    Vertex{
     vertexX :: {-# UNPACK #-} !Double,
     vertexY :: {-# UNPACK #-} !Double,
     vertexZ :: {-# UNPACK #-} !Double}
    deriving (Eq, Show, Read)

The problem is that vertMin and vertMax are never evaluated until the entire file is processed, resulting in two huge thunks in the Extent.

I also recommend changing the definition of Extent to

data Extent = 
    Extent{
     extentMax :: !Vertex,
     extentMin :: !Vertex}
    deriving (Eq, Show, Read)

(though with these changes, the seq calls in addToExtent become redundant).
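A small stand-alone demonstration of why the explicit seqs become redundant once the fields are strict (the LazyPair/StrictPair types and the throws helper are invented for this demo):

```haskell
import Control.Exception (SomeException, evaluate, try)

data LazyPair   = LazyPair   Double Double
data StrictPair = StrictPair !Double !Double

-- True if forcing the value to weak head normal form raises an
-- exception, i.e. if a field got evaluated along the way.
throws :: a -> IO Bool
throws x = either hit (const False)
       <$> (try (evaluate (x `seq` ())) :: IO (Either SomeException ()))
  where
    hit :: SomeException -> Bool
    hit _ = True

main :: IO ()
main = do
  lazyThrew   <- throws (LazyPair   undefined 1)  -- field stays a thunk
  strictThrew <- throws (StrictPair undefined 1)  -- field forced at construction
  print (lazyThrew, strictThrew)  -- prints (False,True)
```

With lazy fields the undefined is never touched; with strict fields, merely constructing the value forces it, which is exactly what the hand-written seqs in addToExtent were doing.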

