
Suppose I have several very large vectors stored on disk. I need to access them one at a time by reading each from its file, which loads it into memory. I run some function on a single vector and then move on to the next one I need. I need each vector to be garbage collected before I load a different one. I'm not sure whether performMajorGC guarantees that a vector is collected if my program later accesses that same vector again through the same function that read it from disk. In that case I would read it into memory again, use it, and then have it collected again. How can I ensure it is garbage collected while reusing the same function for the vector that is read from the same file?

I would appreciate any advice. Thanks.

In response to Daniel Wagner:

    myvec :: Int -> IO (Vector (Vector ByteString))
    myvec x = do let ioy = do y <- Data.ByteString.Lazy.readFile ("data.csv" ++ (show x))
                              guard (isRight (Data.Csv.decode NoHeader y)) 
                              return y
                 yy <- ioy 
                 return (head $ snd $ partitionEithers [Data.Csv.decode NoHeader yy])

    myvecvec :: Vector (IO (Vector (Vector ByteString)))
    myvecvec = generate 100 (\x -> myvec x)

    somefunc1 :: IO (Vector (Vector ByteString)) -> IO ()
    somefunc1 iovv = do vv <- iovv
                        somefunc1x1 vv  -- somefunc1x1 :: Vector (Vector ByteString) -> IO ()

-- same thing for somefunc2 and 3

    oponvec :: IO ()
    oponvec = do somefunc1 (myvecvec ! 0)
                 performGC
                 somefunc2 (myvecvec ! 1)
                 performGC
                 somefunc3 (myvecvec ! 0)
    
  • Reading into memory is an IO action, presumably. A reference to an IO action does not hold a reference to the result produced by that action. So almost certainly performMajorGC is enough. But to be really sure, we need to see some code. Commented Mar 22, 2022 at 17:00
  • @DanielWagner So would the code I wrote in my edit effectively result in garbage collection of each vector between operations, as I intended? Would I even need to use performGC? Can I rely on the garbage collector to collect each vector as I described without explicitly calling performGC? Commented Mar 22, 2022 at 18:11
  • ...yes, nothing of consequence is retained from one line of oponvec to the next. But wow, this code can be improved a lot. Why does ioy deserve a name (as opposed to myvec x = do { y <- readFile ("data.csv" ++ show x); case decode NoHeader y of { Left err -> die (show err); Right v -> return v }})? Why do myvecvec and somefunc1 even exist (as opposed to oponvec = do { myvec 0 >>= somefunc1x1; myvec 1 >>= somefunc1x1; myvec 0 >>= somefunc1x1 })? Commented Mar 22, 2022 at 18:23
  • @DanielWagner Sorry that was very sloppy of me Commented Mar 22, 2022 at 18:27
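
The point made in the comments, that a reference to an IO action does not hold on to the result the action produces, can be illustrated with a minimal sketch that uses only `base`. Here `mkLoader`, `runDemo`, and the run counter are hypothetical stand-ins for `myvec` and the file read, not part of the question's code. Indexing the same action twice runs it twice, so nothing from the first run needs to be kept around.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for myvec: bumps a counter each time it really runs,
-- then returns a small list in place of the parsed file contents.
mkLoader :: IORef Int -> Int -> IO [Int]
mkLoader counter i = do
  modifyIORef' counter (+ 1)
  return [i .. i + 9]

-- Runs the actions at indices 0, 1, 0 and returns how often a loader executed.
runDemo :: IO Int
runDemo = do
  counter <- newIORef 0
  -- A list of IO actions: nothing has been loaded at this point.
  let actions = map (mkLoader counter) [0 .. 2]
  mapM_ (\i -> (actions !! i) >>= print . sum) [0, 1, 0]
  readIORef counter

main :: IO ()
main = runDemo >>= print -- prints 45, 55, 45, then 3
```

The final 3 shows that the action at index 0 was executed again rather than served from a cached result.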

2 Answers

2

You can test this by using a weak pointer as follows:

    import qualified Data.Vector.Unboxed as V
    import System.Mem.Weak
    import System.Mem

    main :: IO ()
    main = do
      let xs = V.fromList [1..1000000 :: Int]
      -- A weak pointer does not keep xs alive on its own.
      wkp <- mkWeakPtr xs Nothing
      performGC
      -- Nothing here means the key has been collected.
      xs' <- deRefWeak wkp
      print xs'

On my system this prints Nothing, which means that the vector has been deallocated. However, I don't know whether GHC guarantees that this happens.
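
As a hedged variation on the same test, the sketch below attaches the weak pointer to an `IORef` with `mkWeakIORef` from `Data.IORef`, since the `System.Mem.Weak` documentation warns that weak pointers keyed on ordinary values can be affected by compiler optimisations. The helper `aliveAfterGC` is a hypothetical name introduced here. The first check is made while the reference is still used later, so it should report `True`; the second check is expected to report `False`, but that depends on the runtime and is not guaranteed.

```haskell
import Data.IORef (mkWeakIORef, newIORef, readIORef)
import System.Mem (performMajorGC)
import System.Mem.Weak (Weak, deRefWeak)

-- Hypothetical helper: True if the weak pointer's key survives a major GC.
aliveAfterGC :: Weak a -> IO Bool
aliveAfterGC wk = do
  performMajorGC
  maybe False (const True) <$> deRefWeak wk

main :: IO ()
main = do
  ref <- newIORef (replicate 1000000 (0 :: Int))
  wk <- mkWeakIORef ref (return ()) -- finalizer does nothing
  a1 <- aliveAfterGC wk
  putStrLn ("while still used later: alive = " ++ show a1)
  readIORef ref >>= print . length -- last use of ref
  a2 <- aliveAfterGC wk
  putStrLn ("after last use: alive = " ++ show a2)
```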

Here's a program which checks @amalloy's suggestion:

    import qualified Data.Vector.Unboxed as V
    import Control.Monad
    import Data.Word

    -- NOINLINE keeps newLarge from being inlined into main.
    {-# NOINLINE newLarge #-}
    newLarge :: Word8 -> V.Vector Word8
    newLarge n = V.replicate 5000000000 n -- 5GB

    main :: IO ()
    -- Allocate and sum ten 5GB vectors, one after another.
    main = forM_ [1..10] $ \i -> print (V.sum (newLarge i))

This uses exactly 5GB on my machine, which shows that there are never two large vectors allocated at the same time.
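
A similar observation can be made from inside the program with `GHC.Stats`. The following is a minimal sketch using only `base`, under the assumption that the program is run with `+RTS -T` so that RTS statistics are available. `loadData` is a hypothetical stand-in for reading one large vector from disk, and `liveBytes` is a helper name introduced here. If nothing is retained between iterations, the reported live bytes should stay roughly flat rather than grow.

```haskell
import Control.Exception (evaluate)
import Control.Monad (forM_)
import Data.Word (Word64)
import GHC.Stats (gc, gcdetails_live_bytes, getRTSStats, getRTSStatsEnabled)
import System.Mem (performMajorGC)

-- Live heap bytes as reported for a major GC requested right now.
-- Only call this when RTS statistics are enabled.
liveBytes :: IO Word64
liveBytes = do
  performMajorGC
  gcdetails_live_bytes . gc <$> getRTSStats

-- Hypothetical stand-in for reading one large vector from disk.
loadData :: Int -> IO [Int]
loadData i = do
  let xs = [i .. i + 1000000]
  _ <- evaluate (length xs) -- walk the list so it is materialised
  return xs

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "Run with +RTS -T to enable RTS statistics."
    else forM_ [0, 1, 0] $ \i -> do
      xs <- loadData i
      print (sum xs) -- last use of xs in this iteration
      n <- liveBytes
      putStrLn ("live bytes after use: " ++ show n)
```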


2

I need to be able to instruct each vector in memory to be garbage collected every time I need to access a different vector.

Do you? Why? If it's simply because they're large and you're worried about fitting the vector in memory, then don't worry about it. If memory space is needed, and the object is unreachable, then garbage collection will pick it up. If memory space is not needed, you don't need to do anything. And if the object is reachable, running the GC won't help. So there are no cases where manual intervention in GC will do any good.

And if you want to GC it for some other reason than freeing up memory, you need to explain that in the question, because that goal will surely affect answers.

2 Comments

Please see my comment to Daniel Wagner and my edit in response to his question. I use the swap partition very often, so disabling swap for programs that exceed my memory, in order to force the GC to collect without using swap, isn't really an option for me as far as I've been able to tell. Setting +RTS -M<size> doesn't seem to prevent swap access.
The -1 is from me: that's a comment, not an answer. And I disagree, this sounds like a scenario where it could be important to deterministically GC an array before allocating another one. The standard GC isn't instantaneous, and it's quite possible to allocate way more memory than is actually needed at any given time.
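
For the scenario raised in the last comment, where a collection should be requested before the next large allocation, one possible pattern is sketched below using only `base`. `processThenCollect` is a hypothetical helper introduced here, not something from the question or answers. It assumes the processing step returns a small result that does not lazily refer back to the loaded data; `performMajorGC` requests a collection, but whether memory is returned to the operating system is up to the runtime.

```haskell
import Control.Exception (evaluate)
import System.Mem (performMajorGC)

-- Hypothetical helper: load a value, reduce it to a result, then request a
-- major GC before handing the result back.
processThenCollect :: IO a -> (a -> IO r) -> IO r
processThenCollect load use = do
  r <- load >>= use
  -- Force the result to weak head normal form; for an Int this is the full
  -- value, so no thunk over the loaded data remains.
  _ <- evaluate r
  performMajorGC
  return r

main :: IO ()
main = do
  -- The list stands in for a large vector read from disk.
  s <- processThenCollect (return [1 .. 1000000 :: Int]) (return . sum)
  print s -- prints 500000500000
```

If the result type is a lazy structure rather than a plain number, forcing it only to weak head normal form may not be enough to release the loaded data.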
