I am running a very simple piece of code that reads txt files and adds their contents to an existing dictionary. With htop I can see the used memory increase linearly until I run out of memory. Here is a simplified version of the code:

import numpy as np

# Load the existing dictionary (a pickled dict saved as a .npy file)
data = np.load(path_dictionary, allow_pickle=True)
dic = data.item()

for ids in dic:
    output = np.loadtxt(filename)   # filename is derived from ids in the real code
    array = output[:, 1]            # second column of the file
    dic[ids][new_info] = array      # new_info is the key for the added data

I tried deleting output and array and calling the garbage collector explicitly inside the loop, but it has not helped:

    del output
    del array
    gc.collect()  # requires `import gc` at the top of the script

I used a function from this post to get the size of the dictionary before and after 100 iterations. The original dictionary is 9 GB, and its reported size increases by only about 13 MB, while according to htop the used memory increases by 10 GB. The script is supposed to read around 70K files.
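The helper from the linked post is not reproduced here; the sketch below is a hypothetical recursive size function of the same kind (the name get_size and all details are my own). Note that for a NumPy view, nbytes reports only the view's own elements, not the base buffer the view keeps alive:

import sys
import numpy as np

def get_size(obj, seen=None):
    """Recursively estimate the memory footprint of obj in bytes."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                 # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    if isinstance(obj, np.ndarray):
        return obj.nbytes               # element buffer only; a view reports its own extent
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(get_size(i, seen) for i in obj)
    return size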

Can someone help me understand what is causing the memory leak, and suggest possible solutions for it?

Comments:

  • 70K files. What is the average size of a file?
  • they are around 20 MB.
  • I am not trying to load them all at the same time. This script is supposed to load one file, extract some data from it, add the data to the dictionary, and go on to load the next file. And I am trying to free the memory after I am finished with each file, before I load the next one.
  • how much is "some data"?
  • it's a 1D array of float64 elements, with lengths varying between 50 and 700 elements.

2 Answers


When you call array = output[:,1], NumPy just creates a view. That means the view keeps a reference to the whole (presumably large) output array, plus the information that array is its second column. When you then store this view in dic, a reference to the whole output still exists, so the garbage collector cannot free its memory.

To work around this issue, simply instruct NumPy to create a copy:

array = output[:,1].copy()

That way array will contain its own copy of the data (which is slower than creating a view), but the point is that once you delete output (either explicitly via del output or by overwriting it in the next iteration), there are no more references to it and the memory will be freed.
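A quick standalone way to confirm the view semantics (a sketch with made-up shapes, not the question's data):

import numpy as np

output = np.zeros((1_000_000, 2))        # stands in for one loaded file
view = output[:, 1]                      # basic slicing returns a view
print(view.base is output)               # True: the view pins output in memory
print(view.nbytes)                       # 8000000: only the view's own elements

copied = output[:, 1].copy()             # independent buffer
print(copied.base is None)               # True: owns its own data
print(np.shares_memory(copied, output))  # False: output can now be freed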



Python’s garbage collector automatically cleans up unused variables, but big containers (e.g. list, dict) can't always be collected as expected. The variable array in your code leads to a memory leak because it creates a new reference to output on every loop iteration, and that reference is then held by dic. So you should make a deep copy of array (instead of keeping the shallow view), delete array in every iteration, and make dic reference the copy:

from copy import deepcopy

for ids in dic:
    output = np.loadtxt(filename)
    array = output[:, 1]
    array_c = deepcopy(array)        # deep copy owns its own buffer
    del array                        # drop the view so output can be freed
    dic[ids][new_info] = array_c
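As a sanity check (a standalone sketch with made-up shapes), deepcopy on a NumPy view behaves like ndarray.copy(): both produce an independent buffer that no longer pins the source array, so output[:, 1].copy() from the first answer achieves the same result in one step:

import numpy as np
from copy import deepcopy

output = np.zeros((1000, 2))
view = output[:, 1]
print(view.base is output)           # True: the view keeps output alive

dc = deepcopy(view)
print(np.shares_memory(dc, output))  # False: independent buffer

c = output[:, 1].copy()              # same effect, one call
print(np.shares_memory(c, output))   # False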

