I am running a very simple piece of code that reads txt files and adds their contents to an existing dictionary. With htop I can see the used memory increase linearly until I run out of memory. Here is a simplified version of the code:

import numpy as np

# Load the existing dictionary (a pickled dict saved as a .npy file)
data = np.load(path_dictionary, allow_pickle=True)
dic = data.item()

for ids in dic:
    output = np.loadtxt(filename)   # filename is derived from ids in the real code
    array = output[:, 1]            # second column of the file
    dic[ids][new_info] = array      # new_info is the key for the added data

I tried deleting output and array and calling the garbage collector explicitly inside the loop, but it has not helped:

    del output
    del array
    gc.collect()  # requires `import gc` at the top of the script

I used a function from this post to get the size of the dictionary before and after 100 iterations. The original dictionary is 9 GB, and its reported size increases by only about 13 MB, while according to htop the used memory increases by 10 GB. The script is supposed to read around 70K files.
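The helper from the linked post is not reproduced here; the sketch below is a hypothetical recursive size function of the same kind (the name get_size and all details are my own). Note that for a NumPy view, nbytes reports only the view's own elements, not the base buffer the view keeps alive:

import sys
import numpy as np

def get_size(obj, seen=None):
    """Recursively estimate the memory footprint of obj in bytes."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                 # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    if isinstance(obj, np.ndarray):
        return obj.nbytes               # element buffer only; a view reports its own extent
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(get_size(i, seen) for i in obj)
    return size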

Can someone help me understand what is causing the memory leak, and suggest possible solutions for it?

Comments:

  • 70K files. What is the average size of a file?
  • they are around 20 MB.
  • I am not trying to load them all at the same time. This script is supposed to load one file, extract some data from it, add the data to the dictionary, and go on to load the next file. And I am trying to free the memory after I am finished with each file, before I load the next one.
  • how much is "some data"?
  • it's a 1D array of float64 elements, with lengths varying between 50 and 700 elements.

2 Answers


When you call array = output[:,1], NumPy just creates a view. That means the view keeps a reference to the whole (presumably large) output array, plus the information that array is its second column. When you then store this view in dic, a reference to the whole output still exists, so the garbage collector cannot free its memory.

To work around this issue, simply instruct NumPy to create a copy:

array = output[:,1].copy()

That way array will contain its own copy of the data (which is slower than creating a view), but the point is that once you delete output (either explicitly via del output or by overwriting it in the next iteration), there are no more references to it and the memory will be freed.
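A quick standalone way to confirm the view semantics (a sketch with made-up shapes, not the question's data):

import numpy as np

output = np.zeros((1_000_000, 2))        # stands in for one loaded file
view = output[:, 1]                      # basic slicing returns a view
print(view.base is output)               # True: the view pins output in memory
print(view.nbytes)                       # 8000000: only the view's own elements

copied = output[:, 1].copy()             # independent buffer
print(copied.base is None)               # True: owns its own data
print(np.shares_memory(copied, output))  # False: output can now be freed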



Python’s garbage collector automatically cleans up unused variables, but big containers (e.g. list, dict) can't always be collected as expected. The variable array in your code leads to a memory leak because it creates a new reference to output on every loop iteration, and that reference is then held by dic. So you should make a deep copy of array (instead of keeping the shallow view), delete array in every iteration, and make dic reference the copy:

from copy import deepcopy

for ids in dic:
    output = np.loadtxt(filename)
    array = output[:, 1]
    array_c = deepcopy(array)        # deep copy owns its own buffer
    del array                        # drop the view so output can be freed
    dic[ids][new_info] = array_c
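As a sanity check (a standalone sketch with made-up shapes), deepcopy on a NumPy view behaves like ndarray.copy(): both produce an independent buffer that no longer pins the source array, so output[:, 1].copy() from the first answer achieves the same result in one step:

import numpy as np
from copy import deepcopy

output = np.zeros((1000, 2))
view = output[:, 1]
print(view.base is output)           # True: the view keeps output alive

dc = deepcopy(view)
print(np.shares_memory(dc, output))  # False: independent buffer

c = output[:, 1].copy()              # same effect, one call
print(np.shares_memory(c, output))   # False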

