
When I trained my PyTorch model on a GPU, my Python script was killed out of the blue. Diving into the OS log files, I found that the script had been killed by the OOM killer because the machine ran out of CPU memory. It is very strange: I trained my model on the GPU, yet it was CPU memory that ran out.

[Snapshot of the OOM killer log file]

To debug this issue, I installed the Python memory profiler. Looking at the profiler's output, I found that whenever the column-wise -= operation ran, CPU memory gradually increased until the OOM killer killed my program.

[Snapshot of the Python memory profiler output]

It was very strange, and I tried many ways to solve the issue. Finally, I found that if I detach the tensor before the assignment operation, the problem disappears. Amazingly, that solves the issue, but I don't clearly understand why it works. Here is my original function code (a short sketch of the profiler setup follows after it).

def GeneralizedNabla(self, image):
    pad_size = 2
    # One output channel per offset in the window_size x window_size neighbourhood.
    affinity = torch.zeros(image.shape[0], self.window_size**2, self.h, self.w).to(self.device)
    h = self.h + pad_size
    w = self.w + pad_size
    # self.pad is a zero-padding layer, e.g. nn.ZeroPad2d(pad_size)
    image_pad = self.pad(image)
    for i in range(self.window_size**2):
        affinity[:, i, :, :] = image[:, :, :].detach()  # initialization
        # Offset of the i-th neighbour relative to the centre pixel
        # (the hard-coded 5 and 2 assume window_size == 5, pad_size == 2).
        dy = i // 5 - 2
        dx = i % 5 - 2
        h_start = pad_size + dy
        h_end = h + dy
        w_start = pad_size + dx
        w_end = w + dx
        # Detach so the in-place subtraction does not record autograd history.
        affinity[:, i, :, :] -= image_pad[:, h_start:h_end, w_start:w_end].detach()
    self.Nabla = affinity
    return
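For reference, a minimal sketch of the memory_profiler setup mentioned above (the class and script names are placeholders; only the decorator and the module invocation are standard memory_profiler usage):

from memory_profiler import profile  # pip install memory-profiler

class MyModel:
    # Decorating the method makes memory_profiler print per-line CPU (RSS)
    # usage every time it is called, which is how the -= line shows up as
    # the point where memory keeps growing.
    @profile
    def GeneralizedNabla(self, image):
        ...  # same body as the function above

# Run the script normally, or via: python -m memory_profiler my_script.py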

If anyone has any ideas, I would greatly appreciate it. Thank you.

1 Answer


Previously, when you did not use .detach() on your tensor, you were also accumulating the computation graph; as the loop went on, you kept accumulating more and more history until you exhausted your memory to the point that the process crashed.
When you call detach(), you effectively get the data without the previously entangled history that is needed for computing the gradients.
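A minimal sketch of that effect, separate from the question's code (the tensor names and sizes are illustrative only): an in-place subtraction whose right-hand side still requires grad attaches new autograd nodes to the destination on every iteration, whereas subtracting a detached tensor is a plain data write with no recorded history.

import torch

x = torch.randn(64, 64, requires_grad=True)   # stands in for a network output
acc = torch.zeros(25, 64, 64)

for i in range(25):
    acc[i] -= x            # each iteration appends nodes to the autograd graph
print(acc.grad_fn)         # not None: acc now carries the whole recorded history

acc2 = torch.zeros(25, 64, 64)
for i in range(25):
    acc2[i] -= x.detach()  # plain data write, nothing is recorded
print(acc2.grad_fn)        # None: no history, so nothing accumulates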


3 Comments

That's reasonable, thank you. There is still something unusual, though: I trained my model on a GPU, so the gradients of that previously entangled history should be on the GPU. However, I ran out of CPU memory! Do you have any ideas about this? Thank you.
I hadn't actually noticed that part. If the code is the same for both phases, you should have seen it during training as well, unless the phases are not running the exact same code. Posting an MRE (minimal reproducible example) would be of great help in such situations.
I spent some time surveying documentation, and I found that the data structure of the computation graph is stored in CPU memory, not on the GPU. Hence, your explanation is very clear and straightforward, thank you.
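A quick way to see the distinction discussed in the comments (an illustration assuming a CUDA device is available, not code from the question): the tensor data lives on the GPU, but the grad_fn objects that make up the graph are ordinary host-side objects.

import torch

x = torch.randn(4, 4, device="cuda", requires_grad=True)
y = (x * 2).sum()

print(x.device)    # cuda:0 -> the tensor data is on the GPU
print(y.grad_fn)   # e.g. <SumBackward0 ...> -> a graph node object, not a tensor
# The graph nodes and their bookkeeping live in ordinary host (CPU) process
# memory, even though tensors saved for the backward pass can stay on the GPU.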
