
When I trained my PyTorch model on a GPU, my Python script was killed out of the blue. Diving into the OS log files, I found that the script had been killed by the OOM killer because the machine ran out of CPU memory. It is very strange: I trained my model on the GPU, yet it was CPU memory that ran out.

[Snapshot of the OOM killer log file]

To debug this issue, I installed the Python memory profiler. Looking at the profiler's output, I found that whenever the column-wise -= operation ran, CPU memory gradually increased until the OOM killer killed my program.

[Snapshot of the Python memory profiler output]

It was very strange, and I tried many ways to solve the issue. Finally, I found that if I detach the tensor before the assignment operation, the problem disappears. Amazingly, that solves the issue, but I don't clearly understand why it works. Here is my original function code (a short sketch of the profiler setup follows after it).

def GeneralizedNabla(self, image):
    pad_size = 2
    # One output channel per offset in the window_size x window_size neighbourhood.
    affinity = torch.zeros(image.shape[0], self.window_size**2, self.h, self.w).to(self.device)
    h = self.h + pad_size
    w = self.w + pad_size
    # self.pad is a zero-padding layer, e.g. nn.ZeroPad2d(pad_size)
    image_pad = self.pad(image)
    for i in range(self.window_size**2):
        affinity[:, i, :, :] = image[:, :, :].detach()  # initialization
        # Offset of the i-th neighbour relative to the centre pixel
        # (the hard-coded 5 and 2 assume window_size == 5, pad_size == 2).
        dy = i // 5 - 2
        dx = i % 5 - 2
        h_start = pad_size + dy
        h_end = h + dy
        w_start = pad_size + dx
        w_end = w + dx
        # Detach so the in-place subtraction does not record autograd history.
        affinity[:, i, :, :] -= image_pad[:, h_start:h_end, w_start:w_end].detach()
    self.Nabla = affinity
    return
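For reference, a minimal sketch of the memory_profiler setup mentioned above (the class and script names are placeholders; only the decorator and the module invocation are standard memory_profiler usage):

from memory_profiler import profile  # pip install memory-profiler

class MyModel:
    # Decorating the method makes memory_profiler print per-line CPU (RSS)
    # usage every time it is called, which is how the -= line shows up as
    # the point where memory keeps growing.
    @profile
    def GeneralizedNabla(self, image):
        ...  # same body as the function above

# Run the script normally, or via: python -m memory_profiler my_script.py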

If anyone has any ideas, I would greatly appreciate it. Thank you.

1 Answer


Previously, when you did not use .detach() on your tensor, you were also accumulating the computation graph; as the loop went on, you kept accumulating more and more history until you exhausted your memory to the point that the process crashed.
When you call detach(), you effectively get the data without the previously entangled history that is needed for computing the gradients.
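A minimal sketch of that effect, separate from the question's code (the tensor names and sizes are illustrative only): an in-place subtraction whose right-hand side still requires grad attaches new autograd nodes to the destination on every iteration, whereas subtracting a detached tensor is a plain data write with no recorded history.

import torch

x = torch.randn(64, 64, requires_grad=True)   # stands in for a network output
acc = torch.zeros(25, 64, 64)

for i in range(25):
    acc[i] -= x            # each iteration appends nodes to the autograd graph
print(acc.grad_fn)         # not None: acc now carries the whole recorded history

acc2 = torch.zeros(25, 64, 64)
for i in range(25):
    acc2[i] -= x.detach()  # plain data write, nothing is recorded
print(acc2.grad_fn)        # None: no history, so nothing accumulates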


3 Comments

That's reasonable, thank you. There is still something unusual, though: I trained my model on a GPU, so the gradients of that previously entangled history should be on the GPU. However, I ran out of CPU memory! Do you have any ideas about this? Thank you.
I hadn't actually noticed that part. If the code is the same for both phases, you should have seen it during training as well, unless the phases are not running the exact same code. Posting an MRE (minimal reproducible example) would be of great help in such situations.
I spent some time surveying documentation, and I found that the data structure of the computation graph is stored in CPU memory, not on the GPU. Hence, your explanation is very clear and straightforward, thank you.
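A quick way to see the distinction discussed in the comments (an illustration assuming a CUDA device is available, not code from the question): the tensor data lives on the GPU, but the grad_fn objects that make up the graph are ordinary host-side objects.

import torch

x = torch.randn(4, 4, device="cuda", requires_grad=True)
y = (x * 2).sum()

print(x.device)    # cuda:0 -> the tensor data is on the GPU
print(y.grad_fn)   # e.g. <SumBackward0 ...> -> a graph node object, not a tensor
# The graph nodes and their bookkeeping live in ordinary host (CPU) process
# memory, even though tensors saved for the backward pass can stay on the GPU.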
