My code is fairly high-level: model training and the related steps are wrapped by the pipeline_network class. My main goal is to train a new model for every fold.

for train_idx, valid_idx in cv.split(meta_train[DEPTH_COLUMN].values.reshape(-1)):

    meta_train_split, meta_valid_split = meta_train.iloc[train_idx], meta_train.iloc[valid_idx]

    pipeline_network = unet(config=CONFIG, suffix='fold' + str(fold), train_mode=True)

But when I move on to the 2nd fold, everything fails with a GPU out-of-memory error:

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

At the end of the epoch I tried to delete that pipeline manually, with no luck:

def clean_object_from_memory(obj):  # definition
    del obj
    gc.collect()
    torch.cuda.empty_cache()

clean_object_from_memory(pipeline_network)  # calling

Calling this didn't help either:

def dump_tensors(gpu_only=True):
        torch.cuda.empty_cache()
        total_size = 0
        for obj in gc.get_objects():
            try:
                if torch.is_tensor(obj):
                    if not gpu_only or obj.is_cuda:
                        del obj
                        gc.collect()
                elif hasattr(obj, "data") and torch.is_tensor(obj.data):
                    if not gpu_only or obj.is_cuda:
                        del obj
                        gc.collect()
            except Exception as e:
                pass

How can I reset PyTorch when I move on to the next fold?

1 Answer

Try deleting the object with del and then calling torch.cuda.empty_cache(). Once nothing references the object's tensors any more, this call releases the cached GPU memory that is no longer in use.
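A minimal sketch of that pattern in a per-fold loop might look like the following. Note that train_one_fold is a hypothetical stand-in for building and training the unet pipeline from the question, and the tiny nn.Linear model is only there to keep the example runnable; neither comes from the original code. The weakref probe is also just an illustration, used to check that the model object is really gone before the cache is emptied.

```python
import gc
import weakref

import torch
import torch.nn as nn


def train_one_fold(fold):
    """Hypothetical stand-in for building and training one fold's model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(8, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(16, 8, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    return model, optimizer


freed = []
for fold in range(3):
    model, optimizer = train_one_fold(fold)
    probe = weakref.ref(model)  # lets us check that the model is released

    # Delete the names in the scope that owns them. A `del obj` inside a
    # helper function only removes that helper's local name, so the caller's
    # reference would keep the object (and its GPU tensors) alive.
    del model, optimizer
    gc.collect()  # collect any leftover reference cycles
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver

    freed.append(probe() is None)

print(freed)
```

If the memory still does not go down, something else is probably holding a reference, for example a stored loss tensor, a logger, or a callback list, and that reference has to be dropped as well before empty_cache() can release anything.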

1 Comment

I suggested that step as well. But you're right, this is the main step.
