
I'm trying to run a Mask R-CNN model with aerial imagery. To optimise this, I run everything with CUDA. But this creates a few errors. Here is my code:

# Python
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()

If I run this code on the GPU, at some point I get this error:

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.

And if I run it on the CPU, I get this error:

[error] Disposing session as kernel process died ExitCode: 3221225477, Reason:
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.

I have run into CUDA memory problems with this code before, and this seems related. What are these frozen modules, and is it safe to turn them off? I also tried to enable TORCH_USE_CUDA_DSA by adding this to my code:

os.environ["TORCH_USE_CUDA_DSA"] = "1"

But that didn't solve it. I also had one run where I didn't encounter any of these problems and the code ran smoothly (on the GPU).

1 Answer

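For background on device-side asserts and errors, see the question What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean? The real error here appears to be the illegal memory access, which can sometimes be a symptom of running out of memory on the GPU. Note also that the message says to compile with TORCH_USE_CUDA_DSA, which suggests it is a build-time option for PyTorch rather than something an environment variable set at runtime would switch on.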

As for the CPU case, it might also be failing because the machine runs out of RAM. In this issue, the user fixed the same kernel crash by running on a system with more RAM: https://github.com/microsoft/vscode-jupyter/issues/13678.

I would try running the above code with a much smaller model and see whether that produces a different kind of error. Or, if this is in Colab, try increasing the CPU/GPU memory available.
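To get a rough feel for whether memory is a plausible culprit, the raw size of an input batch can be estimated from its shape. This is only a back-of-the-envelope sketch: `batch_megabytes` is a hypothetical helper, it assumes 4-byte float32 values, the 10000x10000 image size is an invented example, and the estimate ignores the model's weights and activations, which usually take far more memory than the inputs.

```python
def batch_megabytes(n_images, channels, height, width, bytes_per_value=4):
    """Size in MB (10**6 bytes) of a dense image batch; float32 by default."""
    return n_images * channels * height * width * bytes_per_value / 1e6


# Two 800x800 RGB tiles.
print(batch_megabytes(2, 3, 800, 800))
# Two uncropped 10000x10000 RGB images (hypothetical aerial scene size).
print(batch_megabytes(2, 3, 10000, 10000))
```

If the second kind of number is closer to what the dataset actually loads, tiling or downscaling the images before they reach the model is worth trying.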

Edit: The problem is likely with the dataset. In a Google Colaboratory notebook the code works on both CPU and GPU. The original code was missing an import os line, and the dataset was not available, so I created a fake, randomized dataset. See the code below:

# Python
import os
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 1
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for i in range(10):
        # Synthetic stand-in for one batch of: images, height, targets, names in train_ds
        n_samples = 2
        images = torch.rand(n_samples, 3, 800, 800).to(device)
        height = 800
        targets = [{'boxes': torch.tensor([[0, 0, 800, 800]]), 'labels': torch.tensor([1]), 'masks': torch.rand(1, 800, 800).to(device)}] * n_samples
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()
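
Since the synthetic data runs cleanly, one way to narrow the problem down is to check every target from the real dataset on the CPU before training. The following is a minimal sketch, not a definitive check: `find_target_problems` is a hypothetical helper, and it assumes the targets use the `boxes`, `labels` and `masks` keys shown above, with boxes as `[x1, y1, x2, y2]` pixel coordinates and labels in the range `[0, num_classes)`.

```python
import math


def _to_list(values):
    """Convert a tensor/array (anything with .tolist()) or a sequence to a list."""
    return values.tolist() if hasattr(values, "tolist") else list(values)


def find_target_problems(target, image_height, image_width, num_classes):
    """Return a list of human-readable problems found in one target dict.

    Hypothetical helper: assumes 'boxes' holds [x1, y1, x2, y2] rows and
    'labels' holds one integer class id per box.
    """
    problems = []
    boxes = _to_list(target["boxes"])
    labels = _to_list(target["labels"])

    if len(boxes) != len(labels):
        problems.append(f"{len(boxes)} boxes but {len(labels)} labels")
    if "masks" in target and len(target["masks"]) != len(boxes):
        problems.append(f"{len(target['masks'])} masks but {len(boxes)} boxes")

    for i, box in enumerate(boxes):
        if len(box) != 4:
            problems.append(f"box {i} has {len(box)} values, expected 4")
            continue
        x1, y1, x2, y2 = box
        if not all(math.isfinite(v) for v in box):
            problems.append(f"box {i} contains NaN or inf")
        elif x2 <= x1 or y2 <= y1:
            problems.append(f"box {i} has non-positive width or height")
        elif x1 < 0 or y1 < 0 or x2 > image_width or y2 > image_height:
            problems.append(f"box {i} lies outside the image")

    for i, label in enumerate(labels):
        if not 0 <= label < num_classes:
            problems.append(f"label {i} is {label}, outside [0, {num_classes})")

    return problems


# Plain lists are used here; tensors work too because of _to_list().
good = {"boxes": [[0, 0, 800, 800]], "labels": [1]}
bad = {"boxes": [[100, 100, 50, 300]], "labels": [91]}
print(find_target_problems(good, 800, 800, 91))
print(find_target_problems(bad, 800, 800, 91))
```

Running this over every sample in `train_ds` and printing the sample name whenever the returned list is non-empty should show whether a particular image or annotation is responsible for the crash.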

2 Comments

Heyya, I did what you suggested and some more, but it doesn't seem to fix any of the issues. What else could I try?
Without the dataset I'm not sure I can investigate further. When I run the code as is with minor modifications, it appears to work. See the updated comment with code that works.
