What do these TORCH_USE_CUDA_DSA and frozen_modules errors mean and how to fix them?

Question

I'm trying to run a Mask R-CNN model with aerial imagery. To optimise this, I run everything with CUDA. But this creates a few errors. Here is my code:

# Python
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()

If I run this code on the GPU, I will at some point get this error:

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.

And if I run it on the CPU, I get this error:

[error] Disposing session as kernel process died ExitCode: 3221225477, Reason:
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.

I have encountered some CUDA memory problems before with this code, and this seems related. What are these frozen modules, and is it safe to turn them off? Also, I tried to enable TORCH_USE_CUDA_DSA in my code by adding this:

os.environ["TORCH_USE_CUDA_DSA"] = "1"

But that didn't solve it. Also, I had one run where I didn't encounter any of these problems and the code ran smoothly (on the GPU).

Peter Chatain · Accepted Answer · 2024-05-16 13:18:29Z

Here is a link explaining what device-side asserts and errors are: What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean? It looks like the real error here is an illegal memory access, which sometimes happens when the GPU runs out of memory.
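On the attempt to set TORCH_USE_CUDA_DSA through os.environ: the error text says "Compile with", and to my knowledge this is an option read when PyTorch itself is built, so exporting it at runtime is not expected to change a prebuilt install. What can be set at runtime is CUDA_LAUNCH_BLOCKING, which the question already uses. A minimal sketch, assuming the variable has to be in place before the first CUDA call (putting it ahead of `import torch` is the cautious placement):

```python
import os

# Request synchronous kernel launches so the Python traceback points at the
# call that actually failed. Set this before any CUDA work starts; the safest
# spot is ahead of `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Assumption: TORCH_USE_CUDA_DSA is a build-time option, so it is deliberately
# not set here.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

With synchronous launches the reported line is more trustworthy, which helps tell an out-of-memory condition apart from bad indices in the data.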

As for the CPU case, this might also be failing because the machine runs out of RAM. In this issue, the user fixed it by running on a system with more RAM: https://github.com/microsoft/vscode-jupyter/issues/13678.
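The frozen-modules lines in the CPU log are a debugger warning (the log itself says "Debugging will proceed"); the crash is signalled by the exit code. A small sketch that decodes it, assuming the notebook ran on Windows, where 0xC0000005 is the access-violation status:

```python
# Exit code reported by the Jupyter kernel in the question.
exit_code = 3221225477

# Windows status codes are usually documented in hexadecimal.
exit_code_hex = f"0x{exit_code:08X}"
print(exit_code_hex)  # 0xC0000005
```

An access violation means the process touched memory it should not have, which fits either memory exhaustion or invalid indices in the data. It does not point at the frozen modules.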

I would try running the above code with a much smaller model and see whether that produces a different type of error. Or, if this is in Colab, try increasing the CPU/GPU memory available.
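To see whether memory is really the limit, it can help to print GPU memory figures every few iterations. A minimal sketch using PyTorch's CUDA memory queries; `gpu_memory_report` is a hypothetical helper name, and it returns None on a machine without CUDA:

```python
import torch


def gpu_memory_report(device=None):
    """Hypothetical helper: memory figures in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    free, total = torch.cuda.mem_get_info(device)
    return {
        "allocated": torch.cuda.memory_allocated(device),  # held by tensors now
        "peak": torch.cuda.max_memory_allocated(device),   # high-water mark
        "free": free,                                       # free on the device
        "total": total,                                     # device capacity
    }


report = gpu_memory_report()
print(report)
```

If "peak" stays well below "total" right up to the crash, running out of memory becomes a less likely explanation.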

Edit: The problem is likely with the dataset. In a Google Colaboratory notebook, the code works on both CPU and GPU. The import os line was missing, and I did not have the dataset, so I created a fake, randomized one. See the code here:

# Python
import os
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 1
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for i in range(10):
        # Random stand-in for: for images, height, targets, names in train_ds
        n_samples = 2
        images = torch.rand(n_samples, 3, 800, 800).to(device)
        height = 800
        targets = [{'boxes': torch.tensor([[0, 0, 800, 800]]), 'labels': torch.tensor([1]), 'masks': torch.rand(1, 800, 800).to(device)}] * n_samples
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()
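
Since the loop runs with synthetic data, one way to narrow down a dataset problem is to check every target before it reaches the model. A minimal sketch, assuming the targets follow the torchvision detection format (boxes as (N, 4) in x1, y1, x2, y2 order, labels in [0, num_classes), masks as (N, H, W)). `find_target_problems` is a hypothetical helper, not part of torchvision, and the checks are guesses at what could be wrong in the real dataset:

```python
import torch


def find_target_problems(target, num_classes, height, width):
    """Hypothetical helper: list problems found in one detection target dict."""
    problems = []
    boxes, labels = target["boxes"], target["labels"]

    if boxes.ndim != 2 or boxes.shape[-1] != 4:
        return [f"boxes has shape {tuple(boxes.shape)}, expected (N, 4)"]

    boxes = boxes.float()
    if not torch.isfinite(boxes).all():
        problems.append("boxes contain NaN or inf")
    # Boxes are (x1, y1, x2, y2); zero or negative extent is degenerate.
    if ((boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])).any():
        problems.append("boxes with non-positive width or height")
    if ((boxes < 0).any() or (boxes[:, [0, 2]] > width).any()
            or (boxes[:, [1, 3]] > height).any()):
        problems.append("boxes outside the image")

    if labels.shape[0] != boxes.shape[0]:
        problems.append("labels and boxes differ in length")
    if labels.numel() > 0 and (int(labels.min()) < 0
                               or int(labels.max()) >= num_classes):
        problems.append(f"labels outside [0, {num_classes})")

    masks = target.get("masks")
    if masks is not None and tuple(masks.shape) != (boxes.shape[0], height, width):
        problems.append(f"masks has shape {tuple(masks.shape)}")
    return problems


good = {"boxes": torch.tensor([[0.0, 0.0, 800.0, 800.0]]),
        "labels": torch.tensor([1]),
        "masks": torch.zeros(1, 800, 800, dtype=torch.uint8)}
bad = {"boxes": torch.tensor([[10.0, 10.0, 10.0, 20.0]]),  # zero width
       "labels": torch.tensor([91]),                       # out of range for 91 classes
       "masks": torch.zeros(1, 800, 800, dtype=torch.uint8)}

good_problems = find_target_problems(good, 91, 800, 800)
bad_problems = find_target_problems(bad, 91, 800, 800)
print(good_problems)
print(bad_problems)
```

Running such a check over train_ds on the CPU, before any training step, would show which sample (if any) carries a label or box the model cannot index safely.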

Heyya, I did what you suggested and some more, but this doesn't seem to fix any of the issues. What else could I try?
Without the dataset I'm not sure I can investigate further. When I run the code as is with minor modifications, it appears to work. See the updated answer with code that works.
