DDP and CUDA graph in PyTorch

Question

This is my code, which I am currently running on 4 GPUs:

setup(rank, gpus)


dataset = RandomDataset(input_shape, 80*batch_size, rank)

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
data_iter = iter(dataloader)


model = model(pretrained=True).to(rank)

optimizer = optim.SGD(model.parameters(), lr=0.0001)
criterion = torch.nn.CrossEntropyLoss()

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s):
    print("[MAKING DDP Model]")
    model = DDP(model)
    print("[MODEL CREATED]")

    for i in range(11):
        optimizer.zero_grad(set_to_none=True)
        inputs, labels = next(data_iter)
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

capture_input = torch.empty((batch_size, 3, input_shape, input_shape)).to(rank)
capture_target = torch.argmax(torch.from_numpy(np.eye(1000)[np.random.choice(1000, batch_size)]), axis=1).to(rank)


g = torch.cuda.CUDAGraph()

optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    capture_y_pred = model(capture_input)
    capture_loss = criterion(capture_y_pred, capture_target)
    capture_loss.backward()
optimizer.step()


print("RECORDED")

for i in range(20):
    inputs, label = next(data_iter)
    capture_input.copy_(inputs)
    capture_target.copy_(label)
    g.replay()
    optimizer.step()


print("DATASET DONE")

RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Does anyone know how to solve this problem?

I ran into similar issues with DDP training and CUDA graphs. The problem I ran into was due to this line: github.com/pytorch/pytorch/blob/…. It looks like this operation is not supported during stream capture. At this point I'm not sure whether CUDA graphs support DDP training, and I couldn't find any official documentation on this either.
– shreyas42, Commented Mar 30, 2023 at 8:01
discuss.pytorch.org/t/… According to this, you need to do 11 warmup iterations, which you are doing, so I guess your problem is not the same as mine.
– shreyas42, Commented Mar 30, 2023 at 9:31
But please try the other suggestions in the answer. The doc has a list of steps that are required for DDP + CUDA graphs.
– shreyas42, Commented Mar 30, 2023 at 9:42

kshishkin · Accepted Answer · 2023-09-16 18:59:28Z

According to the official PyTorch documentation on CUDA graphs, the DDP model should be constructed in a side-stream context before performing full-backward capture. The DDP wrapper can therefore be created before the warmup step. For instance:

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
torch.cuda.current_stream().wait_stream(s)
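
For context, here is a fuller sketch of how the pieces might fit together for whole-network capture. It is not the asker's exact code: `configure_nccl_env`, `train_with_graph`, `WARMUP_ITERS`, and the `batches` argument are hypothetical names chosen for illustration. It assumes NCCL >= 2.9.6, one GPU per rank, a launcher that sets `MASTER_ADDR` and `MASTER_PORT`, and batches of identical shape. The environment variable name is the one used in the CUDA graphs notes and may differ in newer PyTorch releases, so check the documentation for your version.

```python
import os

WARMUP_ITERS = 11  # eager DDP iterations to run before capture, as cited in the question thread


def configure_nccl_env():
    # Assumption: NCCL >= 2.9.6. The CUDA graphs notes ask for DDP's async error
    # handling to be disabled before the process group is created.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"


def train_with_graph(rank, world_size, model, batches, lr=1e-4):
    """Hypothetical per-rank worker.

    `model` is an nn.Module; `batches` yields (inputs, labels) CPU tensors
    that all share the same shape and dtype.
    """
    # Imported here so the helper above stays usable without a GPU build of torch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    configure_nccl_env()
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    model = model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    batches = iter(batches)

    # Construct DDP and run the eager warmup on a side stream.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        ddp_model = DDP(model, device_ids=[rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=lr)
        for _ in range(WARMUP_ITERS):
            inputs, labels = next(batches)
            x, y = inputs.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)
            loss = criterion(ddp_model(x), y)
            loss.backward()
            optimizer.step()
    # Rejoin the default stream before capturing.
    torch.cuda.current_stream().wait_stream(s)

    # Static buffers: the graph always reads from these exact tensors.
    static_input = torch.zeros_like(x)
    static_target = torch.zeros_like(y)

    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        static_loss = criterion(ddp_model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    # Replay: refill the static buffers in place, then rerun the captured work.
    for inputs, labels in batches:
        static_input.copy_(inputs)
        static_target.copy_(labels)
        g.replay()

    dist.destroy_process_group()
```

Compared with the code in the question, the main differences are the `wait_stream` call after the warmup block, moving each batch to the device during warmup, and setting the NCCL environment variable before `init_process_group`. Whether any of these is the actual cause of the capture error in the question is not confirmed here.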

edited Sep 16, 2023 at 18:59
kshishkin

answered Sep 8, 2023 at 19:31
Reza Akbarian Bafghi
