This is my code, which I am currently running on 4 GPUs:

setup(rank, gpus)


dataset = RandomDataset(input_shape, 80*batch_size, rank)

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
data_iter = iter(dataloader)


model = model(pretrained=True).to(rank)

optimizer = optim.SGD(model.parameters(), lr=0.0001)
criterion = torch.nn.CrossEntropyLoss()

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s):
    print("[MAKING DDP Model]")
    model = DDP(model)
    print("[MODEL CREATED]")

    for i in range(11):
        optimizer.zero_grad(set_to_none=True)
        inputs, labels = next(data_iter)
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

capture_input = torch.empty((batch_size, 3, input_shape, input_shape)).to(rank)
capture_target = torch.argmax(torch.from_numpy(np.eye(1000)[np.random.choice(1000, batch_size)]), axis=1).to(rank)


g = torch.cuda.CUDAGraph()

optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    capture_y_pred = model(capture_input)
    capture_loss = criterion(capture_y_pred, capture_target)
    capture_loss.backward()
optimizer.step()


print("RECORDED")

for i in range(20):
    inputs, label = next(data_iter)
    capture_input.copy_(inputs)
    capture_target.copy_(label)
    g.replay()
    optimizer.step()


print("DATASET DONE")

RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Does anyone know how to solve this problem?

  • I ran into similar issues with DDP training and CUDA graphs. In my case the problem came from this line: github.com/pytorch/pytorch/blob/…. It looks like this operation is not supported during stream capture. At this point I'm not sure whether CUDA graphs support DDP training, and I couldn't find any official documentation on this either. Commented Mar 30, 2023 at 8:01
  • discuss.pytorch.org/t/… According to this, you need to run 11 warmup iterations, which you are already doing, so I guess your problem is not the same as mine. Commented Mar 30, 2023 at 9:31
  • But please try the other suggestions in that answer. The docs list the steps required for DDP + CUDA graphs. Commented Mar 30, 2023 at 9:42

1 Answer

According to the official documentation (link), the DDP model must be constructed in a side-stream context before full-backward capture. You can therefore wrap the model before the warmup step, and then make the current stream wait on the side stream before continuing. For instance:

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
torch.cuda.current_stream().wait_stream(s)
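
For context, the CUDA graphs notes in the PyTorch documentation list three setup steps for full-backward capture with DDP on NCCL 2.9.6 or later: disable DDP's async error handling before creating the process group, construct DDP in a side-stream context, and run at least 11 DDP-enabled eager iterations before capture. The sketch below puts these together. The helper names (`disable_nccl_async_error_handling`, `build_ddp_and_warm_up`, `WARMUP_ITERS`) are hypothetical, the `.to(rank)` calls assume batches are not already on the device, and the environment variable name is the one used in those notes, so check the documentation for your PyTorch version. It has not been run against the code in the question.

```python
import os

# The docs require at least 11 DDP-enabled eager iterations before capture.
WARMUP_ITERS = 11


def disable_nccl_async_error_handling(env=None):
    """Set the variable the CUDA graphs notes ask for.

    Call this BEFORE torch.distributed.init_process_group(...).
    `env` defaults to os.environ; passing a dict is only for illustration.
    """
    if env is None:
        env = os.environ
    env["NCCL_ASYNC_ERROR_HANDLING"] = "0"
    return env


def build_ddp_and_warm_up(model, optimizer, criterion, batches, rank):
    """Wrap `model` in DDP on a side stream and run the eager warmup there.

    Hypothetical helper: `batches` is an iterator yielding (inputs, labels),
    and the process group is assumed to be initialized already.
    """
    # Imported lazily so the module can be imported without a GPU build.
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        ddp_model = DDP(model)
        for _ in range(WARMUP_ITERS):
            inputs, labels = next(batches)
            optimizer.zero_grad(set_to_none=True)
            loss = criterion(ddp_model(inputs.to(rank)), labels.to(rank))
            loss.backward()
            optimizer.step()
    # Rejoin the default stream before starting the graph capture.
    torch.cuda.current_stream().wait_stream(s)
    return ddp_model
```

Compared with the code in the question, the only structural differences are the environment variable and the final `wait_stream` call after the warmup block, so those are the first two things worth trying.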