I have a set of independent Boolean functions that (hypothetically) can be executed in parallel, and I want to evaluate those same functions repeatedly. See the code below, in which the outputs of the functions ping-pong between the A and B memory locations. How can I force the lines marked "IN PARALLEL" to actually run in parallel on an NVIDIA GPU with CUDA installed?
import torch
A = torch.tensor([True, False, True]).to('cuda') # Initial values.
B = torch.tensor([False, True, True]).to('cuda') # Values don't matter. Will write over them in the first iteration.
n_steps = 100
for step in range(n_steps):
    # Use the values in A to compute new values in B.
    # How to run the three lines below IN PARALLEL?
    B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2]) # func1: Y0 = (X0 | X1) & X2
    B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2]) # func2: Y1 =  X0 | X1  | X2
    B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2]) # func3: Y2 =  X0 & X1  & X2
    # Only after the three lines above finish (and B has its new values) should the lines below run.
    # Use the values in B to compute new values in A.
    # Note that these functions are identical to the ones above (which may allow for some additional acceleration?).
    # How to run the three lines below IN PARALLEL?
    A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2]) # func1: Y0 = (X0 | X1) & X2
    A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2]) # func2: Y1 =  X0 | X1  | X2
    A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2]) # func3: Y2 =  X0 & X1  & X2
    # Only after the three lines above finish (and A has its new values) should the next loop iteration run.
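For what it's worth, here is the direction I have been exploring: rewriting the update as whole-tensor ops so each `logical_*` call is a single batched kernel rather than three separate scalar assignments. The `step_all` helper, the CPU fallback, and the rebinding of `A` each step (replacing the explicit A/B ping-pong) are just my own sketch, not anything from a library, and I'm not sure this actually achieves the per-function parallelism I'm after:

```python
import torch

def step_all(X):
    # All three Boolean functions as tensor ops; each op is one kernel
    # launch, so the GPU handles the elementwise work in parallel.
    or01  = torch.logical_or(X[0], X[1])    # X0 | X1 (shared by func1 and func2)
    and01 = torch.logical_and(X[0], X[1])   # X0 & X1
    return torch.stack([
        torch.logical_and(or01, X[2]),      # func1: (X0 | X1) & X2
        torch.logical_or(or01, X[2]),       # func2:  X0 | X1  | X2
        torch.logical_and(and01, X[2]),     # func3:  X0 & X1  & X2
    ])

device = 'cuda' if torch.cuda.is_available() else 'cpu'
A = torch.tensor([True, False, True], device=device)  # initial values
for _ in range(100):
    A = step_all(A)  # each step completes before the next one starts
```

Rebinding `A` to the freshly built output tensor gives the same read-all-then-write-all semantics as the A/B double buffer, since `step_all` reads the old tensor in full before the new one replaces it.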