
I have a set of Boolean functions that are independent, and (hypothetically) can be executed in parallel. I want to call those same functions repeatedly. See the code below, in which the outputs of the functions ping-pong between the A and B memory locations. How can I force the "IN PARALLEL" lines to be run in parallel on an NVIDIA GPU with CUDA installed?

import torch

A = torch.tensor([True, False, True]).to('cuda')  # Initial values.
B = torch.tensor([False, True, True]).to('cuda')  # Values don't matter. Will write over them in the first iteration.

n_steps = 100

for step in range(n_steps):

    # Use values in A to compute new values in B.
    # How to run the three lines below IN PARALLEL?
    B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2])  # func1: Y0 = (X0 | X1) & X2
    B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2])  # func2: Y1 = X0 | X1 | X2
    B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2])  # func3: Y2 = X0 & X1 & X2

    # Only after the three lines above finish their computation (and B has new values) should the lines below be run.


    # Use values in B to compute new values in A.
    # Note that the functions below are identical to the ones above (which may allow for some additional acceleration?)
    # How to run the three lines below IN PARALLEL?
    A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2])  # func1: Y0 = (X0 | X1) & X2
    A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2])  # func2: Y1 = X0 | X1 | X2
    A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2])  # func3: Y2 = X0 & X1 & X2

    # Only after the three lines above finish their computation (and A has new values) should the next loop iteration be run.
  • I assume that you have some additional data parallelism (more dimensions to the tensors), because one does not use GPUs when the amount of parallelism is 3. The easiest would be to put the three independent operations on three independent CUDA streams so that they can run concurrently. As all three take the same input, it makes sense to fuse them into one kernel, although one would still not give the three operations to different threads but do them "sequentially" in a single thread. There is pipelining on GPUs, so technically the three operations would be running in parallel. Commented Nov 14, 2024 at 19:26
  • Thanks, @paleonix. The code is just a short example. The real implementation would have thousands (maybe tens of thousands) of Boolean functions to execute, and the individual functions would be more complex. More precisely, the A (and therefore also the B) array would be thousands of bools in length, and each function would operate on 3-10 of those bools. Commented Nov 14, 2024 at 20:22
  • The reason I would not start off putting the different functions to different threads is that doing so naively would result in warp divergence. GPUs at their core are built to do the same thing many times. So "thousands" of different boolean functions does not sound very suitable to GPU computing if there isn't a lot of times each single function is called (32 might be enough). The whole premise reminds me somewhat of this recent question. Commented Nov 14, 2024 at 22:21
  • @paleonix note the n_steps=100 loop in my example code. The functions are called 100 times in the example (or 200 times if you count both the computation of B and then the computation of A). I anticipate needing n_steps to be in the millions. Commented Nov 14, 2024 at 22:45
  • But those are sequential and won't help you with keeping the lanes of a warp busy. Commented Nov 14, 2024 at 22:57
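The CUDA-streams approach suggested in the comments can be sketched roughly as below. This is an illustrative sketch, not a benchmarked solution: whether the three ops actually overlap depends on kernel sizes and the driver, and the CPU fallback exists only so the snippet runs without a GPU.

```python
import torch

# Sketch: issue each independent Boolean op on its own CUDA stream so the
# GPU is free to overlap them. Falls back to plain sequential execution on
# CPU, where streams don't apply.
def step(A, B):
    streams = [torch.cuda.Stream() for _ in range(3)] if A.is_cuda else None

    def run(i, fn):
        if streams is not None:
            with torch.cuda.stream(streams[i]):
                B[i] = fn()
        else:
            B[i] = fn()

    run(0, lambda: (A[0] | A[1]) & A[2])  # func1: Y0 = (X0 | X1) & X2
    run(1, lambda: A[0] | A[1] | A[2])    # func2: Y1 = X0 | X1 | X2
    run(2, lambda: A[0] & A[1] & A[2])    # func3: Y2 = X0 & X1 & X2

    if streams is not None:
        torch.cuda.synchronize()  # barrier: B must be complete before reuse
    return B

device = 'cuda' if torch.cuda.is_available() else 'cpu'
A = torch.tensor([True, False, True], device=device)
B = torch.empty(3, dtype=torch.bool, device=device)
print(step(A, B).tolist())  # [True, True, False]
```

Note that each `B[i] = ...` here is still a tiny kernel; as the comments point out, with parallelism of only 3 the launch overhead will dominate, so this mainly matters once the tensors carry real data parallelism.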

1 Answer


I doubt that this is the best/fastest solution, but using torch.compile does provide acceleration. I haven't yet tested it when scaling up to thousands of Boolean functions.

import torch
from time import time

A = torch.tensor([True, False, True]).to('cuda')  # Initial values.
B = torch.tensor([False, True, True]).to('cuda')  # Values don't matter. Will write over them in the first iteration.

@torch.compile
def process(A, B, n_steps):
    for step in range(n_steps):

        # Use values in A to compute new values in B.
        # How to run the three lines below IN PARALLEL?
        B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2])  # func1: Y0 = (X0 | X1) & X2
        B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2])  # func2: Y1 = X0 | X1 | X2
        B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2])  # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish their computation (and B has new values) should the lines below be run.

        # Use values in B to compute new values in A.
        # Note that the functions below are identical to the ones above (which may allow for some additional acceleration?)
        # How to run the three lines below IN PARALLEL?
        A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2])  # func1: Y0 = (X0 | X1) & X2
        A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2])  # func2: Y1 = X0 | X1 | X2
        A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2])  # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish their computation (and A has new values) should the next loop iteration be run.

    return A

# First run is slow due to compilation
t_start = time()
A = process(A, B, 100)
print(f'First run time: {time()-t_start} seconds')

# Runs are faster subsequently, and can be looped over to effectively increase n_steps
t_start = time()
A = process(A, B, 100)
print(f'Second run time: {time()-t_start} seconds')

Output without @torch.compile decorator:

First run time: 0.12589144706726074 seconds
Second run time: 0.059000492095947266 seconds

Output with @torch.compile decorator:

First run time: 18.201257467269897 seconds
Second run time: 0.007639169692993164 seconds
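
A possible refinement (a sketch only, not benchmarked against the version above): expressing the update as whole-tensor ops instead of three in-place scalar writes. Avoiding mutation tends to play better with torch.compile, and the ping-pong then becomes an ordinary function call, which could also help once there are thousands of functions.

```python
import torch

# Sketch: build the new state out of whole-tensor expressions rather than
# assigning B[0], B[1], B[2] separately.
def update(X):
    x0, x1, x2 = X[0], X[1], X[2]
    return torch.stack(((x0 | x1) & x2,   # func1: Y0 = (X0 | X1) & X2
                        x0 | x1 | x2,     # func2: Y1 = X0 | X1 | X2
                        x0 & x1 & x2))    # func3: Y2 = X0 & X1 & X2

A = torch.tensor([True, False, True])
B = update(A)          # A -> B
A = update(B)          # B -> A
print(A.tolist())      # [False, True, False]
```

On bool tensors, `|` and `&` are equivalent to `torch.logical_or`/`torch.logical_and`, so the logic is unchanged.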