When multiple CPU threads dispatch jobs to a single GPU, what's the best way to differentiate the GPU threads so that the CPU threads don't simply repeat each other?

The following code calculates the sum of two large arrays element by element. The correct result is 3.0 for every element. With 1 CPU thread, the code does the right thing. When run with 8 CPU threads, the output becomes 10 because every thread launches the same kernel over the full array, so the addition is applied 8 times (2.0 + 8 × 1.0 = 10.0). I'm looking for a way to make each CPU thread compute 1/8 of the sum without duplicating the others' work.

#include <iostream>
#include <math.h>
#include <thread>
#include <vector>

#include <cuda.h>


using namespace std;

const unsigned NUM_THREADS = std::thread::hardware_concurrency();  

// Kernel function to add the elements of two arrays
__global__
void add_2(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        y[i] = x[i] + y[i];
    }
}

//
void thread_func(int N, float *x, float *y, int idx_thread)
{   
    cudaSetDevice(0); 

    int blockSize;   
    int minGridSize; 
    int gridSize;    

    cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, add_2, 0, N);
    // Round up according to array size
    gridSize = (N + blockSize - 1) / blockSize;
    //gridSize /= NUM_THREADS +1;  

    cout<<"blockSize: "<<blockSize<<" minGridSize: "<<minGridSize<<" gridSize: "<<gridSize<<endl;

    // Run kernel on 1M elements on the GPU
    add_2<<<gridSize, blockSize>>>(N, x, y);


    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
}


//
int main()
{

    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    //.. begin multithreading ..
    vector<std::thread> t;
    for(int i = 0; i<NUM_THREADS; i++)
        t.push_back(thread(thread_func, N, x, y, i));

    for(int i = 0; i<NUM_THREADS; i++)
        t[i].join();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        if(!(i%10000))
            std::cout<<i<<" "<<y[i]<<std::endl;
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

Output:

blockSize: 1024 minGridSize: 16 gridSize: 1024
..........
blockSize: 1024 minGridSize: 16 gridSize: 1024
0 10
10000 10
20000 10
...
1020000 10
1030000 10
1040000 10
Max error: 7

2 Answers

The solution for this very simple case is to divide up your array into pieces, one piece per thread. For simplicity, so that I don't have to handle a bunch of annoying corner cases, let's assume that your array size (N) is whole-number divisible by NUM_THREADS. It doesn't have to be this way, of course; the arithmetic to divide it up isn't much different, but you have to handle rounding at each segment boundary, which I'd rather avoid (a sketch of the general case appears after the transcript below).

Here's an example that works based on the above assumption. Each thread decides which portion of the array it is responsible for (based on its thread number and the total length) and only works on that section.

$ cat t1460.cu
#include <iostream>
#include <math.h>
#include <thread>
#include <vector>

#include <cuda.h>


using namespace std;

const unsigned NUM_THREADS = 8;

// Kernel function to add the elements of two arrays
__global__
void add_2(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        y[i] = x[i] + y[i];
    }
}

//
void thread_func(int N, float *x, float *y, int idx_thread)
{
    cudaSetDevice(0);

    int blockSize = 512;
    int worksize = N/NUM_THREADS; // assumes whole-number divisibility
    int gridSize = (worksize+blockSize-1)/blockSize;
    cout<<"blockSize: "<<blockSize<<" gridSize: "<<gridSize<<endl;

    // Run kernel on 1M elements on the GPU
    add_2<<<gridSize, blockSize>>>(worksize, x+(idx_thread*worksize), y+(idx_thread*worksize));


    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
}


//
int main()
{

    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    //.. begin multithreading ..
    vector<std::thread> t;
    for(int i = 0; i<NUM_THREADS; i++)
        t.push_back(thread(thread_func, N, x, y, i));

    for(int i = 0; i<NUM_THREADS; i++)
        t[i].join();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        if(!(i%10000))
            std::cout<<i<<" "<<y[i]<<std::endl;
        maxError = fmaxf(maxError, fabs(y[i]-3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
$ nvcc t1460.cu -o t1460 -std=c++11
$ cuda-memcheck ./t1460
========= CUDA-MEMCHECK
blockSize: blockSize: 512 gridSize: 256512blockSize:  gridSize:
blockSize: blockSize: 512blockSize:  gridSize: 256512
 gridSize: 256
blockSize: 512 gridSize: 256
blockSize: 512 gridSize: 256
512 gridSize: 256
256
512 gridSize: 256
0 3
10000 3
20000 3
30000 3
40000 3
50000 3
60000 3
70000 3
80000 3
90000 3
100000 3
110000 3
120000 3
130000 3
140000 3
150000 3
160000 3
170000 3
180000 3
190000 3
200000 3
210000 3
220000 3
230000 3
240000 3
250000 3
260000 3
270000 3
280000 3
290000 3
300000 3
310000 3
320000 3
330000 3
340000 3
350000 3
360000 3
370000 3
380000 3
390000 3
400000 3
410000 3
420000 3
430000 3
440000 3
450000 3
460000 3
470000 3
480000 3
490000 3
500000 3
510000 3
520000 3
530000 3
540000 3
550000 3
560000 3
570000 3
580000 3
590000 3
600000 3
610000 3
620000 3
630000 3
640000 3
650000 3
660000 3
670000 3
680000 3
690000 3
700000 3
710000 3
720000 3
730000 3
740000 3
750000 3
760000 3
770000 3
780000 3
790000 3
800000 3
810000 3
820000 3
830000 3
840000 3
850000 3
860000 3
870000 3
880000 3
890000 3
900000 3
910000 3
920000 3
930000 3
940000 3
950000 3
960000 3
970000 3
980000 3
990000 3
1000000 3
1010000 3
1020000 3
1030000 3
1040000 3
Max error: 0
========= ERROR SUMMARY: 0 errors
$
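
As a side note on the divisibility assumption: handling an N that is not a multiple of NUM_THREADS only changes the host-side chunking arithmetic in thread_func. Here is a minimal sketch of that variant (the chunk/offset arithmetic is mine, not part of the code above):

// Sketch: general-case partitioning; the last thread may get a shorter chunk.
void thread_func_general(int N, float *x, float *y, int idx_thread)
{
    cudaSetDevice(0);

    int blockSize = 512;
    // Round the chunk size up so NUM_THREADS chunks always cover N elements.
    int chunk  = (N + (int)NUM_THREADS - 1) / (int)NUM_THREADS;
    int offset = idx_thread * chunk;
    int worksize = N - offset;               // elements remaining from this offset
    if (worksize <= 0) return;               // nothing left for this thread
    if (worksize > chunk) worksize = chunk;  // clamp to one chunk

    int gridSize = (worksize + blockSize - 1) / blockSize;
    add_2<<<gridSize, blockSize>>>(worksize, x + offset, y + offset);
    cudaDeviceSynchronize();
}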

Of course, for this trivial example, there's no particular benefit to using 8 CPU threads. I assume that what was being asked here was for a design pattern to enable other activity. Multiple CPU threads might be a convenient way to arrange other work. For example, I might have a system that is processing data from 4 cameras. It might be convenient to organize my camera processing as 4 independent threads, one for each camera. That system might only have 1 GPU, and it's certainly plausible that each of the 4 threads might want to issue independent work to that GPU. This design pattern could easily be adapted to that use case, to pick one example. It might even be that the 4 camera CPU threads would need to combine some data into a single array on the GPU, and this pattern could be used in that case as well.
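
If the CPU threads really are submitting independent work, a common companion to this pattern (not shown in the code above, so treat it as an illustrative sketch) is to give each CPU thread its own CUDA stream, so that launches and synchronization from different threads don't all serialize on the default stream:

// Sketch: each CPU thread launches its chunk of work on its own stream.
void thread_func_stream(int N, float *x, float *y, int idx_thread)
{
    cudaSetDevice(0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int blockSize = 512;
    int worksize = N / NUM_THREADS;   // whole-number divisibility assumed, as above
    int offset   = idx_thread * worksize;
    int gridSize = (worksize + blockSize - 1) / blockSize;

    // The fourth launch-configuration argument selects this thread's stream.
    add_2<<<gridSize, blockSize, 0, stream>>>(worksize, x + offset, y + offset);

    // Wait only for this thread's stream rather than the whole device.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}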

When multiple CPU threads dispatch jobs to a single GPU, what's the best way to differentiate the GPU threads so that the CPU threads don't simply repeat each other?

Let me answer that more generally, rather than only in terms of your specific example:

  • There is no inherent benefit in using multiple CPU threads to enqueue work on a GPU. If you have each thread wait on its own CUDA stream (queue), then it could make sense, but that's not necessarily the right thing to do.
  • Unless you're explicitly scheduling memory transfers, there's no guaranteed benefit to splitting the work you schedule into small pieces. You could just schedule a single kernel to add up the entire array. Remember - a kernel is made up of thousands or millions of 'threads' on the GPU side; CPU threads don't help GPU parallelism at all.
  • It makes more sense to have different CPU threads schedule work when each of them discovers, independently of the others, that the work exists.
  • It is often a good idea to write a kernel's output somewhere different from its input. That requires more memory during the computation, but it prevents the kind of problems you describe - overlapping updates of the same values, having to carefully consider which scheduled kernel executes first, etc. Thus, for example, you could have implemented:
    __global__ void add_2(int n, float* z, const float* x, const float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            z[i] = x[i] + y[i];   // write to a separate output array z
        }
    }
    
    If you can't do that, then you need careful partitioning of the input and output arrays when scheduling the work, as suggested in @RobertCrovella's answer.
  • Use the __restrict__ keyword (even though it's not standard C++) to indicate that the regions the parameters point to don't overlap. That speeds things up; see the sketch just below.
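
For instance, the separate-output-array kernel above could be written with the qualifiers added like this (a sketch, not code from either answer):

__global__ void add_2(int n, float* __restrict__ z,
                      const float* __restrict__ x,
                      const float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __restrict__ tells the compiler that x, y and z never alias,
        // which can enable better load/store scheduling.
        z[i] = x[i] + y[i];
    }
}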
