
I am currently comparing the performance of OpenCL code on the GPU with C++ code on the CPU. I have written programs for both that compute the sum z = x + y, where z, x, and y are two-dimensional arrays (matrices). After testing these programs, I found that the CPU computes this sum much more efficiently than the GPU, because of the slow transfer of data over the PCI bus between the CPU and the GPU. Now I want to determine how many more sums would be required to make the GPU more efficient than the CPU. I plan to do this by extending the sum z = x + y to z = x + y + y + y + y + ... and so on.

Will it be possible to make using the GPU more efficient than the CPU just by increasing the number of sums for this specific problem?

Just as an FYI: I am using an NVIDIA GeForce GT 640 graphics card and an Intel Core i5 CPU.

Any help will be greatly appreciated.

EDIT:

Below I have attached my code on the CPU:

int main(int argc, const char * argv[])
{

    //This value determines the size of the nxn (square array)             
    int n = 1000;

    //Allocating the memory for the nxn arrays of floats.
    float **x = (float**)malloc(sizeof(float*)*n);
    float **y = (float**)malloc(sizeof(float*)*n);
    float **z = (float**)malloc(sizeof(float*)*n);


    //Initializing the arrays.
    for(int i = 0; i<n; i++){
        x[i] = (float*)malloc(sizeof(float)*n);
        y[i] = (float*)malloc(sizeof(float)*n);
        z[i] = (float*)malloc(sizeof(float)*n);

        for(int j = 0; j<n; j++){
            x[i][j] = i+j;
            y[i][j] = i+j;

        }
    }

    for(int i = 0; i<n; i++){
        for(int j = 0; j<n; j++){

            z[i][j] = x[i][j] + y[i][j];
            for(int k = 0; k < 100; k++){
                z[i][j] += y[i][j];
            }
        }
    }

    return 0;

}

And here is the C++ host code using OpenCL (used to copy the data and execute the kernel on the GPU):

int n = 1000;

for(int i = 0; i<n; i++)
    {
        //Writing the data from the host to the device
        err = clEnqueueWriteBuffer(queue, d_xx, CL_TRUE, 0, sizeof(float)*n, h_xx[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_xx" << std::endl;
            exit(1);
        }

        err = clEnqueueWriteBuffer(queue, d_yy, CL_TRUE, 0, sizeof(float)*n, h_yy[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_yy" << std::endl;
            exit(1);
        }

        //Setting the Kernel Arguments
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_xx);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_xx." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_yy);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_yy." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_zz);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_zz." << std::endl;
        }

        work_units_per_kernel = n;

        //Executing the Kernel
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel, NULL, 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not execute kernel." << std::endl;
            exit(1);
        }

        //Reading the Data from the Kernel
        err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, n*(sizeof(float)), h_zz[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not read data from kernel." << std::endl;
            exit(1);
        }

    }

And lastly the kernel code executed on the GPU:

__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

    int i = get_global_id(0);

    d_cc[i] = d_aa[i] + d_bb[i];


    for(int j = 0; j < 100; j++){
        d_cc[i] += d_bb[i];
    }


}
  • Why don't you post your code here? The GPU wins if you parallelize. Also, the GPU might have faster memory. Do you parallelize in your program? Because ONE CPU thread will perform better than ONE GPU "thread". Commented Sep 1, 2013 at 3:35
  • @SigTerm Thanks for your reply. I have attached some snippets of my code. I hope they help clarify things as to whether I parallelized in my program or not. Commented Sep 1, 2013 at 4:41
  • Looks to me like a classic case of doing so little computation that the operation as a whole is memory bound, so unless you can do more with the data on the GPU, bus speed is going to lose more than the GPU itself gains. Commented Sep 1, 2013 at 5:22

2 Answers


For n = 1000*1000 you are getting to the point where copying, operating, and copying back is worth it. As DarkZero pointed out, Global Memory is NOT optimal, so if you can cache your Global Memory to Local Memory or Thread Memory and use Local Work Groups, this will help tremendously on both the CPU and GPU.

Let's start with the Kernel. d_cc is referenced 100 times from Global Memory. A simple change in this case is to cache the global value in thread-private memory, and then copy the private result back to global memory at the end.

 __kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

     int i = get_global_id(0);

     float t_d_cc = d_aa[i] + d_bb[i]; //make a thread only version of d_cc

     for(int j = 0; j < 100; j++){
         t_d_cc += d_bb[i];
     }

     d_cc[i] = t_d_cc; //copy the thread only back to global
} 

Another change, dependent on hardware, is to cache d_aa and d_bb into local memory. This lets OpenCL take advantage of batch copies from Global Memory. This can be a bit more challenging because each OpenCL device has different sizes and multiples of Local Workgroup sizes that can be used.

For example, my i5 has a Maximum Workgroup size of 1024 and a Workgroup multiple of 1, so my Local Workgroups can be anything from 1 to 1024. My ATI-7970 has values of 256 and 64, respectively, so my Local Workgroups need to be 64, 128, and so on. This is much more restrictive.

 __kernel void arraysum(__global const float *d_aa,
                        __local float *l_d_aa,
                        __global const float *d_bb,
                        __local float *l_d_bb,
                        __global float *d_cc,
                        __local float *l_d_cc)
{

//In this example, get_global_id(1) is the row and get_global_id(0) is the
//column, so when the kernel is enqueued the local work group size needs to
//be the number of columns.

int i = get_global_id(1)*get_global_size(0) + get_global_id(0); //flat element index
int lid = get_local_id(0); //index within the local row

//Batch-copy this work-item's elements from global into local memory
l_d_aa[lid] = d_aa[i];
l_d_bb[lid] = d_bb[i];

barrier(CLK_LOCAL_MEM_FENCE); //wait until the whole row is cached

l_d_cc[lid] = l_d_aa[lid] + l_d_bb[lid];

for(int j = 0; j < 100; j++){
    l_d_cc[lid] += l_d_bb[lid];
}

d_cc[i] = l_d_cc[lid]; //copy the local result back to global

}

I apologize if I got the algorithm wrong, but hopefully it conveys how to cache global memory into local memory. Again, on the i5 the local workgroup size can be anything from 1 to 1024, but the ATI 7970 is restricted to column counts that are multiples of 64 (64, 128, and so on).

It is conceptually much more difficult, but the performance for OpenCL is much, much better when using this approach.

Community, please feel free to clean up the kernel.


2 Comments

Using local memory doesn't make much sense here, since the accumulator is per work-item. However, the private-memory kernel should work quite well. That kernel + non-blocking calls + a very large N = very good speeds.
I think it is worth it if he can cache an entire row into local, which he can using the CPU. It would be interesting to compare the compiler (gcc or MSVC) code to the OpenCL compiled code on the CPU with the caching.

Many things are slowing you down:

1- Abuse of global memory. Each global memory access is roughly 400 times slower than private memory, and you ONLY use global memory (around 200 reads/writes per work-item). Global memory should only be read at the beginning and written at the end, never used for intermediate values.

2- Your N is very small. The CPU finishes in about 1000 instructions, while the GPU's launch and transfer latencies are far larger than that. Copy operations carry fixed overheads, which is why a single 100 MB copy is much more efficient than many tiny ones.

3- The CPU code is probably being optimized by the compiler (e.g. the repeated additions folded into a multiplication), while the GPU code can't be, since it keeps accessing global memory, which behaves like a volatile variable.

4- Memory reads/writes to/from the device are very expensive; if you include them in the measurement, the CPU easily wins. OpenCL buffer and kernel creation are also expensive. Note that you are also using blocking write calls, which are much slower than non-blocking calls.

