
I compared the performance of an OpenCL code running on the CPU, which simply copies data from one 2D array into another, to a pure C++ code that does the same thing. I used a single workgroup in the OpenCL code to make a fair comparison, along with Intel's OpenCL drivers and the Intel compiler. The OpenCL code is about 5 times slower than the C++ code. The compiler gives the following message for the copy loop:

loop was transformed to memset or memcpy.

Any suggestions on how to get the OpenCL code up to speed with the C++ code?

Thanks

OpenCL host code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
#include <string>
#include <vector>
#include <CL/cl.hpp>

int main(int argc, char **argv)
{
    // Create the two input vectors
    const int N = 8192;
    double *in = new double[N*N]; 
    double *out = new double[N*N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            in[i*N + j] = i + j;
            out[i*N + j] = 0.;
        }


    double time;
    std::clock_t start;
    int niter = 100;

    cl_int cl_err;

    std::vector<cl::Platform> platforms;
    cl_err = cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    cl_err = platforms.at(1).getDevices(CL_DEVICE_TYPE_CPU,
                                        &devices);

    cl_context_properties context_properties[3] = {CL_CONTEXT_PLATFORM,
                                    (cl_context_properties)(platforms.at(1)()),
                                                   0};
    cl::Context context = cl::Context(devices, 
                                      context_properties, 
                                      NULL, NULL, &cl_err);

    cl::Buffer buffer_in = cl::Buffer(context, 
                                      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                                      N*N*sizeof(double), 
                                      in, &cl_err);

    cl::Buffer buffer_out = cl::Buffer(context, 
                                       CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, 
                                       N*N*sizeof(double),
                                       out, &cl_err);

    cl::CommandQueue queue = cl::CommandQueue(context, devices.at(0), 0, &cl_err);

    std::ifstream sourceFile("vector_copy.cl");
    std::string sourceCode((std::istreambuf_iterator<char>(sourceFile)),
                            std::istreambuf_iterator<char>());
    cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(),
                                sourceCode.length()+1));

    cl::Program program(context, source, &cl_err);

    cl_err = program.build(devices, NULL, NULL, NULL);

    cl::Kernel kernel(program, "vector_copy", &cl_err);

    cl_err = kernel.setArg(0, buffer_in); 
    cl_err = kernel.setArg(1, buffer_out);
    cl_err = kernel.setArg(2, N);

    cl::NDRange global(N);
    cl::NDRange local(N);

    start = std::clock();
    for (int n=0; n < niter; n++) {
        cl_err = queue.enqueueNDRangeKernel(kernel,
                                            cl::NullRange,
                                            global,
                                            local,
                                            NULL, NULL);

        cl_err = queue.finish();
    }

    time =  (std::clock() - start)/(double)CLOCKS_PER_SEC;
    std::cout << "Time/iteration OpenCL (s) = " << time/(double)niter << std::endl;

    delete[] in;
    delete[] out;

    return 0;
}

OpenCL kernel code:

__kernel void vector_copy(__global const double* restrict in, 
                          __global double* restrict out,
                         const int N) 
{

    int i = get_global_id(0);
    int j;

    for (j=0; j<N; j++) {
        out[j + N*i] = in[j + N*i];
    }

}

C++ code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>

const int N = 8192;

int main(int argc, char **argv)
{
    double *in = new double[N*N]; 
    double *out = new double[N*N];
    // Create the two input vectors
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            in[j + N*i] = i + j;
            out[j + N*i] = 0.;
        }


    std::clock_t start;
    int niter = 100;

    start = std::clock();
    for (int n=0; n < niter; n++) {
        for (int i=0; i<N; i++)
            for (int j=0; j<N; j++) {
                out[j + N*i] = in[j + N*i];
            }

    }

    double time =  (std::clock() - start)/(double)CLOCKS_PER_SEC;
    std::cout << "Time/iteration C = " << time/(double)niter << std::endl;

    delete[] in;
    delete[] out;

    return 0;
}
  • Can you supply the disassembly for the loop in both cases? The Intel compiler is quite capable of rearranging your nested loop structure in the pure C++ version, so check the compiled binary or find a way to inhibit it. Commented Nov 20, 2013 at 7:47
  • Doesn't doing the matrix computation on the GPU involve the extra cost of transferring the input and output from/to the CPU? Commented Nov 20, 2013 at 9:17
  • I recommend adding something to the C++ and OpenCL code like out[j] = in[j]*2+1;. The +1 prevents the compiler from optimizing the copy away, yet it's so light that it shouldn't affect the actual memory-transfer timing at all. Commented Nov 20, 2013 at 19:13
  • That didn't really change the result. Commented Nov 20, 2013 at 21:45

2 Answers


The Intel OpenCL compiler is able to vectorize across work-items: essentially, a single function instance runs, say, 8 work-items at the same time in different SSE registers.

Your particular kernel does not get vectorized that way, but it doesn't really matter. I tested your program using Visual Studio 2010 and the latest Intel OpenCL SDK. I was forced to reduce N from 8192 to 4096 because the integrated GPU I have limits the maximum OpenCL buffer size to 128 MB, even if just the CPU is used.

My results: your OpenCL kernel gave me around 6956 MB/s of bandwidth. Compare it with this trivially changed kernel, which is called with N*N as the global size and NULL as the local size (if we don't care about local memory at all, then for CPUs we should leave it undefined):

__kernel void vector_copy2(__global const double* restrict in, 
                      __global double* restrict out) 
{
  int i = get_global_id(0);
  out[i] = in[i];
}

This gave about the same result (7006 MB/s). This kernel was actually vectorized across work-items, as can be verified using the Intel OpenCL kernel compiler. It produces one kernel that processes some multiple of work-items (e.g. 4) per iteration and one kernel for a single work-item, then runs the vectorized kernel until it has to fall back to the single-work-item kernel for the last few work-items.

The C++ code gave 6494 MB/s, so it's quite in line. I don't think it would even be possible for ICC to make it 5x faster.

I noticed your code uses platforms.at(1); what is at platform 0 on your computer?

Remember that if you don't care about local memory at all (you don't call get_local_id in your kernels), you should treat the local size for enqueueNDRangeKernel as a simple magic parameter: either leave it NULL or try to find a value that produces the fastest results.
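As a sketch of that advice, the enqueue from the question with the local size left to the runtime would look like the fragment below (it reuses the question's `queue` and `kernel` objects together with the `vector_copy2` kernel and N*N global work-items, so it is not standalone):

```cpp
// Pass cl::NullRange as the local size so the runtime picks the
// workgroup size itself. With vector_copy2 each work-item copies
// exactly one element, so the global size is N*N.
cl::NDRange global(N * N);
cl_err = queue.enqueueNDRangeKernel(kernel,
                                    cl::NullRange,  // offset
                                    global,
                                    cl::NullRange,  // local size: runtime's choice
                                    NULL, NULL);
cl_err = queue.finish();
```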


4 Comments

Thanks for the informative reply. I indeed get a much faster C++ code than OpenCL code, even with a NULL local size: Time/iteration OpenCL (s) = 0.240084, Time/iteration C++ = 0.0319615. There might be something wrong with the timing scheme, but simply using the time command also gives similar results: real 0m14.284s for OpenCL and real 0m3.594s for the C++ code. How did you get the numbers for the bandwidth? How do they compare to the max bandwidth of the L1/L2/L3 caches? Platform 0 is an Nvidia card with no double-precision capability, so I am not using it.
I think the compiler is optimizing it quite brutally O.o. That speed is almost the speed of a direct copy call through the OpenCL API.
I got around 0.018-something for the OpenCL part and 0.019-something for the C++ part (I did it on a different computer, so I cannot remember more precise numbers). I simply divided the amount of transferred data by the time it took to process it. And funnily enough, that was slightly faster than enqueueCopyBuffer. Those numbers are around the transfer rate of DDR3 RAM, which is not too surprising, as the buffer is 128 MB in my case, so it doesn't fit in the cache. Are you using the latest Intel SDK, and what CPU do you have?
Yeah, I'm using the latest SDK and my CPU is Intel Core i7 CPU Q820 @ 1.73GHz. Thanks for testing out the code. I guess there might be some anomaly on my system. Since the performance of the two codes match on your system, I'll accept your answer.

The OpenCL code, even if optimized, will still perform the copy work-item by work-item, because the OpenCL compiler is only allowed to optimize on a per-work-item basis. The C++ loop, on the other hand, will probably be optimized by the compiler into a memcpy() call (as the compiler message is telling you).

If you disable the compiler optimizations, the OpenCL version will perform much faster in comparison.

BTW, is there a reason for doing the copy by hand? You have memcpy() in C++ and clEnqueueCopyBuffer() in OpenCL for this purpose, and I think the latter is what you should use.

3 Comments

1) The documentation for Intel's OpenCL compiler says that it can vectorize across work-items and pack them into different SIMD lanes. 2) I am using OpenCL on the CPU, not the GPU (see the original post), in order to compare a C++ code and an OpenCL code. 3) PDE-solver algorithms are memory-bandwidth limited, not compute limited. I wanted to test an extreme case of this where the kernel does 2 memory accesses and 0 computations.
1) Yes, it can vectorize, but that does not mean it will perform like a memcpy. 2) The actual device does not matter; what matters is that the OpenCL code cannot be optimized by the compiler, while the C++ code is being optimized into a simple memcpy(). You cannot compare the two, since a manual copy can never achieve the speed of a DMA copy.
Typically what one does when solving PDEs is something like out[i] = (in[i+1] - in[i-1])/(2.*dx). I simply removed the addition and division operations here and made the memory access pattern simpler (same input and output elements instead of adjacent input elements) to get a feel for how OpenCL performs before porting some of the codes I have.
