
I compared the performance of an OpenCL code running on the CPU, which simply copies data from one 2D array into another, to a pure C++ code that does the same thing. I used a single workgroup in the OpenCL code to make a fair comparison, along with Intel's OpenCL drivers and the Intel compiler. The OpenCL code is about 5 times slower than the C++ code. The compiler gives the following message for the copy loop:

loop was transformed to memset or memcpy.

Any suggestions on how to get the OpenCL code up to speed with the C++ code?

Thanks

OpenCL host code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
#include <string>
#include <vector>
#include <CL/cl.hpp>

int main(int argc, char **argv)
{
    // Create the two input vectors
    const int N = 8192;
    double *in = new double[N*N]; 
    double *out = new double[N*N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            in[i*N + j] = i + j;
            out[i*N + j] = 0.;
        }


    double time;
    std::clock_t start;
    int niter = 100;

    cl_int cl_err;

    std::vector<cl::Platform> platforms;
    cl_err = cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    cl_err = platforms.at(1).getDevices(CL_DEVICE_TYPE_CPU,
                                        &devices);

    cl_context_properties context_properties[3] = {CL_CONTEXT_PLATFORM,
                                    (cl_context_properties)(platforms.at(1)()),
                                                   0};
    cl::Context context = cl::Context(devices, 
                                      context_properties, 
                                      NULL, NULL, &cl_err);

    cl::Buffer buffer_in = cl::Buffer(context, 
                                      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                                      N*N*sizeof(double), 
                                      in, &cl_err);

    cl::Buffer buffer_out = cl::Buffer(context, 
                                       CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, 
                                       N*N*sizeof(double),
                                       out, &cl_err);

    cl::CommandQueue queue = cl::CommandQueue(context, devices.at(0), 0, &cl_err);

    std::ifstream sourceFile("vector_copy.cl");
    std::string sourceCode((std::istreambuf_iterator<char>(sourceFile)),
                            std::istreambuf_iterator<char>());
    cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(),
                                sourceCode.length()+1));

    cl::Program program(context, source, &cl_err);

    cl_err = program.build(devices, NULL, NULL, NULL);

    cl::Kernel kernel(program, "vector_copy", &cl_err);

    cl_err = kernel.setArg(0, buffer_in); 
    cl_err = kernel.setArg(1, buffer_out);
    cl_err = kernel.setArg(2, N);

    cl::NDRange global(N);
    cl::NDRange local(N);

    start = std::clock();
    for (int n=0; n < niter; n++) {
        cl_err = queue.enqueueNDRangeKernel(kernel,
                                            cl::NullRange,
                                            global,
                                            local,
                                            NULL, NULL);

        cl_err = queue.finish();
    }

    time =  (std::clock() - start)/(double)CLOCKS_PER_SEC;
    std::cout << "Time/iteration OpenCL (s) = " << time/(double)niter << std::endl;

    delete[] in;
    delete[] out;

    return 0;
}

OpenCL kernel code:

__kernel void vector_copy(__global const double* restrict in, 
                          __global double* restrict out,
                         const int N) 
{

    int i = get_global_id(0);
    int j;

    for (j=0; j<N; j++) {
        out[j + N*i] = in[j + N*i];
    }

}

C++ code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>

const int N = 8192;

int main(int argc, char **argv)
{
    double *in = new double[N*N]; 
    double *out = new double[N*N];
    // Create the two input vectors
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            in[j + N*i] = i + j;
            out[j + N*i] = 0.;
        }


    std::clock_t start;
    int niter = 100;

    start = std::clock();
    for (int n=0; n < niter; n++) {
        for (int i=0; i<N; i++)
            for (int j=0; j<N; j++) {
                out[j + N*i] = in[j + N*i];
            }

    }

    double time =  (std::clock() - start)/(double)CLOCKS_PER_SEC;
    std::cout << "Time/iteration C = " << time/(double)niter << std::endl;

    delete[] in;
    delete[] out;

    return 0;
}
  • Can you supply the disassembly for the loop in both cases? The Intel compiler is quite capable of rearranging your nested loop structure in the pure C++ version, so check the compiled binary or find a way to inhibit it. Commented Nov 20, 2013 at 7:47
  • Doesn't doing the matrix computation on the GPU involve the extra cost of transferring the input and output from/to the CPU? Commented Nov 20, 2013 at 9:17
  • I recommend adding something to the C++ and OpenCL code like out[j] = in[j]*2+1;. The +1 prevents the compiler from optimizing the copy away, yet it's so light that it shouldn't affect the actual memory-transfer timing at all. Commented Nov 20, 2013 at 19:13
  • That didn't really change the result. Commented Nov 20, 2013 at 21:45

2 Answers


The Intel OpenCL compiler is able to vectorize across work-items: essentially, a single function instance runs, say, 8 work-items at the same time in different SSE registers.

Your particular kernel does not get vectorized that way, but it doesn't really matter. I tested your program using Visual Studio 2010 and the latest Intel OpenCL SDK. I was forced to reduce N from 8192 to 4096 because the integrated GPU I have limits the maximum OpenCL buffer size to 128 MB, even if just the CPU is used.

My results: your OpenCL kernel gave me around 6956 MB/s of bandwidth. Compare it with this trivially changed kernel, which is called with N*N as the global size and NULL as the local size (if we don't care about local memory at all, then for CPUs we should leave it undefined):

__kernel void vector_copy2(__global const double* restrict in, 
                      __global double* restrict out) 
{
  int i = get_global_id(0);
  out[i] = in[i];
}

This gave about the same result (7006 MB/s). This kernel was actually vectorized across work-items, as can be verified using the Intel OpenCL kernel compiler. It produces one kernel that processes some multiple of work-items (e.g. 4) per iteration and one kernel for a single work-item, then runs the vectorized kernel until it has to fall back to the single-work-item kernel for the last few work-items.

The C++ code gave 6494 MB/s, so it's quite in line. I don't think it would even be possible for ICC to make it 5x faster.

I noticed your code uses platforms.at(1); what is at platform 0 on your computer?

Remember that if you don't care about local memory at all (you don't call get_local_id in your kernels), you should treat the local size for enqueueNDRangeKernel as a simple magic parameter: either leave it NULL or try to find a value that produces the fastest results.
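As a sketch of that advice, the enqueue from the question with the local size left to the runtime would look like the fragment below (it reuses the question's `queue` and `kernel` objects together with the `vector_copy2` kernel and N*N global work-items, so it is not standalone):

```cpp
// Pass cl::NullRange as the local size so the runtime picks the
// workgroup size itself. With vector_copy2 each work-item copies
// exactly one element, so the global size is N*N.
cl::NDRange global(N * N);
cl_err = queue.enqueueNDRangeKernel(kernel,
                                    cl::NullRange,  // offset
                                    global,
                                    cl::NullRange,  // local size: runtime's choice
                                    NULL, NULL);
cl_err = queue.finish();
```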


4 Comments

Thanks for the informative reply. I indeed get a much faster C++ code than OpenCL code, even with a NULL local size: Time/iteration OpenCL (s) = 0.240084, Time/iteration C++ = 0.0319615. There might be something wrong with the timing scheme, but simply using the time command also gives similar results: real 0m14.284s for OpenCL and real 0m3.594s for the C++ code. How did you get the numbers for the bandwidth? How do they compare to the max bandwidth of the L1/L2/L3 caches? Platform 0 is an Nvidia card with no double-precision capability, so I am not using it.
I think the compiler is optimizing it quite brutally O.o. That speed is almost the speed of a direct copy call through the OpenCL API.
I got around 0.018-something for the OpenCL part and 0.019-something for the C++ part (I did it on a different computer, so I cannot remember more precise numbers). I simply divided the amount of transferred data by the time it took to process it. And funnily enough, that was slightly faster than enqueueCopyBuffer. Those numbers are around the transfer rate of DDR3 RAM, which is not too surprising, as the buffer is 128 MB in my case, so it doesn't fit in the cache. Are you using the latest Intel SDK, and what CPU do you have?
Yeah, I'm using the latest SDK and my CPU is Intel Core i7 CPU Q820 @ 1.73GHz. Thanks for testing out the code. I guess there might be some anomaly on my system. Since the performance of the two codes match on your system, I'll accept your answer.

The OpenCL code, even if optimized, will still perform the copy work-item by work-item, because the OpenCL compiler is only allowed to optimize on a per-work-item basis. The C++ loop, on the other hand, will probably be optimized by the compiler into a memcpy() call (as the compiler message is telling you).

If you disable the compiler optimizations, the OpenCL version will perform much faster in comparison.

BTW, is there a reason for doing the copy by hand? You have memcpy() in C++ and clEnqueueCopyBuffer() in OpenCL for this purpose, and I think the latter is what you should use.

3 Comments

1) The documentation for Intel's OpenCL compiler says that it can vectorize across work-items and pack them into different SIMD lanes. 2) I am using OpenCL on the CPU, not the GPU (see the original post), in order to compare a C++ code and an OpenCL code. 3) PDE-solver algorithms are memory-bandwidth limited, not compute limited. I wanted to test an extreme case of this where the kernel does 2 memory accesses and 0 computations.
1) Yes, it can vectorize, but that does not mean it will perform like a memcpy. 2) The actual device does not matter; what matters is that the OpenCL code cannot be optimized by the compiler, while the C++ code is being optimized into a simple memcpy(). You cannot compare the two, since a manual copy can never achieve the speed of a DMA copy.
Typically what one does when solving PDEs is something like out[i] = (in[i+1] - in[i-1])/(2.*dx). I simply removed the addition and division operations here and made the memory access pattern simpler (same input and output elements instead of adjacent input elements) to get a feel for how OpenCL performs before porting some of the codes I have.
