
I'm just getting started with OpenCL, so I'm sure there are a dozen things I could do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided access, since all the buffers are linear) of a 2D array of data, I get run times that differ by a factor of anywhere from 2x to 10x. Strangely, the contiguous accesses appear slower.

I'm using PyOpenCL to test.

Here are the two kernels of interest (reduce and reduce2), plus the kernel that generates the data feeding them (forcesCL):

kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2,float k, global float4 *frcs)
{
    int i=get_global_id(0);
    int j=get_global_id(1);
    int idx=i+get_global_size(0)*j;    // linear index into frcs; i varies fastest

    float3 c=chrgs[i].xyz-chrgs2[j].xyz;
    float l=length(c);
    frcs[idx].xyz= (l==0 ? 0 
                         : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w=0;
}

kernel void reduce(global float4 *frcs,ulong k,global float4 *result)
{
    ulong gi=get_global_id(0);
    ulong gs=get_global_size(0);
    float3 tmp=0;

    // each work-item strides through the buffer: frcs[gi], frcs[gi+gs], ...
    for(ulong i=0;i<k;i++)
        tmp+=frcs[gi+i*gs].xyz;
    result[gi].xyz=tmp;
}

kernel void reduce2(global float4 *frcs,ulong k,global float4 *result)
{
    ulong gi=get_global_id(0);
    ulong gs=get_global_size(0);
    float3 tmp=0;

    // each work-item reads a contiguous run: frcs[gi*gs], frcs[gi*gs+1], ...
    for(ulong i=0;i<k;i++)
        tmp+=frcs[gi*gs+i].xyz;
    result[gi].xyz=tmp;
}

It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Coulomb force between two charges, where the position of each is encoded in the xyz components of a float4 and the charge in the w component. The physics isn't important; it's just a toy for playing with OpenCL.
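Since forcesCL generates all the data, a host-side NumPy reference implementation can be handy for validating the device output. The sketch below is my own (forces_ref is a hypothetical name, not part of the question's code); note that the kernel writes element (i, j) at index i + get_global_size(0)*j, so the kernel's buffer is the transpose of this array's leading axes:

```python
import numpy as np

def forces_ref(a, b, k):
    """Host-side reference for the forcesCL kernel (for checking only)."""
    d = a[:, None, :3] - b[None, :, :3]          # pairwise displacements
    l = np.linalg.norm(d, axis=2)                # distances, shape (na, nb)
    q = a[:, 3, None] * b[None, :, 3]            # pairwise charge products
    with np.errstate(divide="ignore", invalid="ignore"):
        mag = q / (k * l**2)                     # Coulomb magnitude
        f = mag[..., None] * (d / l[..., None])  # magnitude * unit vector
    f[l == 0] = 0                                # match the kernel's l==0 branch
    return f                                     # shape (na, nb, 3)
```

This is vectorized rather than looped, but it computes the same thing per pair as the kernel's `(chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c)` expression.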

I won't go through the PyOpenCL setup unless asked, other than to show the build step:

program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()

I then set up the arrays with random positions and the elementary charge:

a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)

Then I set up scratch space and allocations for the return values:

c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
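Worth pausing on the size of that scratch buffer: c is a 10000 x 10000 array of float4s, which is a lot of global memory. A quick back-of-the-envelope check (plain NumPy arithmetic, no device code):

```python
import numpy as np

itemsize = np.dtype(np.float32).itemsize      # 4 bytes per component
scratch_bytes = 10000 * 10000 * 4 * itemsize  # the c / cc buffer
print(scratch_bytes / 1e9)                    # 1.6 GB
```

At 1.6 GB, the buffer alone is a large fraction of a typical laptop GPU's memory, which is worth keeping in mind when timing anything that touches it.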

Then I try to run this in each of two ways -- once using reduce(), and once using reduce2(). The only difference should be which axis I'm summing over:

%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)

evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint64(b.shape[0]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)

Notice that in this second version I've swapped the arguments to forcesCL so I can check the results against the first method:

%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)

evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint64(a.shape[0]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)

The version using the reduce() kernel gives me times on the order of 140 ms; the version using the reduce2() kernel gives times on the order of 360 ms. The values returned are the same, save a sign change, because the positions are subtracted in the opposite order.
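That sign-flip relationship is easy to sanity-check on the host with a small stand-in matrix (plain NumPy, a hypothetical 5x5 in place of the real force buffer): swapping the charge arrays transposes the force matrix and negates it, so reduce2() on the swapped run should return exactly the negative of what reduce() returns on the original run.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 5))  # stand-in for the (j, i) force matrix

# swapping aa and bb in forcesCL transposes the matrix and flips the sign
F_swapped = -F.T

r = F.sum(axis=0)           # what reduce() computes: sum over j
s = F_swapped.sum(axis=1)   # what reduce2() computes on the swapped run

print(np.allclose(s, -r))   # True: same values, opposite sign
```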

If I do the forcesCL step once and then time just the two reduce kernels, the difference is much more pronounced -- on the order of 30 ms versus 250 ms.

I wasn't expecting any difference, but if I had been, I would have expected the contiguous accesses to perform better, not worse.

Can anyone give me some insight into what's happening here?

Thanks!

  • Give it a try: declare tmp as float4 in your reduce kernels and set the w component to 0. Don't use .xyz in the loop; accumulate the full vector instead, and only use the subvector pattern (.xyz) at the output step. Commented May 22, 2015 at 5:50
  • Thanks, but declaring tmp as float4 doesn't seem to have any measurable impact on run time. Commented May 22, 2015 at 6:00
  • What hardware does your kernel run on? Commented May 22, 2015 at 6:03
  • Nvidia GeForce GT 750M, on an Apple 15" Retina MacBook Pro, using the Apple OpenCL drivers. Commented May 22, 2015 at 6:10

1 Answer


This is a clear case of memory coalescing. It is not about how the index is used (rows versus columns), but about how the memory is actually accessed by the hardware. It is just a matter of following, step by step, how the real accesses are performed and in which order.

Let's analyze it properly:

Suppose the work-items are divided into local blocks of size N.

For the first case:

WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...

Since these work-items run in parallel, their memory accesses occur at the same time. So the memory controller is asked for:

First iteration: 0, 1, 2, 3 ... N-1  -> groups into a few memory accesses
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1  -> groups into a few memory accesses
...

In that case, in each iteration the memory controller groups all N work-item requests into a few wide (e.g. 256-bit) reads from global memory. There is no need to cache anything, since the data can be discarded after processing.

For the second case:

WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)   
...

The memory controller is requested:

First iteration: 0, Gs, 2Gs, 3Gs ... -> scattered reads, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 ... -> scattered reads, no grouping
...

In this case the memory controller is not working in an efficient mode. It would still be fine if the cache were infinite, but it is not. It can serve some reads from cache, because the addresses requested by different work-items sometimes coincide across iterations of the size-k loop, but not all of them.
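One way to make the grouping argument concrete is to count how many cache lines a batch of simultaneous requests spans in each case. The numbers below assume a hypothetical 128-byte line and 32 work-items issuing together; the real figures depend on the hardware, but the contrast is the point:

```python
def lines_touched(byte_addresses, line_size=128):
    """Count the distinct cache lines covered by a batch of addresses."""
    return len({a // line_size for a in byte_addresses})

GS = 10000       # global size, as in the question
N = 32           # work-items issuing requests together (hypothetical)
STRIDE = 16      # sizeof(float4) in bytes

# reduce(): on iteration i, work-item w reads frcs[w + i*GS]
coalesced = [(w + 0 * GS) * STRIDE for w in range(N)]
# reduce2(): on iteration i, work-item w reads frcs[w*GS + i]
scattered = [(w * GS + 0) * STRIDE for w in range(N)]

print(lines_touched(coalesced))  # 4  -> a few wide transactions serve everyone
print(lines_touched(scattered))  # 32 -> one transaction per work-item
```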


When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.


4 Comments

Thank you for the clear explanation! Is there a straightforward way to remedy this, or do I need to change the algorithm more deeply?
You don't need to change anything; just use the reduce2() method, which exploits the coalesced memory access. That's all. Unless you really need the data output in some particular shape, in which case you may have to change minor stuff elsewhere.
(reduce() outperforms reduce2() -- I think that matches your explanation, but you fell victim to my dastardly naming scheme.) I need to sum along both the rows and the columns. Right now it's faster to run forcesCL() twice, swapping rows and columns, and read out with reduce() both times than to run it once and sum both rows and columns on the resulting buffer. I can do that, but it feels like there must be a better way, since forcesCL is generating a 400-million-element buffer... It won't help to read chunks into local memory, because that will be element by element as well, right?
Sorry, yes, reduce() is the good one; I messed up :). Local memory can help, of course. Implementing a caching mechanism yourself for the reduce2() approach will partially solve the problem.
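The local-memory idea from the last comment can be emulated on the host to check that it computes the right thing. The sketch below (plain NumPy, hypothetical work-group size of 4, tiled_row_sums is my name) is not OpenCL, but it mimics the usual tiling trick: the work-group cooperatively loads a square tile with coalesced reads (consecutive work-items touch consecutive addresses), then each work-item sums its own row out of the tile:

```python
import numpy as np

def tiled_row_sums(F, wg=4):
    """Emulate a work-group of size wg reducing rows of F via local tiles."""
    n, k = F.shape  # assumes n and k are multiples of wg
    out = np.zeros(n, dtype=F.dtype)
    for g0 in range(0, n, wg):            # one work-group per wg rows
        acc = np.zeros(wg, dtype=F.dtype)
        for t0 in range(0, k, wg):        # walk the rows in wg-wide tiles
            tile = np.empty((wg, wg), dtype=F.dtype)
            # cooperative load: in step r, work-item l reads F[g0+r, t0+l],
            # so consecutive work-items hit consecutive addresses (coalesced)
            for r in range(wg):
                for l in range(wg):
                    tile[r, l] = F[g0 + r, t0 + l]
            # barrier(CLK_LOCAL_MEM_FENCE) would go here in a real kernel
            for l in range(wg):           # each work-item sums its own row
                acc[l] += tile[l].sum()
        out[g0:g0 + wg] = acc
    return out

F = np.arange(64, dtype=np.float64).reshape(8, 8)
print(np.allclose(tiled_row_sums(F), F.sum(axis=1)))  # True
```

In a real kernel the tile would live in local memory, with the load loop and the accumulation loop separated by a barrier; the row-wise global reads of reduce2() become coalesced column-wise loads into the tile.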
