
If multiple threads simultaneously write to a single memory location, there will be a race condition, right? That is exactly what is happening in my case.

Consider this snippet from 'reduce.cl':

int i = get_global_id(0);
int n, j;

n = keyMobj[i];                               // n is the key; it is either 0 or 1
for (j = 0; j < 2; j++)
    sumMobj[n*2 + j] += dataMobj[i].dattr[j]; // summing operation

Here the memory locations sumMobj[0] and sumMobj[1] are written by 4 threads simultaneously, and sumMobj[2] and sumMobj[3] are written by 6 threads simultaneously.

Is there any way to keep this parallel, e.g. using a lock or semaphore? This summation is a very big part of my algorithm.

5 Comments
  • These are the definitions of sumMobj and dataMobj: typedef struct data { double dattr[10]; int d_id; int bestCent; } Data; Data *dataMobj; and double *sumMobj = (double *)malloc(sizeof(double) * 2 * 2); Commented Jan 21, 2013 at 9:08
  • @talonmies It is actually a parallel addition problem in an OpenCL kernel. I just don't know a feasible solution. Commented Jan 21, 2013 at 9:13
  • Why don't you use a barrier if you suspect there is a race condition? For example barrier(CLK_LOCAL_MEM_FENCE); Commented Jan 21, 2013 at 9:27
  • @ocluser I have multiple threads simultaneously writing to a single memory location. Would barrier(CLK_LOCAL_MEM_FENCE); be useful in this case? I have not used it before. Commented Jan 21, 2013 at 11:07
  • Found this page, which explains a method for atomically adding floating-point numbers; however, you would need the cl_khr_int64_base_atomics extension and unions of longs and doubles. Commented Jan 23, 2013 at 16:51

1 Answer


I can give you some hints, as I was facing a similar problem.

I can think of three different methods for achieving this goal.

Consider a simple kernel, assuming you launch 4 threads (IDs 0-3):

__kernel void addition (__global int *p)
{
    int i = get_local_id(0);
    p[4] += p[i];        // race: all four work-items write p[4]
}

You want to add the values p[0], p[1], p[2], p[3], p[4] and store the final sum in p[4], right? That is:

p[4] = p[0] + p[1] + p[2] + p[3] + p[4]

Method 1 (no parallelism)

Assign the whole job to a single thread:

int i = get_local_id(0);
if (i == 0)                      // only work-item 0 does the work
{
    for (int k = 0; k < 4; k++)  // sum p[0..3] into p[4]
        p[4] += p[k];
}

Method 2 (with parallelism)

Express your problem as follows:

p[4] = p[0] + p[1] + p[2] + p[3] + p[4] + 0

This is a reduction problem.

So launch 3 threads, i = 0 to i = 2. In the first iteration:

 i=0 finds p[0] + p[1]
 i=1 finds p[2] + p[3]  
 i=2 finds p[4] + 0

Now you have three numbers; apply the same logic again to add them (with suitable padding of zeros to reach a power of two), until a single sum remains.

Method 3 (atomic operations)

If you still need the concurrent writes, you can use atomic_add():

  int atomic_add (volatile __global int *p, int val)

Description

Reads the 32-bit value (referred to as old) stored at the location pointed to by p, computes (old + val), and stores the result at the location pointed to by p. The function returns old.

This assumes the data is of int type. Otherwise, see the link suggested in the comments above.


2 Comments

So if I am using floating-point operations, I have no option other than the reduction method, since there is no extension for floating-point atomic operations (as far as I know). Am I right?
Apart from reduction, you can also see the link posted by @Slicedpan.
