
I made a very naive implementation of the mergesort algorithm, which I adapted to run on CUDA with very minimal changes. The code follows:

//Merge step for mergesort: merges the sorted runs data[l..m] and
//data[m+1..r], using aux as scratch space
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
    int i,j,k;
    //Copy the first subarray
    for(i=m+1;i>l;i--){
        aux[i-1]=data[i-1];
    }
    //Copy the second subarray in reverse order, so the ends of the
    //two runs act as sentinels for each other during the merge
    for(j=m;j<r;j++){
        aux[r+m-j]=data[j+1];
    }
    //Merge: i walks forward from l, j walks backward from r
    for(k=l;k<=r;k++){
        if(aux[j]<aux[i] || i==(m+1))
            data[k]=aux[j--];
        else
            data[k]=aux[i++];
    }
}

//This kernel performs a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    int tn = n / (blockDim.x*gridDim.x);   //elements per thread
    int l = tid * tn;                      //left edge of this thread's chunk
    int r = l + tn;
    //printf("Thread %d:  %d,%d: \n",tid,l,r);
    //Bottom-up mergesort of this thread's chunk
    for(int i{1};i<=(tn/2)+1;i*=2)
        for(int j{l+i};j<(r+1);j+=2*i)
        {
            merge(aux,data,j-i,j-1,j+i-1);
        }
    __syncthreads();
    if(tid==0){
        //Complete the merge: one single thread merges all the sorted chunks
        do{
            for(int i{tn};i<(n+1);i+=2*tn)
                merge(aux,data,i-tn,i-1,i+tn-1);
            tn*=2;
        }while(tn<(n/2)+1);
    }
}

The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always much, much worse than the same code on the CPU running 8 threads (my CPU has hardware support for up to 8 concurrent threads).

For example, sorting 150 million elements takes a few hundred milliseconds on the CPU, but up to 10 minutes on the GPU (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please comment? I strongly suspect the problem is in the final merge operation performed by the first thread: at that point we have a certain number of sorted subarrays (the exact amount depends on the number of threads) which need to be merged, and this is completed by just one thread (one tiny GPU thread).

I think I should use some kind of reduction here, so that each thread performs further merges in parallel and the "Complete the merge" step only merges the last two sorted subarrays.

I'm very new to CUDA.

EDIT (ADDENDUM):

Thanks for the link. I must admit I still need some time to learn CUDA better before taking full advantage of that material. Anyway, I was able to rewrite the sorting function to take advantage of multiple threads for as long as possible; my first implementation had its bottleneck in the last phase of the merge procedure, which was performed by only one thread.

Now, after the first merge pass, each subsequent pass uses up to (1/2)*(n/b) threads, where n is the amount of data to sort and b is the size of the chunk sorted by each thread.

The improvement in performance is surprising: using only 1024 threads, it takes about ~10 seconds to sort 30 million elements. Unfortunately, this is still a poor result! The problem is thread synchronization, but first things first, let's look at the code:

__global__
void basic_merge(int* aux, int* data,int n)
{
    int k = blockIdx.x*blockDim.x + threadIdx.x;
    //Round the per-thread chunk size up to a power of two
    int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
    b = pow( (float)2, b);
    int l=k*b;                 //left edge of this thread's chunk
    int r=min(l+b-1,n-1);      //right edge, clamped to the array
    __syncthreads();
    //Phase 1: bottom-up mergesort of this thread's chunk
    for(int m{1};m<=(r-l);m=2*m)
    {
        for(int i{l};i<=r;i+=2*m)
        {
            merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
        }
    }
    __syncthreads();
    //Phase 2: tree reduction, halving the number of active threads
    //after each pass
    do{
        if(k<=(n/b)*.5)
        {
            l=2*k*b;
            r=min(l+2*b-1,n-1);
            merge(aux,data,l,min(r,l+b-1),r);
        }else break;
        __syncthreads();       //NOTE: only synchronizes within one block!
        b*=2;
    }while((r+1)<n);
}

The function 'merge' is the same as before. The problem now is that I'm using only 1024 threads instead of the 65,000+ I can run on my CUDA device, because __syncthreads does not work as a synchronization primitive at grid level, only at block level!

So I can synchronize at most 1024 threads, the maximum supported per block. Without proper synchronization each thread clobbers the data of the others, and the merging procedure does not work.

To boost performance I need some kind of synchronization between all the threads in the grid. It seems no API exists for this purpose, but I read about a solution that involves multiple kernel launches from the host code, using the host as a barrier for all the threads.

I have a plan for how to implement this technique in my mergesort function; a rough sketch of what I have in mind is below (I will provide the full code in the near future). Do you have any suggestions of your own?
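Here is the sketch (untested; merge_pass, merge_passes and the launch configuration are placeholders of my own, not code I have run). Each kernel launch performs one merge pass over the whole array, so the ordering between launches on the same stream acts as the grid-wide barrier:

//One merge pass: each thread merges one pair of adjacent sorted runs
//of length 'width'
__global__
void merge_pass(int* aux, int* data, int n, int width)
{
    int k = blockIdx.x*blockDim.x + threadIdx.x;
    int l = 2*k*width;                     //left edge of this thread's pair of runs
    if(l >= n) return;                     //no pair left for this thread
    int m = min(l+width-1, n-1);           //end of the first sorted run
    int r = min(l+2*width-1, n-1);         //end of the second sorted run
    if(m < r)
        merge(aux, data, l, m, r);
}

//Host side: one launch per pass; the host is the barrier between passes
void merge_passes(int* d_aux, int* d_data, int n, int b)
{
    for(int width = b; width < n; width *= 2){
        int pairs = (n + 2*width - 1) / (2*width);
        int threads = 256;
        int blocks = (pairs + threads - 1) / threads;
        merge_pass<<<blocks, threads>>>(d_aux, d_data, n, width);
        cudaDeviceSynchronize();
    }
}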

Thanks

2 Comments

  • Some problems are not well-suited for a GPU at all. This would be one. – Commented May 8, 2015 at 12:20
  • @gnasher729 nvlabs.github.io/moderngpu/mergesort.html demonstrates that mergesort can work on a GPU just fine. – Commented May 8, 2015 at 15:22

1 Answer


It looks like all the work is being done in __global__ memory. Each read and write there takes a long time, which makes the kernel slow. It would help to first copy your data into __shared__ memory, do the work there, and then, when the sorting is complete (for that block), copy the results back to global memory.

Global memory access takes about 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory, on the other hand, takes only 1-3 clock cycles to read or write.
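For illustration, here is a minimal sketch of the staging pattern I mean (TILE, sort_tile and the elided local sort are placeholders of mine, assuming one chunk per block that fits in shared memory):

#define TILE 1024   //elements per block; tile + aux must fit in shared memory

__global__
void sort_tile(int* data, int n)
{
    __shared__ int tile[TILE];   //the block's chunk of the input
    __shared__ int aux[TILE];    //scratch space for the merges
    int base = blockIdx.x * TILE;

    //Stage the chunk from slow global memory into fast shared memory
    for(int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        tile[i] = data[base + i];
    __syncthreads();

    //... sort 'tile' here using 'aux' (e.g. your bottom-up merges),
    //with a __syncthreads() between passes ...

    //Write the sorted chunk back to global memory once
    for(int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        data[base + i] = tile[i];
}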

The above would help performance a lot. Some other, more minor things you can try: (1) Remove the first __syncthreads(); it is not really doing anything, because no data is being passed between warps at that point. (2) Move the computation "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass b in instead. It is being recalculated by every thread when it really only needs to be calculated once.
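For example, something along these lines (a sketch under my own assumptions: the launch-configuration names are placeholders and the kernel gains a fourth parameter):

//Host: compute the per-thread chunk size once, rounded up to a power
//of two, and pass it to the kernel instead of recomputing it per thread
void launch_basic_merge(int* d_aux, int* d_data, int n,
                        int blocks, int threadsPerBlock)
{
    int totalThreads = blocks * threadsPerBlock;
    int b = 1;
    while(b < (n + totalThreads - 1) / totalThreads)
        b *= 2;
    //basic_merge would now take b as a fourth parameter
    basic_merge<<<blocks, threadsPerBlock>>>(d_aux, d_data, n, b);
}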

I tried to follow along with your algorithm but was not able to. The variable names were hard to follow... or your code is above my head and I cannot follow it. =) Hope the above helps.


2 Comments

I will try your suggestions, thanks. I'm just wondering how to properly manage the data movement between shared memory (which is about 64KiB) and the much bigger chunk of global memory that belongs to each block. You should also consider that I need memory for the additional array 'aux', since the merge is not performed in place (which makes it even harder to do the job with only the small shared memory), although I will try to implement the in-place merge when I find some free time. I will keep you informed :) Thanks!
Shared memory is 16KB, 32KB or 48KB per multiprocessor and is configurable; I don't think you can get 64KB, so you may need to work without that. GPU programming is complex because of all these things: working in __shared__ memory, coalescing memory accesses, and working with warps. There is a nice 8-20X payoff if done correctly, but it is 5-20 times more work than a quick C++ CPU function.
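(For what it's worth, you request the larger shared-memory split with cudaFuncSetCacheConfig; a one-line sketch, using the placeholder sort_tile kernel from the answer above:)

//Ask for the 48KB-shared / 16KB-L1 configuration before launching
cudaFuncSetCacheConfig(sort_tile, cudaFuncCachePreferShared);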
