Optimizing 2D convolution filter with C++ AMP

Question

I'm fairly new to GPU programming and C++ AMP. Can anyone help me write a general, optimized 2D image convolution filter? My fastest version so far is listed below. Can it be improved with tiling in some way? This version works and is much faster than my CPU implementation, but I hope to make it even faster.

#include <amp.h>
using namespace concurrency;

void FIRFilterCore(array_view<const float, 2> src, array_view<float, 2> dst, array_view<const float, 2> kernel)
{
    int vertRadius = kernel.extent[0] / 2;
    int horzRadius = kernel.extent[1] / 2;

    parallel_for_each(src.extent, [=](index<2> idx) restrict(amp)
    {
        float sum = 0;
        if (idx[0] < vertRadius || idx[1] < horzRadius ||
            idx[0] >= src.extent[0] - vertRadius || idx[1] >= src.extent[1] - horzRadius)
        {
            // Handle borders by duplicating edges
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(direct3d::clamp(idx[0] + dy, 0, src.extent[0] - 1), 0);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {
                    srcIdx[1] = direct3d::clamp(idx[1] + dx, 0, src.extent[1] - 1);
                    sum += src[srcIdx] * kernel[kIdx];
                    kIdx[1]++;
                }
            }
        }
        else // Central part
        {
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(idx[0] + dy, idx[1] - horzRadius);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {                   
                    sum += src[srcIdx] * kernel[kIdx];
                    srcIdx[1]++;
                    kIdx[1]++;
                }
            }
        }
        dst[idx] = sum;
    });
}
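
For checking the GPU output, a plain CPU reference with the same clamp-to-edge border handling can be handy. The following is only a minimal sketch, not the CPU implementation mentioned above: the name ReferenceFilter, the row-major std::vector layout, and the odd-sized-kernel requirement are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical CPU reference for the AMP kernel above (not part of C++ AMP).
// Images and kernels are row-major; borders are handled by duplicating edges.
std::vector<float> ReferenceFilter(const std::vector<float>& src, int rows, int cols,
                                   const std::vector<float>& kernel, int kRows, int kCols)
{
    assert(rows > 0 && cols > 0);
    assert(static_cast<int>(src.size()) == rows * cols);
    assert(static_cast<int>(kernel.size()) == kRows * kCols);
    assert(kRows % 2 == 1 && kCols % 2 == 1); // assumption: odd kernel dimensions

    const int vertRadius = kRows / 2;
    const int horzRadius = kCols / 2;
    std::vector<float> dst(src.size());

    for (int y = 0; y < rows; ++y)
    {
        for (int x = 0; x < cols; ++x)
        {
            float sum = 0.0f;
            for (int dy = -vertRadius; dy <= vertRadius; ++dy)
            {
                // Clamp the source row to the image, like direct3d::clamp does.
                const int sy = std::min(std::max(y + dy, 0), rows - 1);
                for (int dx = -horzRadius; dx <= horzRadius; ++dx)
                {
                    const int sx = std::min(std::max(x + dx, 0), cols - 1);
                    sum += src[sy * cols + sx] *
                           kernel[(vertRadius + dy) * kCols + (horzRadius + dx)];
                }
            }
            dst[y * cols + x] = sum;
        }
    }
    return dst;
}
```

Comparing dst from FIRFilterCore against this result (with a small tolerance for floating-point ordering differences) gives a quick correctness check before trying tiled variants.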

Another approach would of course be to perform the convolution in the Fourier domain, but I'm not sure it would perform well as long as the filter is fairly small compared to the image (whose side lengths are not powers of 2, by the way).

Community · Accepted Answer · 2017-05-23 12:20:19Z

You can find a complete implementation of the Cartoonizer algorithm, which implements a couple of stencil-based algorithms, on CodePlex: http://ampbook.codeplex.com/

This includes several different implementations. The tradeoffs associated with them are discussed in the book that the samples were written for.

For the minimum frame processor settings (1 simplifier phase and a border width of 1), there is insufficient shared memory access to take advantage of tiled memory. This is clearly shown by comparing the times taken by the cartoonizing stage for the C++ AMP simple model (4.9 ms) and the tiled model (4.2 ms) running on a single GPU. You would expect the tiled implementation to execute more quickly, but it's comparable. For the default and maximum frame processor settings, tiled memory becomes more beneficial and the tiled model processors execute faster than the simple model ones.
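
A rough count of global-memory reads suggests why tiling pays off more as the filter grows. The sketch below is a simplified model and is not taken from the book or its samples: it assumes square tiles, a square filter of the given radius, and a tiled kernel that stages the tile plus its halo in tile_static once. The function names are made up for illustration, and hardware caching is ignored.

```cpp
#include <cassert>

// Reads from global memory per tile when every thread fetches its whole
// (2 * radius + 1)^2 neighbourhood directly.
long long UntiledReadsPerTile(int tileSize, int radius)
{
    assert(tileSize > 0 && radius >= 0);
    const long long window = 2LL * radius + 1;
    return static_cast<long long>(tileSize) * tileSize * window * window;
}

// Reads from global memory per tile when the tile and its halo are loaded
// once into tile_static and all further reads hit that staged copy.
long long TiledReadsPerTile(int tileSize, int radius)
{
    assert(tileSize > 0 && radius >= 0);
    const long long padded = tileSize + 2LL * radius;
    return padded * padded;
}

// How many times each staged value is reused, on average, under this model.
double ReuseFactor(int tileSize, int radius)
{
    return static_cast<double>(UntiledReadsPerTile(tileSize, radius)) /
           static_cast<double>(TiledReadsPerTile(tileSize, radius));
}
```

Under these assumptions a 16x16 tile with radius 1 reuses each staged value about 7 times, while radius 4 gives a factor of 36. Real GPUs cache global reads, so the measured gap for small filters is much narrower than this model implies, which is in line with the timings quoted above.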

There was a similar question here:

Several arithmetic operations pararellized in C++Amp

I posted some code there which shows a filter with a variable size.

edited May 23, 2017 at 12:20

CommunityBot

answered Sep 4, 2013 at 8:02

Ade Miller

3 Comments

logicnet.dk Over a year ago

Thanks for the hint, but the sample only applies 3x3 convolution filters. Also, as I read the performance table on page 253 of the book, the tiling doesn't really give any speedup in the average case on a single GPU.

Ade Miller Over a year ago

The conclusion that the book draws seems correct. You should see more benefit from tile_static memory as the size of your filter increases. The larger the filter (within reason) the more use you will be able to make of the memory loaded into the faster tile_static.

Ade Miller Over a year ago

Updated answer with variable size filter reference.
