I'm fairly new to GPU programming and C++ AMP. Can anyone help me write a general, optimized 2D image convolution filter? My fastest version so far is listed below. Can it be improved with tiling in some way? This version works and is much faster than my CPU implementation, but I hope to make it even faster.

#include <amp.h>
using namespace concurrency;

void FIRFilterCore(array_view<const float, 2> src, array_view<float, 2> dst, array_view<const float, 2> kernel)
{
    int vertRadius = kernel.extent[0] / 2;
    int horzRadius = kernel.extent[1] / 2;

    parallel_for_each(src.extent, [=](index<2> idx) restrict(amp)
    {
        float sum = 0;
        if (idx[0] < vertRadius || idx[1] < horzRadius ||
            idx[0] >= src.extent[0] - vertRadius || idx[1] >= src.extent[1] - horzRadius)
        {
            // Handle borders by duplicating edges
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(direct3d::clamp(idx[0] + dy, 0, src.extent[0] - 1), 0);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {
                    srcIdx[1] = direct3d::clamp(idx[1] + dx, 0, src.extent[1] - 1);
                    sum += src[srcIdx] * kernel[kIdx];
                    kIdx[1]++;
                }
            }
        }
        else // Central part
        {
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(idx[0] + dy, idx[1] - horzRadius);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {
                    sum += src[srcIdx] * kernel[kIdx];
                    srcIdx[1]++;
                    kIdx[1]++;
                }
            }
        }
        dst[idx] = sum;
    });
}

Another approach would of course be to perform the convolution in the Fourier domain, but I'm not sure it would perform well as long as the filter is fairly small compared to the image (whose side lengths are not powers of 2, by the way).

1 Answer

You can find a complete implementation of the Cartoonizer algorithm, which implements a couple of stencil-based algorithms, on CodePlex: http://ampbook.codeplex.com/

This includes several different implementations. The tradeoffs associated with them are discussed in the book that the samples were written for.

For the minimum frame processor settings (1 simplifier phase and a border width of 1), there is insufficient shared memory access to take advantage of tiled memory. This is clearly shown by comparing the times taken by the cartoonizing stage for the C++ AMP simple model (4.9 ms) and the tiled model (4.2 ms) running on a single GPU. You would expect the tiled implementation to execute more quickly, but it's comparable. For the default and maximum frame processor settings, tiled memory becomes more beneficial and the tiled model processors execute faster than the simple model ones.
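To get a feel for why tiling pays off more as the filter grows, here is a rough back-of-the-envelope model in plain C++. It only counts global-memory reads per output pixel and ignores hardware caching, so treat it as a sketch rather than a performance prediction. The function names and the 16x16 tile size used below are my own assumptions, not taken from the book or its samples.

```cpp
#include <cassert>

// Untiled kernel: every thread reads its full (2*vr+1) x (2*hr+1)
// neighbourhood from global memory.
double readsPerPixelSimple(int vertRadius, int horzRadius)
{
    assert(vertRadius >= 0 && horzRadius >= 0);
    return double(2 * vertRadius + 1) * double(2 * horzRadius + 1);
}

// Tiled kernel: a tileH x tileW tile loads its own pixels plus a halo of
// vertRadius / horzRadius pixels on each side into tile_static memory once,
// and all further reads come from that fast memory.
double readsPerPixelTiled(int tileH, int tileW, int vertRadius, int horzRadius)
{
    assert(tileH > 0 && tileW > 0);
    assert(vertRadius >= 0 && horzRadius >= 0);
    double loaded = double(tileH + 2 * vertRadius) * double(tileW + 2 * horzRadius);
    return loaded / (double(tileH) * double(tileW));
}
```

Under this model, a 16x16 tile with a 3x3 filter needs 9 global reads per pixel untiled versus about 1.27 tiled, while a 7x7 filter needs 49 versus about 1.89. The potential saving therefore grows from roughly 7x to roughly 26x, which matches the trend of tiling helping more with larger filters.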

There was a similar question here:

Several arithmetic operations pararellized in C++Amp

I posted some code there which shows a filter with a variable size.
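For reference, the tile-plus-halo pattern for a variable-size filter can be sketched on the CPU in plain C++. This is not the code from the linked question and it is not C++ AMP; the `Image` struct, `clampInt`, and both function names are hypothetical stand-ins chosen for illustration. In a real C++ AMP kernel the `local` buffer would be a `tile_static` array, the tile loops would become a `parallel_for_each` over a tiled extent, and a tile barrier would separate the load phase from the compute phase. Like the question's code, it assumes odd filter dimensions and handles borders by duplicating edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major float image; a CPU stand-in for array_view<float, 2>.
struct Image {
    int rows, cols;
    std::vector<float> data;
    Image(int r, int c) : rows(r), cols(c), data(std::size_t(r) * c, 0.0f) {}
    float& at(int y, int x) { return data[std::size_t(y) * cols + x]; }
    float at(int y, int x) const { return data[std::size_t(y) * cols + x]; }
};

static int clampInt(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

// Reference version: every output pixel reads its neighbourhood straight from src.
Image convolveSimple(const Image& src, const Image& kernel)
{
    assert(kernel.rows % 2 == 1 && kernel.cols % 2 == 1);
    const int vr = kernel.rows / 2, hr = kernel.cols / 2;
    Image dst(src.rows, src.cols);
    for (int y = 0; y < src.rows; ++y)
        for (int x = 0; x < src.cols; ++x) {
            float sum = 0.0f;
            for (int dy = -vr; dy <= vr; ++dy)
                for (int dx = -hr; dx <= hr; ++dx)
                    sum += src.at(clampInt(y + dy, 0, src.rows - 1),
                                  clampInt(x + dx, 0, src.cols - 1))
                         * kernel.at(vr + dy, hr + dx);
            dst.at(y, x) = sum;
        }
    return dst;
}

// Tiled version: copy each tile plus its halo into a small local buffer once,
// then convolve from that buffer only.
Image convolveTiled(const Image& src, const Image& kernel, int tileSize)
{
    assert(tileSize > 0);
    assert(kernel.rows % 2 == 1 && kernel.cols % 2 == 1);
    const int vr = kernel.rows / 2, hr = kernel.cols / 2;
    Image dst(src.rows, src.cols);
    Image local(tileSize + 2 * vr, tileSize + 2 * hr);
    for (int ty = 0; ty < src.rows; ty += tileSize)
        for (int tx = 0; tx < src.cols; tx += tileSize) {
            // Load phase: clamping here takes care of the image borders.
            for (int ly = 0; ly < local.rows; ++ly)
                for (int lx = 0; lx < local.cols; ++lx)
                    local.at(ly, lx) = src.at(clampInt(ty + ly - vr, 0, src.rows - 1),
                                              clampInt(tx + lx - hr, 0, src.cols - 1));
            // Compute phase: no border branch, all reads hit the local buffer.
            const int yEnd = std::min(tileSize, src.rows - ty);
            const int xEnd = std::min(tileSize, src.cols - tx);
            for (int y = 0; y < yEnd; ++y)
                for (int x = 0; x < xEnd; ++x) {
                    float sum = 0.0f;
                    for (int ky = 0; ky < kernel.rows; ++ky)
                        for (int kx = 0; kx < kernel.cols; ++kx)
                            sum += local.at(y + ky, x + kx) * kernel.at(ky, kx);
                    dst.at(ty + y, tx + x) = sum;
                }
        }
    return dst;
}
```

One side effect of this layout is that the border handling moves entirely into the load phase, so the inner loops need no separate border and central branches. In C++ AMP the tile dimensions have to be compile-time constants, so the filter radius would also need a compile-time upper bound to size the `tile_static` array.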

3 Comments

Thanks for the hint, but the sample only applies 3x3 convolution filters. Also, as I read the performance table on page 253 of the book, the tiling doesn't really give any speedup in the average case on a single GPU.
The conclusion that the book draws seems correct. You should see more benefit from tile_static memory as the size of your filter increases. The larger the filter (within reason) the more use you will be able to make of the memory loaded into the faster tile_static.
Updated answer with variable size filter reference.
