
I am working on implementing image convolution in C++, and I already have naive working code based on the following pseudocode:

for each image row in input image:
   for each pixel in image row:

      set accumulator to zero

      for each kernel row in kernel:
         for each element in kernel row:

            if element position corresponds to pixel position then
               multiply element value by corresponding pixel value
               add result to accumulator
            endif

      set output image pixel to accumulator

As this can be a big bottleneck with big images and kernels, I was wondering whether there are other approaches to make things faster, possibly exploiting additional input information such as a sparse image or kernel, an already-known kernel, etc.

I know this can be parallelized, but it's not doable in my case.

  • Some kernels are linearly separable (Google this). This provides a huge speedup (order of magnitude). Commented Mar 2, 2018 at 3:42
  • Optimization without knowing the platform you want to run on is not possible: how do we know what it is capable of, or what is faster or slower there, when you did not specify it at all? When parallelization is not possible, there are other means like SIMD, auto-generated hard-coded convolution, reuse of sub-results, and possibly a lot of other techniques. Commented Mar 2, 2018 at 8:20
  • @Spektre it's because I am talking about the algorithm, that is to say a better way/approach to code this. Commented Mar 2, 2018 at 17:06
  • Take a look at this and this paper Commented Mar 31, 2018 at 11:41

3 Answers

if element position corresponds to pixel position then

I presume this test is meant to avoid a multiplication by 0. Skip the test! Multiplying by 0 is way faster than the delays caused by a conditional jump.

The other alternative (and it's always better to post actual code rather than pseudocode; here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive also. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:

for (int row = 0; row < k/2; ++row) {
   // inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (int row = k/2; row < nrows - k/2; ++row) {
   // inner loop over kernel rows is unrestricted
}
for (int row = nrows - k/2; row < nrows; ++row) {
   // inner loop over kernel rows is adjusted
}

Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.

To avoid the code repetition you can create a larger image, copy the image data over, and pad it with zeros on all sides. The loops then do not need to worry about out-of-bounds accesses, and you get much simpler code.


Next, a certain class of kernels can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,2,1] and [1,0,-1]^T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for an NxN kernel, you go from N² to 2N operations per pixel.

In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.

Besides the obvious savings in computational cost, the code for these 1D convolutions is also much simpler. The 9 repeated blocks of code we had earlier become 3 for a 1D filter, and the same code for the horizontal filter can be re-used for the vertical one.


Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result: three transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and on whether it is separable.


3 Comments

Excellent and thorough answer. I am very skeptical about the FFT solution for kernels of small and moderate size.
@YvesDaoust: I agree. You need to time it to see if it's faster. And then you need to decide what boundary conditions you need, the periodic assumption of the DFT is not always useful.
Thanks for this very interesting and complete answer, exactly the kind of tip I was looking for !

For small kernel sizes the simple method might be faster. Also note that separable kernels (for example, the Gaussian kernel is separable), as mentioned, allow filtering by rows and then by columns, resulting in O(N² * M) complexity.

For other cases there exists a fast convolution based on the FFT (Fast Fourier Transform). Its complexity is O(N² log N) (where N is the size of the image), compared to O(N² * M²) for the naive implementation (where M is the size of the kernel).

Of course, there are some peculiarities in applying these techniques, for example edge effects, but one needs to account for those in a naive implementation too (to a lesser degree, though).

 FI = FFT(Image)
 FK = FFT(Kernel)
 Prod = FI * FK (element-by-element complex multiplication)
 Conv(I, K) = InverseFFT(Prod)

Note that you can use a fast library intended for image filtering; for example, OpenCV can apply a kernel to a 1024x1024 image in 5-30 milliseconds.

7 Comments

For small kernels, M² beats log N, not counting that the hidden constant is much better for the naive implementation.
@Yves Daoust OK, I added this info to the answer (though I think the author understands that there is no silver bullet for all cases)
Thanks for this answer.
That PDF link is unreadable (possibly some weird font). How do you handle different resolutions of N and M (how do you piecewise multiply them when they do not have the same resolution)? Is a non-power-of-2 resolution a problem (and how do you handle it if zero padding changes the results)? BTW, here How to compute Discrete Fourier Transform? are some links with my C++ implementations of DFT/DFFT/DCT/DFCT
@Spektre Official site (perhaps pdf is the same): nrbook.com/a/bookcpdf.php All chapters in a single pdf: www2.units.it/ipl/students_area/imm2/files/…

One way to speed this up might be, depending on the target platform, to take every distinct value in the kernel, store in memory one copy of the image per distinct kernel value, multiply each copy by its kernel value, and then at the end shift, sum, and combine all the image copies into one output image. This could be done on a graphics processor, for example, where memory is ample and which is better suited to this tight, repetitive processing. The image copies will need to support pixel overflow, or you could use floating-point values.

Comments
