2

First off, I know very little about multithreading and I am having troubles finding how the best way to optimize this code, but multithreading seems the path I should be on.

double
applyFilter(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{
    long long cycStart, cycStop;

    cycStart = rdtscll();

    output -> width = input -> width;
    output -> height = input -> height;

    int temp1 = output -> width;
    int temp2 = output -> height;

    int width=temp1-1;
    int height=temp2 -1;
    int getDivisorVar= filter -> getDivisor();  
    int t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;

    int keep0= filter -> get(0,0);
    int keep1= filter -> get(1,0);
    int keep2= filter -> get(2,0);
    int keep3= filter -> get(0,1);
    int keep4= filter -> get(1,1);
    int keep5= filter -> get(2,1);
    int keep6= filter -> get(0,2);
    int keep7= filter -> get(1,2);
    int keep8= filter -> get(2,2);


    //Declare variables before the loop
    int plane, row, col;    

    for (plane=0; plane < 3; plane++) {
        for(row=1; row < height ; row++) {
            for (col=1; col < width; col++) {

                t0 = (input -> color[plane][row - 1][col - 1]) * keep0;
                t1 = (input -> color[plane][row][col - 1]) * keep1;
                t2 = (input -> color[plane][row + 1][col - 1]) * keep2;
                t3 = (input -> color[plane][row - 1][col]) * keep3;
                t4 = (input -> color[plane][row][col]) * keep4;
                t5 = (input -> color[plane][row + 1][col]) * keep5;
                t6 = (input -> color[plane][row - 1][col + 1]) * keep6;
                t7 = (input -> color[plane][row][col + 1]) * keep7;
                t8 = (input -> color[plane][row + 1][col + 1]) * keep8;

                // NEW LINE HERE

                t9 = t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
                t9 = t9 / getDivisorVar;

                if ( t9 < 0 ) {
                    t9 = 0;
                }

                if ( t9  > 255 ) {
                    t9 = 255;
                } 

                output -> color[plane][row][col] = t9;
            } ....

All of this code most likely isn't necessary, but it does provide some context. So because the first of the 3 "for" loop only goes from 0-2 I was hoping there was a way I could thread the bottom two "for" loops to all be running at the same time for a different "plane" value. Is this even possible? And if so, would it actually make my program faster?

3
  • Thats what I was thinking after briefly looking at multithreading, but I thought it might be possible as long as t0-t9 were all somehow local to the thread? Because all other variables are independent of the loops. Commented Mar 28, 2013 at 19:20
  • Make sure everything used in the thread is thread-local so you don't need to worry about the threads stepping on each other, except for the input and output arrays. For those, just ensure programmatically that you'll never read/write the same cells from two different threads, that way you don't need synchronization. Commented Mar 28, 2013 at 19:25
  • I've just tried that out using 4 planes and 2560x1600 image. For a single thread it took 109 ms, 4 threads did it in 47 ms. But as this procedurę is very simple, overhead of creating and waiting for threads was actually significant enough, but it still more than halved time needed for the calculations. More complex loops would definitely benefit more from threading. As I said, no synchronisation whatsoever (apart from waiting for threads to finish) was needed in this example. Commented Mar 28, 2013 at 20:15

5 Answers 5

3

I would also look into OpenMP. It is a great library that allows for threading in a VERY simple manner using pragmas. OpenMP is compileable on many platforms, you just have to make sure yours supports it!

I have a set of code that had 8 levels of for loops, and it threaded it very nicely.

Sign up to request clarification or add additional context in comments.

1 Comment

Once I finally got OpenMP to work the way I wanted to in my code, it ran so much faster! Thanks for the suggestion.
1

Yes, it's perfectly possible. In this case, you should event get away without worrying about access synchronisation (ie race conditions), as both threads would be operating on different sets of data.

This would definitely speed up your code on a multicore machine.

You might want to have a look at std::thread (if you're ok with c++ 11) for cross platform threading implementation (since you haven't specified your target platform). Or better with threading support library

You may also think about detecting number of cores and launch appropriate number of threads, as in threadcount = min(planes, cores) and provide each worker function with access to single plane's set of data.

1 Comment

Sorry for not specifying. This will solely be running on a Ubuntu machine, and the I know in advance how many cores I have to work with, as this is a homework assignment so we are able to design our code specifically around those machines that we will be graded on.
0

It certainly looks like you could break this into thread and you would probably see a good speed increase. However, your compiler will already be trying to unroll the loop for you and gain parallelism by vectorizing instructions. Your gains may not be as much as you suspect, especially if you are saturating the memory bus with reads from disparate locations.

What you might consider is, if this is a 2d graphics operation, try and use OpenGL or similar as it will leverage hardware on your system, and that has some parallelism built into it.

2 Comments

The only thing this code is meant to do is apply a 3x3 matrix filter to a 2d image and throw the altered image back out, so is OpenGL something worth looking into for this?
I certainly would recommend it if you can express what you're doing in terms of matrix multiplication. I believe you want to look at the Shading Library. It might be more effort than you can invest now, but it is certainly a powerful resource.
0

Threaded version of the code will be slower than simple implementation. Because in the threaded version there will be much time spend on synchronisation. Also in threaded version you will have cache performance drawback.

Also it is high probability, that outer for loop with 3 passes will be unrolled by complier and will be executed in parallel.

You can try to make threaded version and compare performance. Anyway it will be usefull experience.

5 Comments

I think that in this particular example he can get away without synchronisation very easily.
I dont necessarily agree that it will be executed in parallel for the 3 pass for loop. I have recently had a similar set of code with also a very small loop count (less than the amount of capable threads in the system) and it didnt simply compile to a multi-threaded version, I had to explicetly do that myself.
@W.B. He will need synchronisation primitive that will guarantee that all threads ended data processing, and result can be used.
@inkooboo that's for sure but that should hardly amount to a substantial overhead, imo.
@trumpetlicks Yes, may be this loop will be not unrolled, but anyway simple version will be faster
0

For a situation like this you could do worse than use a compiler that automatically turns for loops into threads.

With code like that the compiler can determine whether or not there is any inter-iteration data dependency. If not then it knows that it can safely split the for loop(s) across multiple threads, putting bog standard thread syncing at the end. Normally such a compiler is able to insert code that determines at run time whether the overhead of having the threads is going to be outweighed by the benefits.

The only thing is, do you have a compiler that does it? If so then its by far the easiest way to get the benefits of threads for straightforward, almost overt parallelism like this.

I know that Sun's C compiler does it (I think they were one of the earliest to do this. It might be on only the Solaris version of their compiler). I think that Intel's compiler can too. I have doubts about GCC (though I'd be very happy to be corrected on that point), and I'm not too sure about Microsoft's compiler.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.