I define an algorithm to be best if it minimizes the total run time on commodity hardware (e.g. ordinary desktop and server PCs).
I have sets A and B, and a function f invoked as f(a, b, n), where a is an item in A, b is an item in B, and n is a natural number. f(a, b, n) returns a positive real number.
The following properties hold:
- f(a,b,n) has a randomized component, so invoking it twice with the same arguments will likely give two different answers (close to each other, but nonetheless different). Larger values of n only reduce the estimation error of f.
- f(a,b,n) takes twice as long to compute as f(a,b,n/2).
- f(a,b,n) = (f(a,b,n/2) + f(a,b,n/2))/2, where the two calls on the right are independent invocations; that is, averaging two independent estimates at n/2 is statistically equivalent to one estimate at n (see the sketch after this list).
- Computing f(a_1,b_1,n) is independent of computing f(a_2,b_2,n), where a_1, a_2 are distinct elements of A and b_1, b_2 are distinct elements of B.
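For concreteness, here is a minimal sketch of what such an f could look like. The body is entirely hypothetical (a Monte Carlo estimator that averages n noisy samples, assuming a and b are numbers), but it exhibits all four properties above:

    import random

    # Hypothetical stand-in for f: averages n noisy samples of a
    # quantity that depends on (a, b). The output is random, the cost
    # is linear in n, averaging two calls at n/2 matches one call at n,
    # and calls for different (a, b) pairs are independent.
    def f(a, b, n):
        total = 0.0
        for _ in range(n):
            total += (a - b) ** 2 + random.random()  # one positive noisy sample
        return total / n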
My goal is to compute the value of answer as follows:
    def compute_answer(A, B, n):
        count = 0
        output = 0
        for a in A:
            for b in B:
                output += f(a, b, n)
                count += 1
        answer = output / count
        return answer
But the code above is entirely serial. What I wish to do is parallelize it so as to minimize the total number of seconds needed to obtain answer.
Just to reiterate: I am running this application on a single computer, which is why I am considering a multi-threaded approach. I do not wish to distribute the work over multiple machines; all I want is to run it as fast as possible across the many cores of that single computer.
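A minimal sketch of one way to parallelize this, assuming f is CPU-bound pure Python: CPython's global interpreter lock prevents plain threads from speeding up CPU-bound work, so the sketch uses a process pool; if f releases the GIL internally (e.g. it spends most of its time in NumPy or C code), swapping in ThreadPoolExecutor should behave the same. Each (a, b) pair becomes one independent task, which the independence property above allows. The names one_pair and compute_answer_parallel are my own.

    import itertools
    from concurrent.futures import ProcessPoolExecutor

    def one_pair(args):
        # Helper so each task is a single picklable argument.
        a, b, n = args
        return f(a, b, n)

    def compute_answer_parallel(A, B, n):
        # One independent task per (a, b) pair, spread across all cores.
        tasks = [(a, b, n) for a, b in itertools.product(A, B)]
        with ProcessPoolExecutor() as pool:  # defaults to one worker per core
            results = list(pool.map(one_pair, tasks, chunksize=8))
        return sum(results) / len(results)

On Windows and macOS the call to compute_answer_parallel must sit under an if __name__ == "__main__": guard, since worker processes re-import the module. And if |A| x |B| is smaller than the number of cores, the averaging property gives extra parallelism for free: replace each task f(a, b, n) with k independent tasks f(a, b, n/k) and average their results.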