
I'm learning to use threads in C++.
I created a very long vector of integers and a separate integer x, and I want to calculate the difference between x and each integer in the vector.

However, in my implementation the function using two threads is slower than the single-threaded function. I wonder what the reason is, and how I can implement threading correctly so that it actually runs faster.

Here's the code:

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    vector<int> temp1(input.begin(), input.begin() + mid);
    vector<int> temp2(input.begin()+mid, input.end());

    auto future1 = async(dist_calculation, center, temp1, 0, mid);
    auto future2 = async(dist_calculation, center, temp2, 0, mid);

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    multi_dist_calculation(0, v1);
    //dist_calculation(50, v1, 0, v1.size());

    return 0;
}



Update #1

Added the suggestions of std::launch::async and reserve(), and it does make the code faster. But the two-threaded function is still slower than the single-threaded one. Can I say that for this kind of calculation single-threaded is faster?

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    temp.reserve(size);
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    temp.reserve(end - start);
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    auto future1 = async(std::launch::async, dist_calculation, center, input,   0, mid);
    auto future2 = async(std::launch::async, dist_calculation, center, input, mid, input.size());

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    int center = 0;
    multi_dist_calculation(center, v1);
    //dist_calculation(center, v1, 0, v1.size());

    return 0;
}
  • You don't set launch policy in your async calls. Isn't that UB? Try using async(std::launch::async, dist_calculation, center, temp1, 0, mid);. Commented Apr 18, 2019 at 9:54
  • How did you measure? How many cores does your hardware provide? Commented Apr 18, 2019 at 9:55
  • Could be of interest about why multithread can be slower than single thread: stackoverflow.com/a/35525942/3723423 Commented Apr 18, 2019 at 10:01
  • Currently you have asynchronous calculations that may or may not be multithreaded. Also this code suffers from huge overhead caused by reallocations. Commented Apr 18, 2019 at 10:02
  • Can you give numbers? If the execution of the threads is very short this may give poor results. Commented Apr 18, 2019 at 10:17

2 Answers


You did not pass any std::launch policy to std::async, so the implementation is left a lot of freedom:

Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
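
To actually force a separate thread, and to avoid decay-copying the big vector into the call, you can request std::launch::async explicitly and pass the vector through std::ref. A minimal sketch of the question's multi_dist_calculation under those assumptions:

#include <functional>  // std::ref
#include <future>
#include <vector>

// Sketch: explicit launch policy, input passed by reference (no copies).
// Assumes the dist_calculation(int, std::vector<int>&, int, int) from the
// question is visible.
void multi_dist_calculation(int center, std::vector<int>& input) {
    int mid = static_cast<int>(input.size() / 2);

    // std::launch::async forces a new thread; std::ref prevents std::async
    // from decay-copying the vector into its shared state.
    auto future1 = std::async(std::launch::async, dist_calculation,
                              center, std::ref(input), 0, mid);
    auto future2 = std::async(std::launch::async, dist_calculation,
                              center, std::ref(input), mid,
                              static_cast<int>(input.size()));

    std::vector<int> result1 = future1.get();
    std::vector<int> result2 = future2.get();
}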

But also be aware that, more generally, using more threads, especially on small tasks, may not be faster (a timing sketch follows the list below).

  • When dist_calculation, or any task you want to thread, is a small amount of work, be aware of the overheads. Creating a new thread has a relatively high cost, and there is also overhead in whatever internal pool std::async uses, in promises, and in futures.
  • Additionally, as written, you are likely creating extra vectors and doing extra dynamic memory allocation, and you will need to merge the results, which also has a cost.
  • In more complex cases, if synchronization (e.g. with std::mutex) is involved, it may cost more performance than the additional threads gain.
  • In some cases the bottleneck will not be the CPU at all. It may instead be disk/storage speed (including the page/swap file), network speed (including remote servers), or even memory bandwidth (NUMA-aware optimizations aside; they are a lot more complex than just using std::async). Multithreading in those cases just adds overhead with no benefit.
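
To see which version actually wins on your hardware, time both explicitly, e.g. with std::chrono. A minimal sketch, assuming the vector_generator, dist_calculation and multi_dist_calculation from the question are visible, and using a smaller input than the question's so it fits comfortably in memory:

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v1 = vector_generator(100000000);  // smaller test size

    auto t0 = std::chrono::steady_clock::now();
    dist_calculation(0, v1, 0, static_cast<int>(v1.size()));
    auto t1 = std::chrono::steady_clock::now();
    multi_dist_calculation(0, v1);
    auto t2 = std::chrono::steady_clock::now();

    // Run each variant a few times and ignore the first run; the initial
    // pass is dominated by page faults on the freshly allocated memory.
    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "single thread: " << ms(t1 - t0).count() << " ms\n"
              << "two threads:   " << ms(t2 - t1).count() << " ms\n";
}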

You should first make use of other basic optimisations where possible, such as reserving the size of the vectors to avoid unneeded allocations and copies, or maybe resizing them and using vector[index] = value instead of push_back, etc.
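
For example, a sketch of dist_calculation with the output sized up front and written by index (same behaviour, just no incremental growth or reallocation; the input is taken by const reference here):

#include <cstdlib>  // std::abs(int)
#include <vector>

// Sketch: size the output once, then store by index instead of push_back.
std::vector<int> dist_calculation(int center, const std::vector<int>& input,
                                  int start, int end) {
    std::vector<int> out(end - start);
    for (int i = start; i < end; ++i) {
        out[i - start] = std::abs(center - input[i]);
    }
    return out;
}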

For something as simple as abs(center - input[i]) you might get a lot more improvement from SIMD (single instruction, multiple data) optimisations: make sure you are compiling with optimizations such as SSE2 enabled, and if the compiler doesn't optimise the loop appropriately (the push_back may interfere; test!), alter it slightly so that it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
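
If the compiler will not auto-vectorize the loop, an explicit version might look roughly like the sketch below. It assumes SSSE3 is available (for _mm_abs_epi32, e.g. compile with -mssse3) and that out points to a buffer of at least n ints:

#include <immintrin.h>  // SSE2/SSSE3 intrinsics
#include <cstdlib>      // std::abs(int)

// Sketch: compute |center - input[i]| four ints at a time with SSE.
void dist_calculation_sse(int center, const int* input, int* out, int n) {
    const __m128i c = _mm_set1_epi32(center);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + i));
        __m128i d = _mm_abs_epi32(_mm_sub_epi32(c, v));  // |center - input[i]|
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), d);
    }
    for (; i < n; ++i) {  // scalar tail for the last few elements
        out[i] = std::abs(center - input[i]);
    }
}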


7 Comments

Thanks for the reply. I will definitely look into it. BTW, I was using std::thread instead of std::async, and it produced the same result, i.e. single-threaded was faster. Was that due to the same reason?
@User800222, async is an abstraction built on top of std::thread. I am not surprised the two implementations behave similarly.
I used std::thread & std::ref() to get the calculated result, but I saw someone on Stack Overflow saying that's a bad way to do it. Then I looked it up further and found that people recommend using std::async() if you want threads to return values.
@User800222 That's not a bad way to do it. Everything in the language is there for you to utilize. Your execution time is probably dominated by all the vector copying, slicing and resizing (note that heap allocators often use locks to serialize execution; you have an implicit lock in your code on push_back because you didn't reserve). Here's an idea: for the given input, allocate a result vector of the same size in multi_dist_calculation. Then pass both (by reference) to separate threads together with start/end boundaries and update the result.
I've added the suggestions, and overall it does make the code faster, but the single-threaded function is still faster. Can I say for sure that single-threaded is better for this kind of calculation? BTW, adding std::launch::async doesn't seem to do much.

If I understand this correctly, your singlethreaded version simply calls dist_calculation on the given vector, which runs through the vector once and produces a new vector that it returns. Your multithreaded version, on the other hand, first copies each half of the input data into two separate vectors. After that, it launches dist_calculation for each half, potentially in two threads. std::async may not even run on a separate thread because you didn't specify a launch policy. If it does happen to run on a separate thread, the vectors you pass will be copied once more, because of the way the arguments are stored (as with the std::thread constructor). You'd have to use, e.g., an std::reference_wrapper to properly pass a reference.

Think about what's going on there. Copying the data into those two vectors means that you're already running through the entire data once before you even get to the point where you'd launch some threads to perform the actual computation, if threads are even going to be used to begin with. And if they are, you're copying the data a second time. Both times, the copy happens in a single thread. So just to get to the point where your actual multithreaded computation starts, if it does actually happen using multiple threads, your "multithreaded" version has basically had to do three times the work that your singlethreaded version does in total. And it had to do all of that in a single thread…

Your dist_calculation does not modify the original data. Furthermore, the individual threads are going to access completely separate parts of the data. Thus, there is no need to ever copy the data at all. Simply give each thread a pointer/iterator to the part of the original data it should process. Also, you know beforehand how many elements are going to be produced. So rather than continuously growing two separate output vectors, you could just preallocate a vector to receive the output and give each thread yet another pointer to the subrange of the output vector it should write to.
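
A sketch of that idea, using a hypothetical helper dist_range: the output is allocated once, each thread works on a disjoint [start, end) slice of the shared input and result, and nothing is copied:

#include <cstdlib>     // std::abs(int)
#include <functional>  // std::ref, std::cref
#include <thread>
#include <vector>

// Hypothetical helper: fill result[start, end) from input[start, end).
void dist_range(int center, const std::vector<int>& input,
                std::vector<int>& result, std::size_t start, std::size_t end) {
    for (std::size_t i = start; i < end; ++i) {
        result[i] = std::abs(center - input[i]);
    }
}

std::vector<int> multi_dist_calculation(int center, const std::vector<int>& input) {
    std::vector<int> result(input.size());  // preallocated output, written in place
    const std::size_t mid = input.size() / 2;

    // One extra thread handles the first half; the calling thread does the
    // second half, so both cores stay busy and no data is copied.
    std::thread t(dist_range, center, std::cref(input), std::ref(result),
                  std::size_t{0}, mid);
    dist_range(center, input, result, mid, input.size());
    t.join();

    return result;
}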

And finally, starting with C++17, you could just use the parallel versions of std::transform with std::execution::par_unseq and get yourself an automagically parallelized (potentially even vectorized) solution. For example:

#include <algorithm>   // std::transform
#include <cstdlib>     // std::abs(int)
#include <execution>   // std::execution policies
#include <memory>      // std::unique_ptr
#include <numeric>     // std::iota

template <typename ExecutionPolicy, typename ForwardIterator, typename OutputIterator>
void multi_dist_calculation(OutputIterator dest, ForwardIterator begin, ForwardIterator end, int center, ExecutionPolicy&& execution_policy)
{
    std::transform(std::forward<ExecutionPolicy>(execution_policy), begin, end, dest, [center](int x)
    {
        return std::abs(center - x);
    });
}

int main()
{
    constexpr int data_size = 1000000000;
    auto data = std::unique_ptr<int[]> { new int[data_size] };
    std::iota(&data[0], &data[0] + data_size, 0);

    auto result = std::unique_ptr<int[]> { new int[data_size] };
    multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::par_unseq);  // multithreaded
    // multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::seq);  // singlethreaded

    return 0;
}
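
One practical note: depending on the toolchain, the parallel algorithms may need extra build setup. With GCC's libstdc++, for example, the parallel backend is built on Intel TBB, so a build line along these lines is typically required (exact flags depend on the environment):

g++ -std=c++17 -O2 main.cpp -ltbb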

2 Comments

I know it's been a while, but after realizing how transform works, your code has been really informative. There are so many useful std functions that not many people know about, and thank you for bringing this up. One question though: when doing multi-threading with transform, is there a way to specify how many threads the function should use?
@User800222 I'm afraid there's currently no standard way to influence how many threads will be used. One can hope that the ongoing work on executors may eventually give some more control over how the parallel standard algorithms are scheduled, but that remains to be seen…
