
I just want to write a simple program in C++ that creates two threads, each of which fills a vector with the squares of the integers (0, 1, 4, 9, ...). Here is my code:

#include <iostream>
#include <vector>
#include <functional>
#include <thread>
#include <time.h>

#define MULTI 1
#define SIZE 10000000

void fill(std::vector<unsigned long long int> &v, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        v.push_back(i * i);
    }
}

int main()
{
    std::vector<unsigned long long int> v1, v2;
    v1.reserve(SIZE);
    v2.reserve(SIZE);
    #if !MULTI
    clock_t t = clock();
    fill(v1, SIZE);
    fill(v2, SIZE);
    t = clock() - t;
    #else
    clock_t t = clock();
    std::thread first(fill, std::ref(v1), SIZE);
    fill(v2, SIZE);
    first.join();
    t = clock() - t;
    #endif
    std::cout << (float)t / CLOCKS_PER_SEC << std::endl;
    return 0;
}

But when I run my program, I see that there is no significant difference between the serial version and the parallel one (sometimes the parallel version even shows worse results). Any idea what is happening?

  • One way to test whether it is false sharing would be to have the fill function use new to create the vector, fill it, and then return a pointer to it (possibly through a reference parameter). That would probably fix any false sharing, which happens when two threads modify different data that happens to sit on the same cache line. Commented Feb 20, 2016 at 16:09
  • @Kyle Sorry, it can't be. I missed that there are two vectors here. Anyway, push_back would need a lock. Commented Feb 20, 2016 at 16:11
  • The program generates consecutive memory write accesses to widely separated addresses, with virtually no computational time involved in the two threads. Execution time is basically just a function of the hardware's memory cache architecture. Two long, consecutive sequential writes are likely as fast as or faster than two interleaved sequential writes. Commented Feb 20, 2016 at 16:25
  • man clock(): The clock() function returns an approximation of processor time used by the program. I invite you to ponder the meaning of "processor time". Commented Feb 20, 2016 at 16:26
  • You are exercising the memory subsystem in this code, not the processor, and you only have one memory subsystem. It is also serialized by the virtual-memory paging it triggers: you pay for RAM allocation when you first touch the memory. Commented Feb 20, 2016 at 16:39

3 Answers


When I execute your code with MSVC2015 on an i7, I observe:

  • in debug mode, the multithreaded version takes 14 s compared to 26 s single-threaded, so it's almost twice as fast. The results are as expected.
  • in release mode, the multithreaded version takes 0.3 s compared to 0.2 s single-threaded, so it's slower, as you've reported.

This suggests that your issue is that the optimized fill() is too short compared to the overhead of creating a thread.

Note also that even when there is enough work to do in fill() (e.g. the unoptimized version), multithreading will not divide the time by two. Multithreading increases overall throughput on a multicore processor, but each thread taken separately might run a little slower than usual.

Edit: additional information

Multithreading performance depends on many factors: among others, the number of cores on your processor, the cores used by other processes running during the test, and, as doug remarked in his comment, the profile of the multithreaded task (i.e. memory-bound vs. compute-bound).

To illustrate this, here are the results of an informal benchmark. They show that the decrease of individual thread throughput is much steeper for memory-intensive than for floating-point-intensive computations, and that global throughput grows much more slowly (if at all):

[Chart: per-thread and global throughput vs. number of threads, for the memory-intensive and floating-point-intensive tasks below]

Using the following functions for each thread:

// computation-intensive task
void mytask(unsigned long long loops)
{
    volatile double x;
    for (unsigned long long i = 0; i < loops; i++) {
        x = sin(sqrt(i) / i * 3.14159);
    }
}

// memory-intensive task
void mytask2(std::vector<unsigned long long>& v, unsigned long long loops)
{
    for (unsigned long long i = 0; i < loops; i++) {
        v.push_back(i * 3 + 10);
    }
}

2 Comments

The thread overhead is minimal in comparison to the memory-write bottleneck in the optimized code. When multithreading with memory writes to different address segments, there is additional overhead synchronizing the writes. Multithreading operates most effectively when each thread is compute-intensive rather than memory-intensive.
@doug Yes, that's true. I've edited the answer to highlight both effects and show some experimental measurements.

The fill function runs so fast that the thread overhead is likely as long as the execution itself.

Replace fill with something that takes a significant amount of time to execute. As a first pass, use std::this_thread::sleep_for.


Most of the suggestions are right: threading a task improves execution time only if the per-thread CPU load (in your case the multiplication i * i) outweighs the shared-memory access load (in your case v.push_back). You can try the code below; you will see the gains of threading. You can also use the Unix command

>time ./a.out 

to time your code more easily.

#include <iostream>
#include <vector>
#include <functional>
#include <thread>
#include <time.h>
#include <math.h>

#define MULTI 1
#define SIZE 10000000

void fill(std::vector<unsigned long long int> &v, size_t n)
{
    unsigned long long sum = 0;  // wide enough to avoid overflow of the running sum
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < 100; ++j) {
            sum += sqrt(i * j);  // CPU-heavy work dominates the run time
        }
    }
    v.push_back(sum);  // a single shared-memory write per thread
}

int main()
{
    std::vector<unsigned long long int> v1, v2;
    v1.reserve(SIZE);
    v2.reserve(SIZE);
    #if !MULTI
    fill(v1, SIZE);
    fill(v2, SIZE);
    #else
    std::thread first(fill, std::ref(v1), SIZE);
    std::thread second(fill, std::ref(v2), SIZE);

    first.join();
    second.join();
    #endif
    return 0;
}
