Generating Ulam numbers with OpenMP and single-thread versions

Question

I'm making a program that takes an integer n and generates the first n Ulam numbers. I followed this guide about OpenMP.

This is the core function, single thread version:

bool isulam (int n, int size) {
    int count = 0;
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            if (i != j && ulam[i]+ulam[j] == n) count++;
            if (count > 2) return false;
        }
    return count;
}

And this is my attempt at optimizing it with OpenMP:

bool isulam (int n, int size) {
    int count = 0;
    bool toomany = false;
    for (int i = 0; i < size; i++)
        #pragma omp parallel for reduction (|:toomany)
        for (int j = 0; j < size; j++) {
            if (i != j && ulam[i]+ulam[j] == n) count++;
            if (count > 2) toomany = true;
        }
    if (count > 2) return false;
    return count;
}

I'm compiling with g++ -Ofast -fopenmp. The output is correct, but the OpenMP version is much slower:

$ time ompulam <<< 1000 > /dev/null

real  0m22.211s
user  0m39.697s
sys   0m3.202s
$ time ulam <<< 1000 > /dev/null

real  0m7.073s
user  0m7.017s
sys   0m0.008s

What's happening? My CPU is an AMD E1 2500 (2 cores, 1400 MHz), which may not be the best, but I was hoping for a much different result. Is OpenMP only worth it on 4+ cores?

FWIW with a regular #pragma omp parallel for (thus without the toomany), the code is running in 18.583s.

Don't know the details. But spinning up a thread is expensive. Doing it to run one line of a loop seems counterproductive.
– Loki Astari, Commented Sep 9, 2014 at 0:12
This may be because time reports the total across all threads, although I'm not sure. I remember once running time make -8j and finding that the reported amount was more than for time make (1 job).
– Hongxu Chen, Commented Sep 9, 2014 at 6:34

Jamal · Accepted Answer · 2014-12-20 03:56:59Z

My CPU is an AMD E1 2500

That CPU is dual-core, so even perfect multithreading with exactly zero overhead could at best double the performance. As the comments already pointed out, the overhead is more than zero. Other than that, your CPU is a low-power Kabini part with a 15 W TDP, so it's not unlikely that thermal throttling lowers the clock speed when running multithreaded. Try looking at the actual clock speed (not % CPU use) when running the test again.
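To put a number on the "at best double" bound, Amdahl's law gives the ideal speedup on p cores when a fraction f of the run time parallelizes perfectly. The symbols p and f are only illustrative here; nothing in the question measures f.

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
\qquad
S(2)\big|_{f=1} = 2,
\qquad
S(2)\big|_{f=0.9} = \frac{1}{0.1 + 0.45} \approx 1.82
```

So even before any threading overhead is counted, two cores only reach a factor of 2 if every bit of the work runs in parallel.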

With that out of the way, the problem in your code is that it introduces massive overhead. In addition to the overhead of entering a parallel region on every iteration of the outer loop, the variable count is shared between the threads, so every count++ is an unsynchronized update (a data race) that makes the cores contend for the same memory location. Also, your code always runs the entire loop, while the single-threaded version aborts early. A much faster version is the following, although it still does not abort early (disclaimer: not tested, as I don't have OpenMP installed).

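bool isulam (int n, int size) {
    int count = 0;
    for (int i = 0; i < size; i++)
        // each thread adds into its own private count; the copies are summed at the end
        #pragma omp parallel for reduction (+:count)
        for (int j = 0; j < size; j++) {
            if (i != j && ulam[i]+ulam[j] == n) count++;
        }
    // every pair is counted twice, as (i,j) and (j,i), so exactly one pair means count == 2
    return count == 2;
}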
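A possible variation, sketched under assumptions: putting the pragma on the outer loop means the parallel region is entered once per call instead of once per value of i, which targets the per-region overhead the comments mention. The names is_ulam and first_ulam and the choice to pass the list as a std::vector are hypothetical (the original code uses a global ulam array), and this has not been timed on the asker's hardware, so treat it as something to measure rather than a guaranteed speedup. Without -fopenmp the pragma is ignored and the code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): counts ordered pairs
// (i, j), i != j, with ulam[i] + ulam[j] == n over the numbers found so far.
// n is an Ulam number when exactly one unordered pair exists, i.e. count == 2.
bool is_ulam(const std::vector<int>& ulam, int n) {
    const long size = static_cast<long>(ulam.size());
    int count = 0;
    // One parallel region per call; each thread accumulates a private count,
    // and the private copies are added together when the loop finishes.
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < size; i++)
        for (long j = 0; j < size; j++)
            if (i != j && ulam[i] + ulam[j] == n) count++;
    return count == 2;
}

// Hypothetical driver: builds the first `wanted` Ulam numbers from the seeds 1 and 2.
std::vector<int> first_ulam(std::size_t wanted) {
    std::vector<int> ulam = {1, 2};
    for (int n = 3; ulam.size() < wanted; n++)
        if (is_ulam(ulam, n)) ulam.push_back(n);
    ulam.resize(wanted);  // trims the seeds when wanted < 2
    return ulam;
}
```

Like the version above, this still scans all pairs and never aborts early, so the single-threaded code with its early return may remain competitive on two cores.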
