
I'm having trouble with a function that calls itself; the proper way of threading it with OpenMP is unclear to me in this case.

#include <stdio.h>
#include <stdlib.h>

long *arrayofindex;
int length, N;

void gen(long index)
{
    if(index == 0)
    {
        #pragma omp parallel for
        for(int a=0; a<N; ++a)
        {
            #pragma omp critical
            {
            gen(index+1);
            ++arrayofindex[index];
            }
        }
    }
    else
    {
        for(arrayofindex[index]=0; arrayofindex[index]<N; ++arrayofindex[index])
        {
            if(index < length-1)
                gen(index+1);
            else printf("%ld\n", arrayofindex[index]);
        }
    }
}

int main(){
    length = 5, N = 4;
    arrayofindex = (long*) malloc(length * sizeof(long));
    for(int i=0; i<length; ++i)
        arrayofindex[i] = 0;
    gen(0);
}

The way I did it, it runs several threads and produces correct results, but it doesn't seem to get any faster than without OpenMP support.

Is there a way around this, or is it hopeless because of the way the code is structured?

The complete code is at https://github.com/e2002e/zhou

  • What is the initialization statement in for(arrayofindex[index]; ... supposed to accomplish for you? Is that a typo? Commented Feb 25, 2019 at 22:24
  • Also, what does criticalfunc() do? In particular, does it access the elements of arrayofindex? Commented Feb 25, 2019 at 22:26
  • Furthermore, arrayofindex = (long*) malloc(5); is almost surely wrong. Did you mean arrayofindex = malloc(5 * sizeof(long));? Commented Feb 25, 2019 at 22:29
  • Ok, after noticing the use of = instead of == in if(index = 0){, I'm done. As presented, this code has an infinite recursion. Update the code to present a bona fide minimal reproducible example, or we are unlikely to be able to help you. Commented Feb 25, 2019 at 22:35
  • Your comment that "without taking into account what is going on in the 'criticalfunc' I miss hits on the function itself" seems to be saying that you have data dependencies between calls to gen() with different arguments. OpenMP has no magic solution for that. But we can neither confirm nor suggest any mitigation without an MCVE to work with. You have not yet presented one, for there is no definition of criticalfunc() (which indeed seems to be critical), nor any example input, expected output, or actual (erroneous) output. Commented Feb 26, 2019 at 13:53

1 Answer


As I understand the code presented, the general objective is essentially to generate all the length-letter words that can be formed from an N-letter alphabet. Each recursive call to gen() corresponds to one letter position, and so each time control reaches the bottom of the recursion, the elements of arrayofindex represent the letters of one word.

But it should be obvious that multiple threads running in parallel cannot use the same arrayofindex. Each thread expects when it reaches the bottom of the recursion to find in arrayofindex the values that it set along its recursive path. That's fundamental to the approach. If other threads are modifying arrayofindex at the same time then they all likely get mishmashes of values set by different threads. Moreover, you probably don't get anything like the speedup you hope for, because the threads need to synchronize their accesses to arrayofindex.


Note: This problem has nothing in particular to do with recursion. You would have exactly the same issue if you modified the code to be iterative instead of recursive, which is in fact what I would do myself if I were looking to improve performance; a rough sketch of that idea follows.
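A rough, serial sketch of such an iterative, odometer-style enumeration (illustrative only; the function name and the letters-as-characters output are assumptions, not part of the original code):

#include <stdio.h>

/* Enumerate all length-letter words over a num_letters alphabet without
 * recursion: treat idx[] as an odometer and keep advancing it. */
static void gen_iterative(int length, int num_letters) {
    int idx[length];                        /* VLA holding the current word */
    for (int i = 0; i < length; i++)
        idx[i] = 0;

    for (;;) {
        /* "Visit" the current word. */
        for (int i = 0; i < length; i++)
            putchar('A' + idx[i]);
        putchar('\n');

        /* Advance like an odometer: rightmost position first, carry left. */
        int pos = length - 1;
        while (pos >= 0 && ++idx[pos] == num_letters) {
            idx[pos] = 0;
            pos--;
        }
        if (pos < 0)                        /* wrapped past the first position: done */
            break;
    }
}

If each thread were handed a disjoint range of first letters and its own idx[], the same data-sharing considerations described above would still apply.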


There are various ways to give each OMP thread its own working array. If the space must continue to be dynamically allocated, then you should arrange for it to be allocated inside the parallel region, so that each thread allocates its own. If you're willing and able to rely on variable-length arrays for this, however, then probably the only other thing you need is an OMP private clause attached to your parallel for construct.

For example, this variation on your code works for me:

#include <assert.h>
#include <stdio.h>

void gen_tail(int length, int num_letters, int arrayofindex[], int position) {
    for (int letter = 0; letter < num_letters; letter++) {
        arrayofindex[position] = letter;
        if (position + 1 < length) {
            gen_tail(length, num_letters, arrayofindex, position + 1);
        } else {
            // this is the bottom ... do something with arrayofindex, such as:
            #pragma omp critical
            {
                for (int i = 0; i < length; i++) {
                    putchar('A' + arrayofindex[i]);
                }
                putchar('\n');
            }
        }
    }
}

void gen(int length, int num_letters) {
    assert(length > 1);
    int arrayofindex[length];  // Note: VLA, _not_ dynamically allocated

    // Marking the array 'private' means each thread gets its own copy.
    // This would not have the needed effect if 'arrayofindex' were a pointer.
    #pragma omp parallel for private(arrayofindex)
    for (int letter = 0; letter < num_letters; letter++) {
        arrayofindex[0] = letter;
        gen_tail(length, num_letters, arrayofindex, 1);
    }
}

int main(void) {
    gen(5, 4);
}

That emits the expected 1024 (== 4^5) results for me, all distinct, as I have every reason to expect it should do.
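For the other approach mentioned earlier, allocating the working space dynamically inside the parallel region, a rough sketch might look like this (it reuses the gen_tail() above and needs <stdlib.h>; the function name is just illustrative):

void gen_alloc(int length, int num_letters) {
    #pragma omp parallel
    {
        /* Each thread allocates, uses, and frees its own working array,
         * so nothing is shared between threads. */
        int *arrayofindex = malloc(length * sizeof *arrayofindex);

        #pragma omp for
        for (int letter = 0; letter < num_letters; letter++) {
            arrayofindex[0] = letter;
            gen_tail(length, num_letters, arrayofindex, 1);
        }

        free(arrayofindex);
    }
}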


6 Comments

Works very well, BUT the threads, and the overhead of the implicit parallel for locks, are slowing it down, I guess.
That would not be a surprise with my example code, @Yvain, given that it performs I/O in a critical region. In fact, individual stream I/O calls for the same stream are synchronized automatically anyway; the critical region in the example is just to keep each thread's output together. I anticipate that the I/O accounts for the lion's share of all work done by that code, and that will greatly limit the speedup available. You should be able to do better by storing results in memory instead of on disk, or, possibly, having each thread write to a different file.
OK, another thing: I kept the original fashion of the code and used a `#pragma omp taskloop` on top of the first iteration of the function, the one you kept as gen() in your example. There it does its job; I couldn't tell much without piping it to john-the-ripper, whose speed then increases a little, and I did not have to change a single line other than this one...
@Yvain an omp taskloop does not have different data-sharing implications than an omp for. To the best of my understanding, however, an omp taskloop is useful only in the context of an enclosing omp parallel section. Without one, I think you just have serial execution in a single thread, which explains why the data sharing does not cause a problem in that case. Overall, I think you're I/O bound, not CPU bound, so parallelization is unlikely to help much. Perhaps not at all.
That's exactly it; the differences I saw in my results were due to the system.
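For reference, the taskloop idiom discussed above would typically need an enclosing parallel region plus a single construct, roughly like this sketch (it reuses gen_tail() from the answer; names are otherwise illustrative):

void gen_taskloop(int length, int num_letters) {
    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; the whole team executes them. */
        #pragma omp taskloop
        for (int letter = 0; letter < num_letters; letter++) {
            int arrayofindex[length];   /* declared inside the body, so each task has its own */
            arrayofindex[0] = letter;
            gen_tail(length, num_letters, arrayofindex, 1);
        }
    }
}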
