
I'm having trouble with a function that calls itself; the proper way of threading it with OpenMP is unclear to me in this case.

#include <stdio.h>
#include <stdlib.h>

long *arrayofindex;
int length, N;

void gen(long index)
{
    if(index == 0)
    {
        #pragma omp parallel for
        for(int a=0; a<N; ++a)
        {
            #pragma omp critical
            {
            gen(index+1);
            ++arrayofindex[index];
            }
        }
    }
    else
    {
        for(arrayofindex[index]=0; arrayofindex[index]<N; ++arrayofindex[index])
        {
            if(index < length-1)
                gen(index+1);
            else printf("%ld\n", arrayofindex[index]);
        }
    }
}

int main(){
    length = 5, N = 4;
    arrayofindex = (long*) malloc(length * sizeof(long));
    for(int i=0; i<length; ++i)
        arrayofindex[i] = 0;
    gen(0);
}

The way I did it, it runs several threads and produces correct results, but it doesn't seem to get any faster than without OpenMP support.

Is there a way around this, or is it hopeless because of the way the code is structured?

The complete code is at https://github.com/e2002e/zhou

  • What is the initialization statement in for(arrayofindex[index]; ... supposed to accomplish for you? Is that a typo? Commented Feb 25, 2019 at 22:24
  • Also, what does criticalfunc() do? In particular, does it access the elements of arrayofindex? Commented Feb 25, 2019 at 22:26
  • Furthermore, arrayofindex = (long*) malloc(5); is almost surely wrong. Did you mean arrayofindex = malloc(5 * sizeof(long));? Commented Feb 25, 2019 at 22:29
  • Ok, after noticing the use of = instead of == in if(index = 0){, I'm done. As presented, this code has an infinite recursion. Update the code to present a bona fide minimal reproducible example, or we are unlikely to be able to help you. Commented Feb 25, 2019 at 22:35
  • Your comment that "without taking into account what is going on in the 'criticalfunc' I miss hits on the function itself" seems to be saying that you have data dependencies between calls to gen() with different arguments. OpenMP has no magic solution for that. But we can neither confirm nor suggest any mitigation without an MCVE to work with. You have not yet presented one, for there is no definition of criticalfunc() (which indeed seems to be critical), nor any example input, expected output, or actual (erroneous) output. Commented Feb 26, 2019 at 13:53

1 Answer


As I understand the code presented, the general objective is essentially to generate all the length-letter words that can be formed from an N-letter alphabet. Each recursive call to gen() corresponds to one letter position, and so each time control reaches the bottom of the recursion, the elements of arrayofindex represent the letters of one word.

But it should be obvious that multiple threads running in parallel cannot use the same arrayofindex. Each thread expects when it reaches the bottom of the recursion to find in arrayofindex the values that it set along its recursive path. That's fundamental to the approach. If other threads are modifying arrayofindex at the same time then they all likely get mishmashes of values set by different threads. Moreover, you probably don't get anything like the speedup you hope for, because the threads need to synchronize their accesses to arrayofindex.


Note: This problem has nothing in particular to do with recursion. You would have exactly the same issue if you modified the code to be iterative instead of recursive, which is in fact what I would do myself if I were looking to improve performance; a rough sketch of that idea follows.
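A rough, serial sketch of such an iterative, odometer-style enumeration (illustrative only; the function name and the letters-as-characters output are assumptions, not part of the original code):

#include <stdio.h>

/* Enumerate all length-letter words over a num_letters alphabet without
 * recursion: treat idx[] as an odometer and keep advancing it. */
static void gen_iterative(int length, int num_letters) {
    int idx[length];                        /* VLA holding the current word */
    for (int i = 0; i < length; i++)
        idx[i] = 0;

    for (;;) {
        /* "Visit" the current word. */
        for (int i = 0; i < length; i++)
            putchar('A' + idx[i]);
        putchar('\n');

        /* Advance like an odometer: rightmost position first, carry left. */
        int pos = length - 1;
        while (pos >= 0 && ++idx[pos] == num_letters) {
            idx[pos] = 0;
            pos--;
        }
        if (pos < 0)                        /* wrapped past the first position: done */
            break;
    }
}

If each thread were handed a disjoint range of first letters and its own idx[], the same data-sharing considerations described above would still apply.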


There are various ways to give each OMP thread its own working array. If the space must continue to be dynamically allocated, then you should arrange for it to be allocated inside the parallel region, so that each thread allocates its own. If you're willing and able to rely on variable-length arrays for this, however, then probably the only other thing you need is an OMP private clause attached to your parallel for construct.

For example, this variation on your code works for me:

#include <assert.h>
#include <stdio.h>

void gen_tail(int length, int num_letters, int arrayofindex[], int position) {
    for (int letter = 0; letter < num_letters; letter++) {
        arrayofindex[position] = letter;
        if (position + 1 < length) {
            gen_tail(length, num_letters, arrayofindex, position + 1);
        } else {
            // this is the bottom ... do something with arrayofindex, such as:
            #pragma omp critical
            {
                for (int i = 0; i < length; i++) {
                    putchar('A' + arrayofindex[i]);
                }
                putchar('\n');
            }
        }
    }
}

void gen(int length, int num_letters) {
    assert(length > 1);
    int arrayofindex[length];  // Note: VLA, _not_ dynamically allocated

    // Marking the array 'private' means each thread gets its own copy.
    // This would not have the needed effect if 'arrayofindex' were a pointer.
    #pragma omp parallel for private(arrayofindex)
    for (int letter = 0; letter < num_letters; letter++) {
        arrayofindex[0] = letter;
        gen_tail(length, num_letters, arrayofindex, 1);
    }
}

int main(void) {
    gen(5, 4);
}

That emits the expected 1024 (== 4^5) results for me, all distinct, as I have every reason to expect it should do.
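For the other approach mentioned earlier, allocating the working space dynamically inside the parallel region, a rough sketch might look like this (it reuses the gen_tail() above and needs <stdlib.h>; the function name is just illustrative):

void gen_alloc(int length, int num_letters) {
    #pragma omp parallel
    {
        /* Each thread allocates, uses, and frees its own working array,
         * so nothing is shared between threads. */
        int *arrayofindex = malloc(length * sizeof *arrayofindex);

        #pragma omp for
        for (int letter = 0; letter < num_letters; letter++) {
            arrayofindex[0] = letter;
            gen_tail(length, num_letters, arrayofindex, 1);
        }

        free(arrayofindex);
    }
}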


6 Comments

Works very well, BUT the threads, and the overhead of the implicit parallel for locks, are slowing it down, I guess.
That would not be a surprise with my example code, @Yvain, given that it performs I/O in a critical region. In fact, individual stream I/O calls for the same stream are synchronized automatically anyway; the critical region in the example is just to keep each thread's output together. I anticipate that the I/O accounts for the lion's share of all work done by that code, and that will greatly limit the speedup available. You should be able to do better by storing results in memory instead of on disk, or, possibly, having each thread write to a different file.
OK, another thing: I kept the original fashion of the code and used a `#pragma omp taskloop` on top of the first iteration of the function, the one you kept as gen() in your example. There it does its job; I couldn't tell much without piping it to john-the-ripper, whose speed then increases a little, and I did not have to change a single line other than this one...
@Yvain an omp taskloop does not have different data-sharing implications than an omp for. To the best of my understanding, however, an omp taskloop is useful only in the context of an enclosing omp parallel section. Without one, I think you just have serial execution in a single thread, which explains why the data sharing does not cause a problem in that case. Overall, I think you're I/O bound, not CPU bound, so parallelization is unlikely to help much. Perhaps not at all.
That's exactly it; the differences I saw in my results were due to the system.
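For reference, the taskloop idiom discussed above would typically need an enclosing parallel region plus a single construct, roughly like this sketch (it reuses gen_tail() from the answer; names are otherwise illustrative):

void gen_taskloop(int length, int num_letters) {
    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; the whole team executes them. */
        #pragma omp taskloop
        for (int letter = 0; letter < num_letters; letter++) {
            int arrayofindex[length];   /* declared inside the body, so each task has its own */
            arrayofindex[0] = letter;
            gen_tail(length, num_letters, arrayofindex, 1);
        }
    }
}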
