
Edit

Let me restate everything. I'm implementing Bellman-Ford in OpenMP. As I understand it, the compare step and the update of dist must be done in a critical block, since a concurrent update of dist could change the result of the compare step: there is a data race here.

My problem is that the updated_last_round variable need not be updated in a critical block. There is a data race on it, but the only value ever written is true, so it doesn't matter. My worry about the current implementation is that all the threads update updated_last_round inside the critical section and slow each other down.

bool compare(int v1, int w, int v2) {
    // checks if v1 + w < v2
    if (v1 == INT_MAX) return false;
    if (v2 == INT_MAX) return true;
    return ((v1+w) < v2);
}

vector<int> bellman_ford(Graph& g, int root) {
    vector<int> dist(g.num_nodes());
    #pragma omp parallel for
    for (int i = 0; i < g.num_nodes(); i++)
        dist[i] = INT_MAX; // set INF

    dist[root] = 0;

    int round = 0;
    bool updated_last_round = true;
    // relax procedure
    while (updated_last_round && round < g.num_nodes()) {
        updated_last_round = false;
        #pragma omp parallel for
        for (int u = 0; u < g.num_nodes(); u++) {
            vector<int> neighbors = g.out_neighbors(u);
            vector<int> weights = g.out_weights_neighbors(u);
            #pragma omp parallel for 
            for (int j = 0; j < g.out_degree(u); j++) {
                int v = neighbors[j];
                int weight = weights[j];
                #pragma omp critical
                {
                    if (compare(dist[u], weight, dist[v])) {
                        dist[v] = dist[u] + weight;
                        updated_last_round = updated_last_round || true;
                    }
                }
            }
        }
        round += 1;
    }

    /* ... */

    return dist;
}

Original

I'm trying to parallelize some code in OpenMP that requires an atomic check-and-set inside a parallel for loop, and I am computing at the end of each iteration whether at least one value has been set.

Right now I'm using a reduction(||:updated_last_round) to reduce the boolean at the end of every iteration, but I'm not sure this actually speeds anything up, as the line of code that updates the boolean is still inside the critical section.

bool updated_last_round = true;

while (updated_last_round) {
  updated_last_round = false;

  #pragma omp parallel for reduction(||:updated_last_round)
  for (/* stuff */) {
    // other stuff
    #pragma omp critical
    {
      if (should_update(vars_to_decide_with)) {
        // do the important critical update

        // I DON'T want this to be atomically updated, as
        // the data race doesn't matter at all
        updated_last_round = updated_last_round || true;
      }
    }
  }
}

It should make sense that there's a way to let the critical section only do the critical stuff and then proceed to set a thread-local bool value, then reduce the local values at the end of each iteration. How should I achieve that?

  • You can't use #pragma omp parallel for with a while loop. Commented Jun 27, 2019 at 2:43
  • Oops my bad. I changed the code while posting. Let me quickly fix Commented Jun 27, 2019 at 2:53
  • I'm not an expert, but you already pay price for entering the critical section. Performing this one operation outside shouldn't affect anything. Commented Jun 27, 2019 at 3:00
  • Just precalculate should_update before the if statement: dpaste.com/2F81ZC9 But since this definitely races, a race condition detection on your program will fail. Commented Jun 27, 2019 at 3:07
  • What's your actual question? If you want to know whether you get a speed up - measure. If you want to know whether you do the right thing, you must provide more information. Please read up on creating a minimal reproducible example. We can't tell you what's right unless you tell us about "stuff", "other stuff" and "important update". It does matter. Commented Jun 27, 2019 at 7:19

2 Answers


First, concurrently writing to updated_last_round is technically still a race condition, even if you only write the same value.

However, don't worry about the writes to updated_last_round. Compared to the overall overhead of a critical section it is very unlikely to matter. Do worry about the overhead of a critical section inside of each tiny inner loop iteration. Given the read and write dependency on both dist[v] and dist[u], I don't see any way to resolve the critical section.

That said, you could add the reduction and still set updated_last_round within the critical section. In theory that would speed up the write, because it would then go to a thread-local copy instead of a shared variable subject to cache invalidation. But again, that won't matter compared to the huge overhead of the critical section.
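
For illustration, that combination might look roughly like this (a sketch only, reusing the Graph interface and compare() from the question, and parallelizing only the outer loop as discussed below):

// Each thread gets a private copy of updated_last_round; the copies are
// OR-combined when the parallel region ends, so the write inside the
// critical section only touches thread-local storage.
while (updated_last_round && round < g.num_nodes()) {
    updated_last_round = false;
    #pragma omp parallel for reduction(||:updated_last_round)
    for (int u = 0; u < g.num_nodes(); u++) {
        vector<int> neighbors = g.out_neighbors(u);
        vector<int> weights = g.out_weights_neighbors(u);
        for (int j = 0; j < g.out_degree(u); j++) {
            int v = neighbors[j];
            int weight = weights[j];
            #pragma omp critical
            {
                if (compare(dist[u], weight, dist[v])) {
                    dist[v] = dist[u] + weight;
                    updated_last_round = true;
                }
            }
        }
    }
    round += 1;
}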

Note: The only benefit you would get from parallelization is if the out_*neighbors functions were very expensive. But I assume they just return a fixed vector - which you should return by const& and capture by const& for performance reasons.
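
For example, accessors along these lines avoid copying a vector on every outer-loop iteration (the member names here are assumptions, not the OP's actual Graph class):

// Inside the hypothetical Graph class: return references to internal storage.
const vector<int>& out_neighbors(int u) const { return adj_[u]; }
const vector<int>& out_weights_neighbors(int u) const { return adj_weights_[u]; }

// At the call site, bind by const& instead of copying:
const vector<int>& neighbors = g.out_neighbors(u);
const vector<int>& weights = g.out_weights_neighbors(u);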

If you wanted to effectively parallelize this algorithm, you would have to think about partitioning your data somehow to resolve the dependency. Be careful: unfortunately, searching for "Bellman-Ford OpenMP" turns up some very incorrect attempts, such as this upvoted and accepted answer on SO.

Besides that, don't use nested parallelism (a parallel region inside another parallel region) unless you really know what you're doing. Parallelize the outermost loop for which it is safe to do so, and use collapse if it brings a performance benefit.

Also, good job on declaring variables as locally as possible - this makes it much easier to reason about race conditions. It may be a bit tricky with the vector copies, but those should probably be const& anyway.


1 Comment

First of all, thank you for the detailed response. I think there probably isn't much I can do to optimize this version further. There is a pull-based version that I can implement, where you just let each node update its own distance by finding all its in_neighbors and checking their distances. I think I'll try it for comparison. Thank you so much for this!
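
For reference, one round of that pull-based variant could look roughly like this (a sketch; the in_neighbors, in_weights_neighbors and in_degree accessors are assumed, not taken from the question). Reading from the previous round's distances and writing only dist_next[u] gives each element a single writer, so no critical section is needed:

// dist holds last round's values; dist_next receives this round's.
// Only the thread handling u writes dist_next[u].
vector<int> dist_next = dist;
bool updated_last_round = false;
#pragma omp parallel for reduction(||:updated_last_round)
for (int u = 0; u < g.num_nodes(); u++) {
    const vector<int>& in_neighbors = g.in_neighbors(u);
    const vector<int>& in_weights = g.in_weights_neighbors(u);
    for (int j = 0; j < g.in_degree(u); j++) {
        int v = in_neighbors[j];
        int weight = in_weights[j];
        if (dist[v] != INT_MAX && dist[v] + weight < dist_next[u]) {
            dist_next[u] = dist[v] + weight;
            updated_last_round = true;
        }
    }
}
dist.swap(dist_next); // publish this round's distances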

It should make sense that there's a way to let the critical section only do the critical stuff and then proceed to set a thread-local bool value, then reduce the local values at the end of each iteration. How should I achieve that?

Something like this? It seems to me to be the obvious implementation of what you just described. I have moved the test outside the critical section; without more information it's not clear if that is safe or not...

bool updated_last_round = true;

while (updated_last_round) {
  updated_last_round = false;

  #pragma omp parallel for reduction(||:updated_last_round)
  for (/* stuff */) {
    // other stuff
    bool updated_this_iteration = false;

    if (should_update(vars_to_decide_with)) {
      #pragma omp critical
      {
        // do the important critical update
      }
      // Set local, per-iteration, value
      updated_this_iteration = true;
    }
    updated_last_round = updated_last_round || updated_this_iteration;
  }
}

