
I am reading a paragraph about the tbb::parallel_scan algorithm in the book *Intel Threading Building Blocks*. I understand what the operation does serially, but I don't understand the requirements on the body object. The description in the book is incredibly vague: it says the algorithm can perform two passes over the input data, and it mentions an assign operation and a reverse_join operation.

I am trying to understand when these operations are applied and how they work. This is the body object fed to parallel_scan:


    class Body {
        T sum;
        T* const y;
        const T* const z;
    public:
        // 'id' is the identity element of the + operation (e.g. T(0) for sums)
        Body( T y_[], const T z_[] ) : sum(id), y(y_), z(z_) {}
        T get_sum() const { return sum; }
    
        template<typename Tag>
        void operator()( const oneapi::tbb::blocked_range<int>& r, Tag ) {
            T temp = sum;
            for( int i=r.begin(); i<r.end(); ++i ) {
                temp = temp + z[i];
                if( Tag::is_final_scan() )   // y is written only during a final scan
                    y[i] = temp;
            }
            sum = temp;
        }
        // splitting constructor: the new body starts from the identity
        Body( Body& b, oneapi::tbb::split ) : sum(id), y(b.y), z(b.z) {}
        void reverse_join( Body& a ) { sum = a.sum + sum; }
        void assign( Body& b ) { sum = b.sum; }
    };

So for each block, the algorithm first computes the sum of all the elements and accumulates it in sum, starting from the identity for each block. Is this the famous first pass? What happens then? Is assign called to pass the result to the adjacent block? When does the second pass happen, and when is reverse_join called?

  • Note we do not have a book, so you should provide a minimal reproducible example. You can start with this: godbolt.org/z/f43eGWa11 Commented Nov 14 at 11:49
  • 1
    "is assign called to pass the result to the adjacent block?" not in the shown code. "When is reverse_join called?" also not. It appears you are asking for clarification of passages in the book that you did not include in the question. If you include the parts that are unclear in the question maybe someone can explain, but as long as you don't understand what the book tries to say its not sufficient if you paraphrase what it says. Commented Nov 14 at 12:01
  • 1
    One of the well-known ways to do a parallel prefix-sum is indeed to sum chunks (in parallel) and combine (serially) to get known starting-points for each chunk, allowing parallel work in a second pass. That's probably part of what they're doing. Another trick is to use SIMD and/or ILP within each chunk to hide latency, especially of FP addition, which can speed up that second pass for each thread separately, especially for a CPU rather than GPU. SIMD prefix sum on Intel cpu does some cache-blocking within chunks, too. Commented Nov 14 at 12:17
  • 1
    I understand this may appear a vague question, but it's because I am asking about a specific algorithm present in TBB, tbb::parallel_scan and how to use it. There is no code in the book that shows how the algorithm works, just that it takes a blocked_range and a body object as the one I included as inputs. And to me it's unclear what the necessary methods one has to provide actually do. This is very similar to what you get in the book: oneapi-spec.uxlfoundation.org/specifications/oneapi/v1.1-rev-1/… Commented Nov 14 at 12:18
  • Also see: Accumulating a running-total (prefix sum) horizontally across an __m256i vector / parallel prefix (cumulative) sum with SSE (especially see comments on Z Boson's self-answer, about doing multiple SIMD vectors at once) / Prefix Sum Parallel Algorithm Commented Nov 14 at 12:20
