
I am reading a paragraph about the tbb::parallel_scan algorithm in the book *Intel Threading Building Blocks*. I understand what the operation does serially, but I don't understand the requirements on the body object. The description in the book is incredibly vague: it says the algorithm can perform two passes over the input data, and it mentions an assign operation and a reverse_join operation.

I am trying to understand when these operations are applied and how they work. This is the body object fed to parallel_scan:


    class Body {
        T sum;
        T* const y;
        const T* const z;
    public:
        // 'id' is the identity element of the + operation (e.g. T(0) for sums)
        Body( T y_[], const T z_[] ) : sum(id), y(y_), z(z_) {}
        T get_sum() const { return sum; }
    
        template<typename Tag>
        void operator()( const oneapi::tbb::blocked_range<int>& r, Tag ) {
            T temp = sum;
            for( int i=r.begin(); i<r.end(); ++i ) {
                temp = temp + z[i];
                if( Tag::is_final_scan() )   // y is written only during a final scan
                    y[i] = temp;
            }
            sum = temp;
        }
        // splitting constructor: the new body starts from the identity
        Body( Body& b, oneapi::tbb::split ) : sum(id), y(b.y), z(b.z) {}
        void reverse_join( Body& a ) { sum = a.sum + sum; }
        void assign( Body& b ) { sum = b.sum; }
    };

So for each block, the algorithm first computes the sum of all the elements and accumulates it in sum, starting from the identity for each block. Is this the famous first pass? What happens then? Is assign called to pass the result to the adjacent block? When does the second pass happen, and when is reverse_join called?

  • Note we do not have a book, so you should provide a minimal reproducible example. You can start with this: godbolt.org/z/f43eGWa11 Commented Nov 14 at 11:49
  • 1
    "is assign called to pass the result to the adjacent block?" not in the shown code. "When is reverse_join called?" also not. It appears you are asking for clarification of passages in the book that you did not include in the question. If you include the parts that are unclear in the question maybe someone can explain, but as long as you don't understand what the book tries to say its not sufficient if you paraphrase what it says. Commented Nov 14 at 12:01
  • 1
    One of the well-known ways to do a parallel prefix-sum is indeed to sum chunks (in parallel) and combine (serially) to get known starting-points for each chunk, allowing parallel work in a second pass. That's probably part of what they're doing. Another trick is to use SIMD and/or ILP within each chunk to hide latency, especially of FP addition, which can speed up that second pass for each thread separately, especially for a CPU rather than GPU. SIMD prefix sum on Intel cpu does some cache-blocking within chunks, too. Commented Nov 14 at 12:17
  • 1
    I understand this may appear a vague question, but it's because I am asking about a specific algorithm present in TBB, tbb::parallel_scan and how to use it. There is no code in the book that shows how the algorithm works, just that it takes a blocked_range and a body object as the one I included as inputs. And to me it's unclear what the necessary methods one has to provide actually do. This is very similar to what you get in the book: oneapi-spec.uxlfoundation.org/specifications/oneapi/v1.1-rev-1/… Commented Nov 14 at 12:18
  • Also see: Accumulating a running-total (prefix sum) horizontally across an __m256i vector / parallel prefix (cumulative) sum with SSE (especially see comments on Z Boson's self-answer, about doing multiple SIMD vectors at once) / Prefix Sum Parallel Algorithm Commented Nov 14 at 12:20
