TBB Basics – M. Cihan ÖZER

Here, we will talk about all the key concepts of TBB to get familiar and start building complex structures in the future.

Initialization and Termination of TBB

Initialization and termination of TBB are related with initialization and termination of the task scheduler. Task scheduler is used to run the algorithm templates and task groups. Your goal when you use TBB should be creating more tasks than the threads, so the task scheduler can choose the tasks that will be mapped onto physical threads. The task scheduler uses non-preemptive scheduling (each task is assigned a certain amount of time/cycle and when this time/cycle is done, it is changed with the next task) for this mapping. Thanks to the task scheduler, computationally intensive work is parallelized.

If you are using TBB 2.2 or later, task_scheduler_init is automatically called and initialize the task scheduler. However, you can still explicitly initialize the task scheduler by calling task_scheduler_init for controlling when the task scheduler will be constructed and destructed, for specifying the number of threads that will be used by the task scheduler (You should set this number only when you scale studies during development, not for the production code), or for specifying the stack size for the worker threads. Additionally, if you use TBB along with other threading packages, then you may need to initialize the task scheduler on your own.

Initialization and termination of the task scheduler is an expensive operation. So, putting your task_scheduler_init in your main routine or when a thread is born would be better than creating a scheduler whenever you use a parallel algorithm template.

Base TBB Objects

Before we start talking about parallel algorithm templates, let’s talk about base the TBB objects, such as Body, Range, and so on.

Splittable Concept

Splittable is a concept for the types planned to be split into two pieces (the instance and a newly constructed object). Splittable types require a special type of constructor, SplitableType::SplitableType(SplitableType& st, split). Without this constructor, the type is not splittable.

Splitting constructor takes a reference to the original object and TBB’s split argument as the parameters. This split argument is actually a dummy argument and thanks to it, a splitting constructor can be distinguished from a copy constructor.

Splitting constructors are used for partitioning a range into two subranges that can run parallel (as blocked_range and blocked_range2d), or for forking a function object (a Body) into two function objects (Bodies) that can run parallel (as parallel_scan, parallel_reduce, and so on).

Range Concept

Range is a Split based concept that represents a set of values that can be divided recursively. The Ranges divided into nearly equal parts yield the best parallelism.

You can implement the Range concept as a structure or class by following some certain rules and adding some certain methods giving in the listing below (taken from Intel Range Concept page):

Range Concept
Pseudo-Signature	Semantics
`R::R( const R& )`	Copy constructor.
`R::~R()`	Destructor.
`bool R::empty() const`	True if range is empty.
`bool R::is_divisible() const`	True if range can be partitioned into two subranges.
`R::R( R& r, split )`	Basic splitting constructor. Splits `r` into two subranges.
`R::R( R& r, proportional_split proportion )`	Optional. Proportional splitting constructor. Splits `r` into two subranges in accordance with `proportion`.
`static const bool R::is_splittable_in_proportion`	Optional. If true, the proportional splitting constructor is defined for the range and may be used by parallel algorithms.

Splitting constructor is used to divide a Range into two parts recursively. Splitting constructor can be basic or proportional.

Basic splitting constructor splits the values as evenly as possible. Proportional splitting constructor is optional. It splits the range into two subranges with respect to the given proportion. If you will use proportional splitting constructor, you should also set static const bool R::is_splittable_in_proportion variable as true, so the parallel algorithms could use it. Finally, for the best results, it had better to follow the given proportion with rounding the proportion the nearest integer.

A Range should be recursively split until the parts are efficient enough to be executed serially rather than splitting further. Thus, the type representing a Range should provide a way to control splitting (via bool R::is_divisible() const), e.g., TBB’s blocked_range, which represents a Range type, has a parameter named grainsize specifying the biggest indivisible range.

Range is the backbone of the template algorithms, such as parallel_for, parallel_reduce, and parallel_scan. Since a copy constructor comes with the Range declaration, the default constructor will not be automatically generated. You will need to explicitly define the default or any other constructors if needed.

Partitioner Concept

Partitioner is a TBB type that decides whether a Range should be split more or operated on by a task Body. User defined types should include the following requirements for implementing a Partitioner properly (taken from Intel Threading Building Blocks by James Reinders) :

Pseudo-Signature	Semantics
`P::P( P& p, split )`	Split p into two partitioners.
`P::~P()`	Destructor.
`template <typename Range> bool P::should_execute_range( const Range &r, const task &t)`	True if r should be passed to the body of t. False if r should instead be split.

A partitioner is given to a TBB loop template (parallel_for, parallel_reduce, parallel_scan) as an optional parameter and it defines the strategy for executing the loop. The default behavior for the loop templates is to keep splitting a Range, not necessarily finely, for keeping the processors busy.

Partitioner has the rules used for deciding whether the given Range should be operated on as a whole by the task’s Body or subdivided more. The template loops operate by splitting the Range objects to subranges until they are not divisible anymore. Partitioner objects take the control at this point and introduce some rules for splitting and execution.

Partitioner object’s decision making process is handled by the splitting constructor and should_execute_range() method. When a template loop algorithm needs to decide whether splitting a Range object, associated Partitioner object’s should_execute_range() is called and depending on the result, the current task is applied over the entire range.

Each Range object is associated with a Partitioner object within the any parallel algorithms. Whenever a Range object’s splitting constructor is used, the associated Partitioner object’s splitting constructor also splits the Partitioner object to create two matching Partitioner objects.

4 partition options are offered by TBB:

const auto_partitioner& is the default Partitioner of TBB. It performs sufficient splitting to balance load. This splitting is not necessary as finely as Range::is_divisible() permits. If you use the default partitioner with classes like blocked_range, then the appropriate grain size selection is less important. Acceptable performance is achieved by the default grain size (1) frequently. When it is used with a loop template, range splitting is tried to be minimized, and work-stealing is tried to be increased. Range subdivision is limited to the number of the threads defined by the task scheduler unless a subrange is stolen by an idle thread. Then, it is subdivided to create additional subranges to balance load.

When auto_partitioner is used with blocked_range for a parallel loop, the grain size does not provide an upper bound. simple_partitioner should be used if an upper bound is required.

affinity_partitioner& is similar to the default partitioner, but improves cache affinity by mapping subranges to worker threads. As a result, if a loop is re-executed over the same data set that can fit into the cache, the performance can significantly improve. affinity_partitioner uses proportional splitting when it is enabled.

For affinity_partitioner, it is important to pass the same partitioner object to the loop templates for the optimization. affinity_partitioner improves the performance when:

Computation includes a few operations per data access.
The data acted upon by the loop fits in the cache
The loop is executed over the same data
More than two hardware threads are available

const static_partitioner& distributes range iteration among worker threads uniformly and does not let any possible future work balancing. Subranges are mapped to worker threads similarly to affinity_partitioner. The work distribution and mapping are deterministic and based on the number of range iterations, its grain size, and the number of threads.

When static_partitioner is used, the parallel algorithm distributes the work uniformly across the threads and does not perform additional load balancing. The range is distributed among the threads as subranges that are equal or approximately equal size. The number of subranges is equal to the number of possible threads defined by the task scheduler. These subranges are not split later.

static_partitioner is well suitable when the work is well-balanced and reduces the overhead of the parallel algorithm. If the work is imbalanced, then some performance loss may happen. Thus, it had better to use static_partitioner when parallelizing small, well-balanced workloads that additional load balancing may bring overhead instead of performance improvements.

const simple_partitioner&(default for parallel_deterministic_reduce) This partitioner partitions the Range until it is no longer splittable. Range::is_divisible() is used for this task. Picking the appropriate grain size is important, especially when this partitioner is used with classes like blocked_range.

simple_partitioner splits the range until it cannot be split further (until subrange.is_divisible() returns false). The default grain size is 1, which may make the subranges much too small for efficient execution.

TBB blocked_range

blocked_range is a template class that represents half-open intervals ( [i, j) ) recursively split. blocked_range models the Range concept and blocked_range can be tracked as Split -> Range -> blocked_range in the hierarchy.

The grain size is important for blocked_range type and directly affects how a blocked_ranged is split. If the size of a blocked_range‘s subrange exceeds the grain size, then the blocked_range is split into two. Grain size directly affects the efficiency. A too small grain size may cause scheduling overhead and decrease the efficiency (imagine that you try to parallel hundreds of ranges including a single element). Likewise, a too large grain size may limit the parallelism (imagine that your grain size is so large at a point, you divide your range including millions of data only into two separate subranges).

Finding the correct grain size is not straight forward, it is mostly based on heuristic, but there are some helpful steps that you can use for finding your grain size:

Set your grain size as 10K
Run your algorithm on one processor
Start cutting the grain size parameter in half and see how much the algorithm slows down as the value decreases.
A slowdown of about 5-10% is a good setting for most purposes

TBB blocked_range2d

blocked_range2d also represent a recursively divisible half-open interval, but in 2D range [i0, j0)×[i1, j1). Each axis of the range has its own grain size and blocked_range2d is divisible if either axis is divisible.

On TBB blocked_range2d page, you can see a matrix multiplication example.

parallel_for<Range, Body> Template Function

parallel_for is one of the key template functions of TBB that performs parallel iterations over a range of values. parallel_for breaks the iteration space into chunks and runs each chunk on a separate thread. You can consider it as the parallel version of the classical for loop that runs from i = first to last (last is not included; i < last).

parallel_for has two signatures: void parallel_for( const Range& range, const Body& body ); and void parallel_for( const Range& range, const Body& body, Partitioner &partitioner );

parallel_for takes a function f(t) or a Body. Body is a class containing the operation that will be applied to the each element of the given Range. Body class should provide the requirements given in the table below (taken from Intel’s parallel_for Template Function page):

Requirements for parallel_for Body
Pseudo-Signature	Semantics
`Body::Body( const Body& )`	Copy constructor.
`Body::~Body()`	Destructor.
`void Body::operator()( Range& range ) const`	Apply body to range.

Body::operator() includes the main part of the operation, e.g., it contains the operation performing in the classical for, such as summing the all elements in the array.

parallel_for first splits the given Range into subranges until Range::is_divisible() method returns false. Body is copied for each subrange. Then, Body::operator() is called for each Body-subrange pairs.

parallel_for executes iterations in non-deterministic order when worker threads are available. Otherwise, it executes the iterations from left to right. When it comes to complexity, if the range splits into nearly equal pieces and both Range and Body takes O(1) space, then space complexity is O( Plog(N) ) where N is the size of the Range and P is the number of threads.

parallel_reduce<Range, Body> Template Function

parallel_reduce is used for reduction operations, such as summation, finding min or max, applying logical AND operation, matrix multiplication and so on.

There are some key differences between parallel_for and parallel_reduce. First of all, unlike parallel_for, parallel_reduce requires a splitting constructor for the Body object. Additionally, accumulation and combination of the operation performed should be handled. So, Body object that will be used needs to define a join() method (void join( const BodyObjectType& obj); ). Thus, operator() is not const as in parallel_for.

parallel_reduce invokes the splitting constructor to create subtasks for the worker threads. When the subtask finishes, join() method of the Body object is called to accumulate the result. It is important not to discard the previous operation results for getting proper results.

Since the splitting constructor may run concurrently with the reduction operations, all actions of the splitting constructor must be thread safe, such as shared components like reference counter. However, if the worker threads are unavailable, this split/join operation is not performed.

Because of rounding issues, parallel reduction operations may yield a different answer from a serial version. Partitioners and grain size rules are the same for parallel_reduce as parallel_for. parallel_reduce‘s complexity is O(Plog(N)), where N is the size of the range and P is the number of threads if the range and body take O(1) space, and the range splits into nearly equal pieces.

There are two rules to decide which template function we should use; whether the operation is commutative and associative, and the cost of the reduction type’s construction, i.e., construction of floats is cheap whereas construction of a float matrix is expensive.

If the object construction is inexpensive, then tbb::parallel_reduce should be used even if the reduction operation is not commutative (parallel_reduce works even the reduction operation is not commutative).

If the operation is commutative and the construction is expensive, then tbb::parallel_for and tbb::combine should be used.

If the operation is not associative but a deterministic result is required, then recursive reduction should be used and should be parallelized by using tbb::parallel_invoke.

parallel_scan<Range, Body> Template Function

parallel_scan is the template function that is used to compute parallel prefix. This leads us to a question: What is parallel prefix?

Without diving into the details, prefix sum (also called as scan, cumulative sum, or inclusive scan) takes a sequence of input (x0, x1, x2, … , xn) and produces a sequence of output (y0, y1, y2, … , yn) by using the formula yi = yi-1 + xi, where y0 = x0.

Let’s open the formula for making the things a bit more clear:

y0 = x0
y1 = x0 + x1 = y0 + x1
y2 = x0 + x1 + x2 = y1 + x2
y3 = x0 + x1 + x2 + x3 = y2 + x3
…

So, if we are given an ordered set A{a0, a1, a2, … an-} of n elements and a binary associative operator ⊕, and if this is a prefix sum operation, then we we get {a0, (a0 ⊕ a1), … , (a0 ⊕ a1 ⊕ … ⊕ an-1)}.

For example, if ⊕ is + (regular summation) and the input set is {5, 3, -6, 2, 7, 10, -2, 8}, then the output would be {5, 8, 2, 4, 11, 21, 19, 27}.

Prefix sum has many applications, such as evaluation of polynomials, sorting (radix sort, quicksort, etc.), solving linear recurrences, string comparison, lexical analysis, generating random numbers, n-body problems, tree operations, histograms, and so on.

parallel_scan template function is used for this task in TBB. In real life, it is a bit unlikely that you will face with a such situation and use parallel_scan since this computation is an advanced concept, but if you face with serial dependencies, parallel_scan can be very useful.

There are two options to call parallel_scan. The first options takes a Range, a Body, and an optional Partitioner (void parallel_scan( const Range& range, Body& body [, partitioner] );).

Here, Body, requires void Body::reverse_join( Body& b ) and void Body::assign( Body& b ) methods additionally. reverse_join() merges summary accumulated by Body b into summary accumulated by this, where this was created earlier from Body b by splitting constructor. assign() method assigns summary of Body b to this. summary here contains enough information to compute the prefix operation, such as yi-1.

The other syntax for calling parallel_scan is Value parallel_scan( const Range& range, const Value& identity, const Scan& scan, const Combine& combine [, partitioner] );

The second option is designed to use with functors and lambda expressions. Scan and Combine objects should provide the requirements given in the table below (taken from TBB parallel_scan template function page):

Requirements for Scan and Combine
Pseudo-Signature	Semantics
`Value identity`	Left identity element (y0 = LeftIdentityElement ⊕ a0, in our case) for `Scan::operator()`.
`Value Scan::operator()(const Range& r, const Value& sum, bool is_final) const`	Starting with `sum`, compute the summary and, for `is_final == true`, the scan result for range `r`. Return the computed summary.
`Value Combine::operator()(const Value& left, const Value& right) const`	Combine summaries `left` and `right`, and return the result.

Since parallel_scan has two passes, the system having only two hardware threads can provide only a small speedup. The systems having more than two cores would benefit more from parallel_scan.

parallel_do

Sometimes, we cannot know the end of the iteration space in advance, or the loop body may add more iterations to process before the loop exits, i.e., a linked list (even if it’s usually better to use dynamic arrays instead of linked lists in parallel programming, since linked list items are accessed essentially serially). These cases are handled by TBB’s parallel_do. parallel_do is the replacement of parallel_while of TBB.

We can call parallel_do as void parallel_do( InputIterator first, InputIterator last, Body body[, task_group_context& group] ), or void parallel_do( Container c, Body body[, task_group_context& group] ).

The first form (the sequence form) applies the function object, which is defined by Body, over a sequence specified by first and last InputIterators (in [first, last) manner). If additional work items are needed to be added, tbb:parallel_do_feeder<item> is used.

The second form (the container form) is equivalent to parallel_do(std::begin(c), std::end(c), Body) call.

You can see the requirements for a parallel_do Body (taken from TBB parallel_do reference page):

Requirements for parallel_do Body and its argument type T
Pseudo-Signature	Semantics
Body::operator()( cv-qualifiers_opt T reference_opt item, parallel_do_feeder<T>& feeder ) const OR Body::operator()( cv-qualifiers_opt T reference_opt item ) const	Process a work item. `parallel_do` may concurrently invoke `operator()` for the same body object but different items. The operator may accept `item` by value or by reference, including rvalue reference. The signature with `feeder` permits additional work items to be added. CAUTION Defining both the one-argument and two-argument forms of `operator()` is not permitted.
`T( const T& )`	Optional since C++11. Copy a work item.
`T( T&& )`	Supported since C++11; optional. Move a work item.
`~T::T()`	Destroy a work item.

Constructor of Body should include either a copy constructor, a move constructor, or both. If the Body is designed as not copy constructible, then Body::operator() should accept its argument by value, but if the InputIterator does not satisfy forward iterator requirements, then InputIterator deferencing must provide an rvalue reference. Also, additional work items should be passed to parallel_do_feeder as rvalues.

The algorithm can be passed a task_group_context object, which represents a group of tasks that can be cancelled, or have their priority level set, together, to execute its tasks in this group. By default, the algorithm is executed in a bound group of its own.

parallel_do may not always increase the performance. For example, if the task that will be executed in Body::operator() takes less than 100K clock cycles, then the overhead of parallel_do may decrease the efficiency.

Additionally, if the input items from the input stream do not have random access, then parallel_do is not scalable. In this case, you should either use random access iterators to specify the input stream, consider to use parallel_for, or redesign your algorithm in a way that the Body often adds more than one piece of work ( tbb:parallel_do_feeder<item> becomes very handy for this task).

References

CAUTION