There are some key points to developing a good parallel program that scales easily and safely. These points matter regardless of whether you use a high-level threading library, such as TBB, or a raw threading solution, such as PThreads.

Parallel programming starts with thinking parallel. It is better to approach the problem in a parallel manner from the start, rather than turning a single-CPU-based design and its code into a parallel one.

There are some key concepts for thinking, designing, and coding in parallel:


Decomposition is about breaking your problem into concurrent tasks, meaning tasks that can run in parallel (at the same time). This is the first step in designing a parallel program.

We can decompose a problem either in terms of data (pieces of data on which we can operate concurrently) or in terms of tasks (operations that we can run concurrently).

Data parallelism is about applying the same transformation (operation) to each piece of data. The data varies, but the operation on each piece is always the same, e.g., capitalizing all the characters in a string (we apply the same capitalization operation concurrently to each character of the string). This is the simplest case you can face in parallel programming.
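As a sketch of the idea (using plain std::thread rather than TBB, so the example stays self-contained; capitalize_parallel and the default thread count are illustrative choices, not part of any API), each thread applies the same capitalization operation to its own chunk of the string:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <thread>
#include <vector>

// Capitalize every character of `s`, splitting the work across
// `nthreads` threads. Each thread applies the same operation to a
// disjoint chunk of the data: classic data parallelism.
std::string capitalize_parallel(std::string s, unsigned nthreads = 4) {
    std::vector<std::thread> workers;
    std::size_t chunk = (s.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        if (begin >= s.size()) break;
        std::size_t end = std::min(begin + chunk, s.size());
        workers.emplace_back([&s, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                s[i] = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(s[i])));
        });
    }
    for (auto& w : workers) w.join();
    return s;
}
```

With TBB, tbb::parallel_for would take over the chunking and thread management shown here.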

Task parallelism is about applying different operations to the same data set concurrently, e.g., computing the average, standard deviation, and geometric mean of the same set of numbers.
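A minimal sketch of task parallelism, using std::async to run the three operations from the example concurrently on the same data set (the function and struct names are illustrative):

```cpp
#include <cmath>
#include <future>
#include <numeric>
#include <vector>

// Three different operations, all applied to the same data set.
double average(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}
double std_deviation(const std::vector<double>& v) {
    double mean = average(v), sq = 0.0;
    for (double x : v) sq += (x - mean) * (x - mean);
    return std::sqrt(sq / v.size());
}
double geometric_mean(const std::vector<double>& v) {
    double logsum = 0.0;
    for (double x : v) logsum += std::log(x);
    return std::exp(logsum / v.size());
}

struct Stats { double avg, sd, gm; };

// Launch each operation as an independent concurrent task.
Stats compute_stats(const std::vector<double>& v) {
    auto a = std::async(std::launch::async, average, std::cref(v));
    auto s = std::async(std::launch::async, std_deviation, std::cref(v));
    auto g = std::async(std::launch::async, geometric_mean, std::cref(v));
    return {a.get(), s.get(), g.get()};
}
```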

Pure task parallelism is hard to find. Usually, it is found together with data parallelism. We call this special case (task and data parallelism together) pipelining. In this case, many independent tasks are applied to a stream of data, as in an automobile assembly line. Different items can pass through different stages concurrently, and pipelining gives you extra flexibility, since you can reroute data or skip steps in special cases. The stream of data flows through the pipeline, and a small amount of work is done at each step.
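Here is a hedged sketch of a two-stage pipeline built from std::thread and a small mutex-protected queue (Channel, run_pipeline, and the stage operations are invented for illustration; in TBB you would use tbb::parallel_pipeline instead). Different items can be in different stages at the same time:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// A minimal single-producer/single-consumer channel between stages.
class Channel {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_one();
    }
    std::optional<int> pop() {  // empty optional => stream finished
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        int v = q_.front(); q_.pop();
        return v;
    }
};

// Stage 1 doubles each item, stage 2 adds one. While stage 2 works on
// one item, stage 1 can already process the next, like an assembly line.
std::vector<int> run_pipeline(const std::vector<int>& input) {
    Channel ch;
    std::vector<int> out;
    std::thread stage1([&] {
        for (int x : input) ch.push(x * 2);
        ch.close();
    });
    std::thread stage2([&] {
        while (auto v = ch.pop()) out.push_back(*v + 1);
    });
    stage1.join();
    stage2.join();
    return out;
}
```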

You can combine pipelining and data parallelism to get coarse-grained parallelism, where each individual worker (you can think of these “individuals” as CPU cores) applies the same pipeline of tasks to a different piece of data concurrently.

You can also divide and group your “individuals” to get a hybrid of data and task parallelism. In this case, for example, one “individual” is assigned to tasks A and B, another is assigned to task C, and three “individuals” are assigned to a heavy task, task D.

As you can see, to achieve parallelism, you need to:

  • Balance the workload (assigning and, if necessary, moving the “individuals” around)
  • Split a task across more “individuals” (if necessary)

Thus, you need to define tasks and data at a level where you can split or combine them, so you can match the work to the available resources.

TBB works in a similar way, but makes things easier for you. TBB requires you to specify tasks and how to split them. For example, if you face a data-parallel problem, you define a single task to which you hand all the data. TBB then splits this task up automatically to use the available hardware parallelism.
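To illustrate the splitting idea (a sketch of the concept, not TBB’s actual implementation), here is recursive range splitting with std::async: the range is divided until chunks fall below an arbitrary grain size, and the halves run in parallel:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Recursively split [begin, end) until a chunk is at most `grain`
// elements, then sum the halves in parallel. This mirrors how a
// data-parallel task can be subdivided to match the available cores.
long long parallel_sum(const std::vector<int>& v,
                       std::size_t begin, std::size_t end,
                       std::size_t grain = 1024) {
    if (end - begin <= grain)
        return std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
    std::size_t mid = begin + (end - begin) / 2;
    auto left = std::async(std::launch::async, parallel_sum,
                           std::cref(v), begin, mid, grain);
    long long right = parallel_sum(v, mid, end, grain);
    return left.get() + right;
}
```

In TBB, tbb::parallel_for with a blocked range performs this kind of subdivision for you.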


Another important concept we should be aware of is scalability. Scalability is about structuring the problem so it scales well; it is a measure of how much your program speeds up when you add processor cores. We expect a program to run faster on a two-core machine than on a single-core one; likewise, it should run faster on four cores than on two, and so on.

However, there is a point beyond which adding more processor cores does not speed the program up any further; the program stops scaling. In fact, if we reach this point and force our program to use more cores, performance actually falls, because distribution and synchronization overhead start to dominate.

If we use lower-level mechanisms, such as raw threads, we need to handle this situation ourselves and make sure we split our data to a reasonable granularity. TBB, on the other hand, handles these problems internally.

Amdahl’s and Gustafson’s laws help us estimate how much a program can speed up.

Amdahl’s Law tells us that a program can run no faster than the parts we cannot make run in parallel (the serial portions). Since we cannot parallelize those parts, adding more and more processors loses its significance after some point. Suppose our program has 5 segments, each taking 100 seconds, for 500 seconds in total, and assume 3 of these segments are serial. We can add an infinite number of processors and drive the 2 parallel segments down toward 0 seconds, but since we can do nothing about the serial portions, the program cannot run faster than 300 seconds. So we can speed up the program by at most 500/300 ≈ 1.66X.
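The limit can be computed directly from Amdahl’s formula; with the example’s serial fraction of 300/500 = 0.6, the speedup can never exceed 1/0.6 ≈ 1.67X:

```cpp
#include <cmath>

// Amdahl's Law: with serial fraction s and N processors, the speedup
// is 1 / (s + (1 - s) / N). As N grows, it approaches 1 / s.
double amdahl_speedup(double serial_fraction, double processors) {
    return 1.0 / (serial_fraction +
                  (1.0 - serial_fraction) / processors);
}
```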

Gustafson takes a different approach than Amdahl and considers the workload. He suggests that as computers become more powerful, they handle larger workloads, so there is no reason to focus on a fixed workload. Take our previous case with 5 segments, 3 of them serial, and assume each segment handles 100 units of work, so on a single processor we handle 500 units of work in a given amount of time. With N processors, each of the 2 parallel segments can process N times as much data in the same time, so we handle 2 × N × 100 + 300 units of work. With 2 processors, that is 700 units (a scaled speedup of 700/500 = 1.4X); with 4 processors, it is 1100 units (2.2X). The scaled speedup therefore grows as O(N).
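The same arithmetic as a small helper (the constants are the ones from the example above: 3 serial segments and 2 parallel segments of 100 work units each):

```cpp
#include <cmath>

// Gustafson's scaled speedup for the example: with N processors the
// work done in the same time is 2 * N * 100 + 300 units, against a
// single-processor baseline of 500 units, so it grows linearly in N.
double gustafson_scaled_speedup(double processors) {
    double scaled_work = 2.0 * processors * 100.0 + 300.0;
    return scaled_work / 500.0;
}
```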

Both Amdahl’s and Gustafson’s laws are correct. The difference is whether you want to make an existing program run faster on the same workload or on a larger one. In practice, making today’s applications run faster by switching to a parallel algorithm without expanding the problem is harder than making them run faster on a larger problem.


In modern operating systems, processes (programs) are subdivided into multiple threads.

A process usually starts as a single thread and requests (creates) more threads. Threads can be used to logically break a process down into multiple tasks, such as the UI and the main part of the program.

Threads run independently (a similarity with processes), and a thread is usually not aware of the other threads. The key difference between threads and processes is that the threads inside the same process share all of the process’s data. So, setting a variable that another thread will read can be handled by a simple memory access.

This memory sharing has an obvious benefit: threads can share work quickly. Still, each thread has its own stack and its own instruction pointer, a register pointing to the place in the program where that thread is currently executing.
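A minimal illustration of this sharing: a value stored by one thread is visible to the rest of the process through an ordinary memory access (std::atomic is used here so the shared access is well-defined; the function name is illustrative):

```cpp
#include <atomic>
#include <thread>

// Threads in the same process share the process's memory: a value set
// by one thread can be read by another with a plain memory access.
int set_in_another_thread() {
    std::atomic<int> shared{0};
    std::thread t([&] { shared.store(42); });
    t.join();  // after join, the writer's store is guaranteed visible
    return shared.load();
}
```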

There are important points to consider when you use threads, such as assigning tasks/threads so that you keep all the cores busy, managing threads (handling threads that finish their tasks, creating threads via a thread pool or on demand, etc.), deciding how many threads to run, and so on.

These points are important for multithreading, but they are not really part of your own program, so ideally you should not have to bother with them, just as we do not bother ourselves with memory alignment, memory layout, stack pointers, and so on. These concerns should be abstracted away for multithreading too, just as libraries, compilers, and high-level languages abstract away such low-level details.

This is where TBB comes in handy: it takes care of all the thread management, so you can focus purely on your own program.


Multithreaded programs are nondeterministic: they may follow a different execution path each time, even if you run them with the same input. So it is our task, as developers, to keep the program correct, even though TBB gives some support.

In general, there are some common errors, such as round-off errors (parallel programs are affected by round-off error more, since operations may execute in a different order), but the main sources of correctness problems are race conditions (different threads reaching the same data at the same time) and deadlock (threads waiting on each other’s locked resources).
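For example, concurrent increments of a shared counter are the classic race condition; guarding the counter with a mutex (or making it std::atomic) keeps every increment (the function name and parameters are illustrative):

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Several threads increment a shared counter. Without the mutex, the
// unsynchronized read-modify-write would be a race condition and some
// increments would be lost; with it, every increment is preserved.
long long locked_count(int nthreads, int iters) {
    long long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> lk(m);
                ++counter;
            }
        });
    for (auto& w : workers) w.join();
    return counter;
}
```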

All in all, you need to know 5 terms/concepts to be able to think and code in parallel:

  • Decomposition
  • Scaling
  • Correctness
  • Abstraction (abstracting the low-level details away and focusing on the real problem, the program itself, e.g., by having your own thread-management library, using TBB, and so on)
  • Using patterns to apply your solutions (applying the first 4 concepts in a design process: finding concurrency, deciding on algorithm structures, deciding on supporting structures, and implementation)