Introduction to Parallel Programming


Here, you’ll find my journey into parallel programming. After postponing it for so long, I finally got my hands dirty and started learning parallel programming using Intel’s Threading Building Blocks (TBB).

Here, you’ll find my notes on TBB and parallel programming in general. You can refer to the references I share if you need more detail on some parts. I keep these posts mostly for my own future reference, but I hope they help you, too.

I. Some Basics

To support parallelization, we spread the workload across multiple processors. These processors can be in the same computer as a multi-core CPU (in which case we use shared memory based parallel programming), or we can use multiple computers (as in clusters and supercomputers) that communicate via message passing, since in that case there is no shared memory.

If you are a game developer, most of the time you achieve multi-threading through shared memory based parallel programming models. These models fall mainly into three types:

* Threading models: These models are built on thread libraries that expose low-level routines. POSIX Threads (Pthreads), Windows Threads, and Boost Threads are examples of such models. They are the assembly language of parallel programming: they give you high performance and flexibility, but programming effort, debugging time, and maintenance costs are also high. They require you to establish the communication and synchronization between threads on your own, using mutual exclusion locks, condition variables, and so on.

* Directive based models: These models are based on high-level compiler directives. They extend the threading models and take care of low-level tasks, such as synchronization and communication among threads, partitioning, and worker management. OpenMP is a well-known and widely used example of these models. Directive based models make parallel programming easier, since the developer does not need to focus on issues such as deadlocks, data races, and false sharing.

* Tasking models: Unlike the other two, tasking models have developers specify tasks instead of threads. TBB and Cilk++ are examples of these models. Tasks are lightweight and have a shorter life span than threads, so starting and terminating a task is faster than doing the same for a thread. These models are also implemented on the client side of the code, at the application level.

One important point when parallelizing your code: creating lots of threads is not good for performance. If you have too many threads, partitioning gives each thread too little work, and the overhead of starting and terminating threads comes back to bite you.

II. Why Threading Building Blocks (TBB)

To make this decision, I referred to A comparison of five parallel programming models for C++ [Ajkunic et al. 2012] and Intel Threading Building Blocks by James Reinders (O’Reilly).

Ajkunic et al. present a comparison between TBB, OpenMPI, Cilk++, OpenMP, and Pthreads by implementing matrix multiplication with each parallel programming package. They compare the packages based on performance (program elapsed time and speedup) and coding effort (lines of code added or changed relative to the naïve single-threaded implementation, and total number of lines).

According to their results, Cilk++ and TBB performed much better than OpenMP in both elapsed time and speedup. Ajkunic et al. explain that this may be because matrix multiplication suits task-parallel approaches better, whereas OpenMP is more suited to data-parallel problems. When it comes to coding effort, OpenMP and Cilk++ performed better, while OpenMPI and Pthreads required more code modification than the other models.

All in all, choosing the best model depends on your problem and development environment. Threading models such as Pthreads introduce much more complexity but also offer high performance. Models such as OpenMP and TBB make things a lot easier for developers, since they hide low-level management and implementation details. While TBB requires more coding than OpenMP, it also provides more control and can be used for both task-parallel and data-parallel problems with high performance. OpenMPI requires the most coding effort and expertise to implement effectively; however, it can be used on both shared memory systems and distributed systems.

I preferred TBB because, first of all, it introduces a high-level, task-based approach to parallelization. It makes scalable application development a lot easier: it handles burdens such as synchronization, load balancing, and cache optimization for you automatically. You specify the tasks, and TBB maps them onto threads efficiently. Additionally, using a task-based approach and avoiding raw threads makes the code easier to write and understand, and it also gives you better portability, performance, and scalability. Not being forced to handle low-level management (when it isn’t necessary) is always better: you can focus on your project and leave the rest to the responsible systems and libraries. That’s also why we prefer to code in high-level languages like C++, C#, Java, etc. instead of Assembly, right?

Additionally, TBB uses standard C++ (ISO C++) and C++ templates. It does not require any special language or compiler, and it supports different compilers, different processors, and different operating systems. Also, adding TBB to your project is straightforward, and it is compatible with other threading packages, so you can use TBB together with other packages and libraries. Finally, you can build larger parallel components from smaller parallel components with TBB; it supports nested parallelism as well.