1 The Brief

C/C++ code often needs to be ported to and optimized for multicore platforms. Doing this by hand is an overwhelming task that depends on developer experience and intuition, and as system complexity grows it often leads to project delays and unplanned expenses.

To efficiently tackle this growing problem, Silexica has developed the SLX C/C++ programming technology that helps you Analyze, Optimize and Implement multicore C/C++ applications. This enables you to take full advantage of the available parallelism present in the target platform. SLX C/C++ accepts as input a sequential or threaded C/C++ application and a model of the targeted platform. From this information, it generates parallelization information for the application (details can be found in Section 1.4). Internally, it uses advanced static and dynamic code analyses to detect parallelism patterns that are guaranteed to deliver performance improvements for multicore platforms. The obtained information is presented to the developer in an intuitive manner to facilitate the implementation of a parallel version of the application.

SLX C/C++ consists of three stages: Analyze, Optimize, and Implement.


1.1 The Challenge

To exploit the full potential of a contemporary multicore platform, applications have to be properly parallelized. This first requires a full understanding of source code inter-dependencies and awareness of the computing and communication resources available in the target platform. Developers then have to make complicated decisions about re-architecting the code so that parallelism is effectively exposed. Such decisions are based on experience but rarely supported by actual facts or data, so developers cannot predict the impact of a code modification. Finally, a multicore port of the application is created by hand, normally by adapting it to a parallel programming paradigm or model of computation (MoC) that not only needs to match the application’s domain but also requires advanced programming skills and expertise.

Developing parallel applications is considerably more difficult than writing traditional sequential code, and developers often misjudge the software design or underestimate the software inter-dependencies. Current programming practices for describing parallel applications involve multiple manual steps that are time-consuming and error-prone, and that lead to low software productivity. In general, there is no widely accepted solution available yet and many issues remain open, especially in the embedded domain.


1.2 The SLX C/C++ Solution

SLX C/C++ provides solutions to the challenges mentioned in Section 1.1. It offers unique capabilities that make it the state-of-the-art solution for parallelism extraction:

Silexica offers a unique solution for automatic, high-quality parallelism extraction from C/C++ applications that execute on a given target platform. SLX C/C++ partitions C code by analyzing control and data flow within the original sequential code, exposing as much parallelism as possible, which it reports as parallelization hints and OpenMP pragmas.


1.3 Types of Parallelism

SLX C/C++ supports automatic parallelism detection for four distinct forms of parallelism: Task Level Parallelism (TLP), Data Level Parallelism (DLP), Pipeline Level Parallelism (PLP) and Computation Offloading. Examples of TLP, DLP, PLP, and offloading instances are illustrated in Figure 1.1.

Figure 1.1: Parallelism patterns, where P stands for Process, In for Input and Out for Output. Panels: (a) TLP, (b) DLP, (c) PLP, (d) Offloading.

Task Level Parallelism


In Task Level Parallelism, a computation is divided into multiple processes that operate in parallel on different input data sets, as Figure 1.1a illustrates.
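As a minimal sketch of this pattern, the following C fragment runs two independent computations on two different input buffers as OpenMP sections. The function names `sum_of`, `max_of` and `run_tasks` are hypothetical and chosen only for illustration; this is not output produced by SLX C/C++. If the code is compiled without OpenMP support, the pragmas are ignored and the two tasks simply run one after the other with the same result.

```c
#include <stddef.h>

/* Hypothetical task 1: sum of one input buffer. */
static long sum_of(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Hypothetical task 2: largest element of another input buffer (n >= 1). */
static int max_of(const int *data, size_t n)
{
    int best = data[0];
    for (size_t i = 1; i < n; ++i)
        if (data[i] > best)
            best = data[i];
    return best;
}

/* The two tasks read different inputs and write different outputs,
 * so they may execute concurrently. */
void run_tasks(const int *a, size_t na, const int *b, size_t nb,
               long *sum_out, int *max_out)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        *sum_out = sum_of(a, na);

        #pragma omp section
        *max_out = max_of(b, nb);
    }
}
```

The sections construct fits here because the number of tasks is small and known at compile time; each section corresponds to one process P in Figure 1.1a.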

Data Level Parallelism


Data Level Parallelism is a form of parallelism typically found in scientific and multimedia applications. Here, a given computation is replicated into multiple processes that operate on different input data sets in parallel, as Figure 1.1b shows. The main goal in DLP is to split the iteration space of a loop across multiple workers, which is possible only when there are no loop-carried dependencies.
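The role of loop-carried dependencies can be illustrated with two small loops. The sketch below uses hypothetical function names (`scale`, `prefix_sum`) and a plain OpenMP `parallel for`; it is an illustration of the concept, not the analysis SLX C/C++ performs.

```c
/* No loop-carried dependency: iteration i reads in[i] and writes out[i]
 * only, so the iteration space can be split across workers. */
void scale(float *out, const float *in, float factor, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = factor * in[i];
}

/* Loop-carried dependency: iteration i reads out[i - 1], which is
 * written by the previous iteration. As written, this loop has to
 * stay sequential, so no pragma is applied. */
void prefix_sum(float *out, const float *in, int n)
{
    if (n <= 0)
        return;
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i];
}
```

In `scale`, any assignment of iterations to workers gives the same result. In `prefix_sum`, reordering iterations would read values that have not been computed yet.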

Pipeline Level Parallelism


In Pipeline Level Parallelism, a computation is broken down into a sequence of processes (also called pipeline stages), which are contained within a loop. By doing this, it is possible to execute pipeline stages in parallel, as Figure 1.1c illustrates.
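One way to picture this is a two-stage pipeline in which stage 2 works on item i-1 while stage 1 already works on item i. The following sketch assumes two hypothetical stage functions (`stage1`, `stage2`) and uses a double buffer plus OpenMP sections to overlap them; real pipelines are often implemented differently (for example with queues or task dependencies), so treat this only as an illustration of the pattern.

```c
/* Hypothetical stage 1: square the input sample. */
static int stage1(int x) { return x * x; }

/* Hypothetical stage 2: add a fixed offset to the stage 1 result. */
static int stage2(int y) { return y + 1; }

/* Runs n items through the two stages. In each step, stage 1 fills one
 * buffer slot while stage 2 drains the slot filled in the previous step. */
void run_pipeline(const int *in, int *out, int n)
{
    int buf[2] = {0, 0};   /* double buffer between the two stages */

    for (int step = 0; step <= n; ++step) {
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                if (step < n)
                    buf[step % 2] = stage1(in[step]);
            }
            #pragma omp section
            {
                if (step > 0)
                    out[step - 1] = stage2(buf[(step - 1) % 2]);
            }
        }
    }
}
```

The two sections never touch the same buffer slot within a step, and the implicit barrier at the end of the sections construct keeps consecutive steps ordered.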

Computation Offloading


In computation offloading, a computational task is assigned from a host processor to an accelerator. This requires moving the necessary data from the host memory to the accelerator memory, performing the computation on the accelerator, and transferring any result data back to the memory of the host core. This allows the host core to perform other tasks while the accelerator is performing the computation. This kind of parallelism is shown in Figure 1.1d.
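The data movement described above can be sketched with OpenMP target directives, assuming a compiler that supports OpenMP 4.0 or later. The function name `offload_square` is hypothetical. The `map(to: ...)` clause copies the input to the accelerator memory and `map(from: ...)` copies the result back; when no accelerator is available, OpenMP implementations typically fall back to executing the region on the host.

```c
/* Squares n input values on an accelerator, if one is available.
 * in  is copied host -> accelerator before the loop runs,
 * out is copied accelerator -> host after it finishes. */
void offload_square(const float *in, float *out, int n)
{
    #pragma omp target teams distribute parallel for \
            map(to: in[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}
```

As written, the host waits for the region to finish. Letting the host do other work in the meantime would need an asynchronous variant, which this sketch leaves out.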

While frequently present, the inherent application parallelism is often obscured by the sequential implementation, semantics or coding style. SLX C/C++ exposes this parallelism and performs whole-program analyses to help the user to decide on the best parallel representation of the application for a given target multicore platform.


1.4 Parallelization Methodology

Figure 1.2 shows the main components of the SLX C/C++ methodology. The inputs are a sequential or threaded C/C++ application and a model of the target multicore platform. This model describes relevant details of the platform such as the number of central processing units (CPUs) and digital signal processors (DSPs), execution costs of the instruction set, the memory architecture, and communication costs. A program representation, the Program Model (see footnote 3), is then constructed by combining the outcomes of static and dynamic analyses. The static analysis is based on compile-time information such as the complete control flow, whereas the dynamic analysis is based on runtime information such as a list of executed functions, basic block execution counts and memory accesses involving pointers. The dynamic information is obtained by instrumenting the intermediate representation (IR) generated by the compiler and executing the resulting binary to generate a trace. Finally, the Program Model is analyzed by algorithms that extract different forms of parallelism, and this information is provided in the form of parallelization hints.

Figure 1.2: Overview of the methodology applied by SLX C/C++.
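To give an intuition for what basic block execution counts are, the sketch below instruments a small function by hand at source level. SLX C/C++ instruments the compiler IR rather than the source, so this is only an analogy; the counter array `bb_hits`, the macro `BB_HIT` and the block identifiers are hypothetical.

```c
/* Hypothetical identifiers for the basic blocks of count_negatives(). */
enum { BB_ENTRY, BB_LOOP_BODY, BB_TAKEN_BRANCH, BB_EXIT, BB_COUNT };

/* One execution counter per basic block. */
unsigned long bb_hits[BB_COUNT];

/* Records one execution of the given basic block. */
#define BB_HIT(id) (++bb_hits[(id)])

/* Counts the negative values in data, recording how often each
 * basic block is executed along the way. */
int count_negatives(const int *data, int n)
{
    int negatives = 0;

    BB_HIT(BB_ENTRY);
    for (int i = 0; i < n; ++i) {
        BB_HIT(BB_LOOP_BODY);
        if (data[i] < 0) {
            BB_HIT(BB_TAKEN_BRANCH);
            ++negatives;
        }
    }
    BB_HIT(BB_EXIT);
    return negatives;
}
```

After a run, the counters show how often each block executed for that particular input, which is information a purely static analysis cannot provide.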


1.4.1 Performance Estimation

Performance information is fundamental for parallelization. It is used to identify computationally intensive functions (parallelization candidates), and to perform a cost-benefit analysis to verify the potential speedup of parallelizing a given candidate. Actual performance estimation is based on a Microarchitecture-aware Cost Table model as described in Section 1.4.1.1. This estimation considers the execution count of every statement and a calculated cost for each type of operation in the target platform. The cost of every instruction is provided by the input platform model. This type of estimation has the advantage of providing information at the C statement granularity, which is the minimum granularity considered for parallelization.
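In its simplest form, such an estimate multiplies the execution count of each statement by the summed cost of the operations it performs. The sketch below shows only this basic arithmetic. The operation kinds, the `statement_profile` structure and the function `estimate_cycles` are hypothetical, and the actual estimation engine models far more detail.

```c
#include <stddef.h>

/* Hypothetical operation kinds with per-platform costs. */
enum op_kind { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_KIND_COUNT };

/* Profile of one C statement. */
struct statement_profile {
    unsigned long exec_count;        /* how often the statement executed */
    unsigned ops[OP_KIND_COUNT];     /* operations per execution, by kind */
};

/* Estimated total cost: sum over statements of
 * exec_count * sum over kinds of (ops[kind] * cost_table[kind]). */
unsigned long estimate_cycles(const struct statement_profile *stmts, size_t n,
                              const unsigned cost_table[OP_KIND_COUNT])
{
    unsigned long total = 0;

    for (size_t s = 0; s < n; ++s) {
        unsigned long per_exec = 0;
        for (int k = 0; k < OP_KIND_COUNT; ++k)
            per_exec += (unsigned long)stmts[s].ops[k] * cost_table[k];
        total += stmts[s].exec_count * per_exec;
    }
    return total;
}
```

For example, with assumed costs of 1, 3, 4 and 4 for add, multiply, load and store, a statement executed 100 times with one add, one multiply, two loads and one store contributes 100 × 16 = 1600 cost units.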


1.4.1.1 Microarchitecture-aware Cost Tables

For Microarchitecture-aware Cost Table profiling, SLX C/C++ uses advanced compiler techniques to statically and dynamically analyze the application source code and determine platform-dependent execution costs. The estimation engine uses abstract processor models that specify storage resources and functional units with their associated sets of supported operations. The estimation engine then simulates the effect of the target compiler by applying instruction lowering, selection and scheduling with a high level of accuracy. It considers hardware characteristics such as vector capabilities (SIMD), pipelining effects, addressing modes and predicated execution, among others. It also considers software-related costs such as those imposed by calling conventions, external C library calls and register spilling. Together, these yield accurate estimates while keeping the estimation itself fast.


1.4.2 Parallelism Discovery

To extract all useful parallelism in the application, SLX C/C++ analyzes one parallelization candidate at a time and uses state-of-the-art heuristics to expose the parallelism patterns described in Section 1.3.

Using C/C++ code statements as the minimum granularity for partition extraction provides great flexibility. It also allows a straightforward correlation between the partitions and the original source code, which makes it easy to derive a parallel representation. The output of the parallelism discovery stage is a set of source-level hints that guide the developer in deriving a parallel representation of the application.


1.4.3 Implementing the Parallel Application

After parallelism has been discovered, it is possible to implement a multicore version using different parallel programming APIs. Choosing an API requires considering factors like the API's fitness for a given application domain, the availability of target compilers for the chosen API, and OS and runtime compatibility. The current SLX.cloud release provides support for OpenMP (see footnote 1). This guides the user during the implementation process, as follows:

1 Open Multi-Processing, see http://www.openmp.org

3 The Program Model describes the application in terms of performance information and the control and data dependency relationships between the C statements in the computationally intensive functions.