C/C++ code often needs to be ported to and optimized for multicore platforms. This is an overwhelming manual task that depends on developer experience and intuition, often leading to project delays and unplanned expenses for increasing system complexities.
To tackle this growing problem efficiently, Silexica has developed the SLX C/C++ programming technology, which helps you Analyze, Optimize and Implement multicore C/C++ applications. This enables you to take full advantage of the parallelism opportunities offered by the target platform. SLX C/C++ accepts as input a sequential or threaded C/C++ application and a model of the target platform. From this information, it generates parallelization information for the application (details can be found in Section 1.4). Internally, it uses advanced static and dynamic code analyses to detect parallelism patterns that are expected to deliver performance improvements on multicore platforms. The obtained information is presented to the developer in an intuitive manner to facilitate the implementation of a parallel version of the application.
SLX C/C++ consists of three stages: Analyze, Optimize, and Implement.
• During the Analyze stage, the application source code is thoroughly analyzed using compiler technology, which identifies underlying code traits, such as data dependencies, that help expose or prevent parallelism. Extensive information that improves the user’s understanding of the application, such as dynamic call and shared-variable graphs, is also generated and visualized.
• During the Optimize stage, the characteristics of a given target platform are taken into account to automatically select portions of sequential code in the application that could be executed efficiently in parallel. Cross-target performance estimation and inter-processor communication costs, among other factors, are considered in this stage.
• Finally, during the Implement stage, detected parallelism is extracted in the form of parallelization hints that guide users in rethinking the application as a parallel specification, as well as in creating a parallel implementation suited to the platform resources. Hints are generic coding suggestions that indicate, for example, the source-code line boundaries of a newly discovered parallel task by pointing at the original user source code. Furthermore, additional information and code generation are available for some supported parallel programming paradigms. For the OpenMP1 parallel programming API, hints are automatically translated into a fully parallel application by means of integrated source-to-source code generation. The generated code can then be fed to a regular OpenMP compiler.
To exploit the full potential of a contemporary multicore platform, applications have to be properly parallelized. This first requires a full understanding of source-code inter-dependencies and awareness of the computing and communication resources available in the target platform. Then, developers have to make complicated decisions aimed at re-architecting the code so that parallelism is effectively exposed. Such decisions are based on experience but rarely supported by actual facts or data, so developers cannot predict the impact of a code modification. Finally, a multicore port of the application is created by hand, normally by adapting it to a parallel programming paradigm or model of computation (MoC) that not only needs to match the application’s domain but also requires advanced programming skills and expertise.
Developing parallel applications is considerably more difficult than writing traditional sequential code, and developers often end up misjudging the software design or underestimating the software inter-dependencies. Current programming practices for describing parallel applications involve multiple manual steps that are time-consuming, error-prone, and lead to low software productivity. In general, no widely accepted solution is available yet and many issues remain open, especially in the embedded domain:
• It is difficult to identify beneficial parallelism patterns.
• Frameworks often fail to extract parallelism in languages like C, which allow the use of pointers.
• Parallelizing frameworks do not normally take into account the characteristics and resources of the targeted embedded platform.
SLX C/C++ provides solutions to the challenges mentioned in Section 1.1. It offers unique capabilities that make it the state-of-the-art solution for parallelism extraction:
• It identifies beneficial patterns of parallelism such as Pipeline Level, Data Level, Task Level, and Computation Offloading Parallelism.
• It strives to support the full ANSI/ISO C (C89, C99) standards.
• It takes into account the characteristics of the underlying embedded platform. This allows for much more accurate estimates of the benefits of applying parallelism patterns to accelerate and efficiently execute the targeted application on a given multicore platform.
Silexica offers a unique solution for automatic, high-quality parallelism extraction from C/C++ applications targeted at a given platform. SLX C/C++ performs C code partitioning by analyzing control and data flow within the original sequential code, exposing the maximum amount of parallelism, which is reported through parallelization hints and OpenMP pragmas.
SLX C/C++ supports automatic parallelism detection for four distinct forms of parallelism: Task Level Parallelism (TLP), Data Level Parallelism (DLP), Pipeline Level Parallelism (PLP) and Computation Offloading. Examples of TLP, DLP, PLP, and offloading instances are illustrated in Figure 1.1.
In Task Level Parallelism, a computation is divided into multiple processes that perform different operations and execute in parallel, as Figure 1.1a illustrates.
Data Level Parallelism is a form of parallelism typically found in scientific and multimedia applications. Here, a given computation is replicated into multiple processes that operate on different input data sets in parallel, as Figure 1.1b shows. The main goal in DLP is to split the iteration space of a loop into multiple workers as long as there are no loop-carried dependencies.
In Pipeline Level Parallelism, a computation is broken down into a sequence of processes (also called pipeline stages), which are contained within a loop. By doing this, it is possible to execute pipeline stages in parallel, as Figure 1.1c illustrates.
In computation offloading, a computational task is assigned from a host processor to an accelerator. This requires moving the necessary data from the host memory to the accelerator memory, performing the computation on the accelerator, and transferring any result data back to the memory of the host core. This allows the host core to perform other tasks while the accelerator is performing the computation. This kind of parallelism is shown in Figure 1.1d.
While frequently present, the inherent application parallelism is often obscured by the sequential implementation, semantics or coding style. SLX C/C++ exposes this parallelism and performs whole-program analyses to help the user to decide on the best parallel representation of the application for a given target multicore platform.
Figure 1.2 shows the main components of the SLX C/C++ methodology. The inputs are a sequential or threaded C/C++ application and a model of the target multicore platform. This model describes relevant details of the platform such as the number of central processing units (CPUs) and DSPs, execution costs of the instruction set, the memory architecture, and communication costs. Then a program representation3 is constructed, which combines the outcomes of static and dynamic analyses. While the static analysis is based on compile-time information such as the complete control flow, the dynamic analysis is based on runtime information such as a list of executed functions, basic block execution counts and memory accesses involving pointers. The dynamic information is obtained by instrumenting the intermediate representation (IR) generated by the compiler and executing the resulting binary to generate a trace. Then the Program Model is analyzed by algorithms that extract different forms of parallelism and this information is provided in the form of parallelization hints.
Performance information is fundamental for parallelization. It is used to identify computationally intensive functions (parallelization candidates) and to perform a cost-benefit analysis that verifies the potential speedup of parallelizing a given candidate. Actual performance estimation is based on a Microarchitecture-aware Cost Table model, described below. This estimation considers the execution count of every statement and a calculated cost for each type of operation on the target platform. The cost of every instruction is provided by the input platform model. This type of estimation has the advantage of providing information at the C statement granularity, which is the minimum granularity considered for parallelization.
For Microarchitecture-aware Cost Table profiling, SLX C/C++ uses advanced compiler techniques to statically and dynamically analyze the application source code and determine platform-dependent execution costs. The estimation engine uses abstract processor models that specify storage resources and functional units with their associated sets of supported operations. The engine then simulates the effect of the target compiler, applying instruction lowering, selection and scheduling with a high level of accuracy. It considers hardware characteristics such as vector capabilities (SIMD), pipelining effects, addressing modes, and predicated execution. It also considers software-related costs such as those imposed by calling conventions, C external library calls, and register spilling. All in all, this yields highly accurate estimates at fast analysis speeds.
In order to extract all useful parallelism from the application, SLX C/C++ analyzes one parallelization candidate at a time, exposing the parallelism patterns described above by means of state-of-the-art heuristics.
Using C/C++ code statements as the minimum granularity for partition extraction provides great flexibility. It also allows a straightforward correlation between the partitions and the original source code, enabling an easy derivation of a parallel representation. The output of the parallelism discovery stage is a set of source-level hints that guide the developer in deriving a parallel representation of the application.
After parallelism has been discovered, it is possible to implement a multicore version using different parallel programming APIs. Choosing an API requires considering factors such as the API’s fitness for a given application domain, the availability of target compilers for the chosen API, and OS and runtime compatibility. The current SLX.cloud release provides support for OpenMP and guides the user through the implementation process as follows:
• SLX C/C++ automatically generates OpenMP pragmas corresponding to a subset of the OpenMP 4.5 specification. The original sequential C/C++ code is automatically annotated with the identified pragmas. All OpenMP pragmas supported for automatic generation by SLX C/C++ are described in Section 4.3.
1 Open Multi-Processing, see http://www.openmp.org
3 The Program Model describes the application in terms of performance information and the control and data dependency relationships between the C statements in the computationally intensive functions.