Programming heterogeneous systems is often considered challenging and cumbersome. Therefore, the DEEP programming model provides a dedicated development and runtime environment which makes porting applications to DEEP a snap. If your application already uses MPI, you are almost done: it will benefit from MPI implementations optimized for InfiniBand on the Cluster side as well as for the EXTOLL network on the Booster side. If your application does not use MPI, the task-based OmpSs programming model might even save you from explicitly parallelizing your application yourself.
The DEEP programming model provides a dedicated development and runtime environment supporting the distinct hardware features of Cluster and Booster. On the Cluster side an MPI library specifically optimized for the InfiniBand fabric is provided, while the corresponding Booster MPI supports the EXTOLL network. The latter is used by the highly scalable code-parts (HSCP) for intra-Booster communication. The choice of MPI reflects the fact that the guiding applications of the DEEP Project are all based on the MPI programming paradigm.
The Booster Interface introduces constraints on communication latency. Therefore, the DEEP programming model foresees dividing applications at a boundary across which communication is infrequent and the data volume is limited. By offloading the HSCP together with their intra-Booster communication, collective operations are for the most part restricted to either the Cluster part or the Booster part of the application.
The actual offloading of the HSCP shall stay as close to existing standards as possible. Therefore, the dynamic process model of MPI-2, namely MPI_Comm_spawn(), is used for DEEP’s offloading mechanism. Besides spawning processes on the Booster Nodes, it allows for an efficient exchange of data between Cluster Nodes and Booster Nodes with MPI semantics.
MPI_Comm_spawn is a collective operation performed by a subset of the processes of an application started on the Cluster. An inter-communicator is returned, providing a connection handle to the children. Each child has to call MPI_Init, as usual, and can get access to the inter-communicator via MPI_Get_parent.
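The following minimal sketch illustrates this mechanism with plain MPI-2 calls; the executable name hscp_solver, the number of spawned processes and the halo buffer are purely illustrative.

    /* cluster_part.c -- low/medium-scalable code running on the Cluster */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm booster;                 /* inter-communicator to the spawned HSCP  */
        double halo[1024] = { 0.0 };      /* illustrative data for the Booster side  */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collective over MPI_COMM_WORLD: spawn 4 HSCP processes on the Booster.
           "hscp_solver" is a placeholder executable name.                          */
        MPI_Comm_spawn("hscp_solver", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &booster, MPI_ERRCODES_IGNORE);

        if (rank == 0)                    /* hand data to child rank 0 via MPI       */
            MPI_Send(halo, 1024, MPI_DOUBLE, 0, 0, booster);

        MPI_Comm_disconnect(&booster);
        MPI_Finalize();
        return 0;
    }

    /* hscp_solver.c -- highly scalable code-part spawned on the Booster */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent;
        double halo[1024];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Get_parent(&parent);          /* inter-communicator back to the Cluster  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            MPI_Recv(halo, 1024, MPI_DOUBLE, 0, 0, parent, MPI_STATUS_IGNORE);

        /* ... intra-Booster work and communication use the children's MPI_COMM_WORLD ... */

        MPI_Comm_disconnect(&parent);
        MPI_Finalize();
        return 0;
    }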
An inter-communicator as defined by the MPI standard contains two groups of processes and naturally allows point-to-point communication between a member of one group and a member of the other group. Starting with MPI-2, collective operations have been extended and defined for inter-communicators.
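Reusing the booster and parent handles from the previous sketch, an inter-communicator collective such as a broadcast from the Cluster group to the Booster group looks as follows (params is an illustrative buffer; on the sending group the root passes MPI_ROOT and all other members MPI_PROC_NULL, while the receiving group names the rank of the root within the remote group):

    /* Cluster (parent) side: rank 0 acts as the root of the broadcast.            */
    MPI_Bcast(params, 16, MPI_DOUBLE,
              rank == 0 ? MPI_ROOT : MPI_PROC_NULL, booster);

    /* Booster (child) side: every process receives from rank 0 of the remote group. */
    MPI_Bcast(params, 16, MPI_DOUBLE, 0, parent);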
DEEP’s programming model is based on MPI for intra-Booster as well as for intra-Cluster communication. Together with the offloading mechanism and the Cluster-Booster communication, they form a global MPI, i.e. a heterogeneous MPI implementation that is usable on all node types and allows for communication among Cluster Nodes and among Booster Nodes, respectively, and at the same time between the Cluster and the Booster parts of the application.
Extending an application with explicit DEEP offloading calls can be cumbersome and error-prone. Thus, DEEP adopts the OmpSs data-flow programming model to ease application porting to heterogeneous machines. Based on OpenMP, OmpSs exploits task-level parallelism and supports asynchronicity, heterogeneity and data movement. While OmpSs also targets the individual Cluster Nodes and Booster Nodes, it is being extended to support the offload of large, complex tasks from the Cluster to the Booster side of the DEEP System.
OmpSs task model
To use OmpSs, some parts of an application must be taskified. Basically, this is done by annotating the selected code with OpenMP-like pragmas indicating the data read (input) and/or written (output) by each task. Additionally, the user can specify one or a series of hardware devices on which a given task should be executed, and whether data needs to be copied from/to those devices. Different versions of a task can exist to target different architectures.
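A minimal sketch of such annotations, assuming OmpSs-style in()/out() clauses and the array-section notation a[0;n] (n elements starting at index 0); the function names are illustrative, and a real CUDA variant would additionally need an ndrange clause and a kernel implementation:

    /* SMP version of the task: inputs a and b, output c.                          */
    #pragma omp target device(smp)
    #pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
    void vec_add(const double *a, const double *b, double *c, int n);

    /* Alternative implementation of the same task for a CUDA device; copy_deps
       requests that the dependency data be copied to/from the device.             */
    #pragma omp target device(cuda) copy_deps implements(vec_add)
    #pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
    void vec_add_cuda(const double *a, const double *b, double *c, int n);

    /* Calling the annotated function creates a task; taskwait blocks until all
       previously created tasks have completed.                                    */
    vec_add(x, y, z, N);
    #pragma omp taskwait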
OmpSs annotations are interpreted by the Mercurium source-to-source compiler, which supports Fortran, C, and C++ languages. For each call to the annotated functions the compiler generates a call to the Nanos++ runtime system to create a new task. The result is compiled by a native compiler and linked with Nanos++.
Each time a new task is created its input and output dependencies are matched against those of the already existing tasks. Taking these dependencies into account, the runtime builds a task-dependency graph on the fly and decides in which order tasks run and which of them may execute concurrently. All this information is used to schedule the tasks on the available devices.
OmpSs in DEEP
In DEEP, the OmpSs programming model will run not only at the node level, but also as an abstraction of the global MPI. Pragmas are provided to make the offload of tasks from Cluster to Booster more user-friendly. They hide the necessary coordination and management of two or more sets of parallel MPI processes and send the required data from one side to the other and vice versa.
For that, OmpSs is extended by pragmas to mark the MPI functions that must be offloaded. The Mercurium compiler and the Nanos++ runtime cooperate to transparently manage all the data transfers between the MPI processes running on the Cluster and those running on the Booster, making use of MPI_Comm_spawn and MPI_Send/MPI_Recv. By means of these annotations the application developer can mark HSCPs to be sent to the Booster, while still using MPI operations within these super-tasks.
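A purely hypothetical sketch of what such an annotation might look like; the device(mpi) and onto() clauses are assumptions used only to illustrate the intent, not confirmed syntax:

    /* Hypothetical offload annotation: the clause names device(mpi) and onto()
       are assumptions, not confirmed DEEP/OmpSs syntax. The runtime would spawn
       the HSCP processes on the Booster via MPI_Comm_spawn and move the in()/out()
       data with MPI_Send/MPI_Recv.                                                */
    #pragma omp target device(mpi) copy_deps onto(booster_comm, 0)
    #pragma omp task in(field[0;n]) out(result[0;n])
    void hscp_solver(double *field, double *result, int n);

    hscp_solver(field, result, N);   /* MPI operations inside this super-task run on the Booster */
    #pragma omp taskwait             /* wait for the offloaded task to complete                  */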