Connecting hardware and applications on the way to Exascale

Behind the Scenes of the DEEP Software Teams

In IT, the software is the ultimate connector between the hardware and the applications. Thanks to the co-design approach in DEEP, the same holds for our project. Norbert, team lead for the system software, and Vicenç, team lead for the programming environment, give us a glimpse behind the scenes of the software teams. Their hard work enables application developers to actually make use of the innovative DEEP hardware architecture and to benefit from the system as much as possible.

When talking about the software work done in the DEEP project, there are basically two parts: the system software and the programming environment. Let’s start with the system software. Norbert, can you roughly sketch your approach and go into detail on the research that is specific to the DEEP architecture?

Norbert: It is important to emphasize that our approach starts from the requirements of the application software developers. Given that all the applications investigated within the project – and most HPC applications in general – are based on the MPI programming paradigm, it became obvious to us early on that our best bet was to use MPI as the lowest abstraction layer of DEEP's heterogeneous hardware architecture.

Interestingly, even early versions of the MPI standard provided all the interfaces needed for the dynamic creation of new processes and for global communication on heterogeneous platforms. Thus, there was no need for us to extend the standard or even come up with a proprietary extension – which, by the way, is a huge advantage for application developers as it makes their lives way easier.

Fortunately, ParTec's ParaStation MPI – the implementation of the MPI standard we rely on within the project – has a very flexible plugin architecture. This allows us to implement all the extensions demanded by the heterogeneous hardware architecture without touching the architectural foundations of ParTec's MPI implementation. The result of these efforts is a global MPI that allows for communication within the Cluster, within the Booster, and across both parts of the system. Furthermore, by employing MPI's MPI_Comm_spawn() call, it even enables MPI applications to dynamically offload work from the Cluster side to the Booster side and vice versa.

It seems like the DEEP exascale-ready system is a fairly easy beast to control for an application software developer – but what did it take to develop it that way?

Norbert: Indeed, from the high-level point of view of an application developer it is. Underneath this surface, however, a plethora of challenges is lurking. It starts with efficient communication between the Cluster and the Booster, touches the optimisation of communication on both sides of the system, affects the management of heterogeneous resources, and ends at the very low level with the question of how to boot the KNCs in the Booster Nodes without having a host processor there. For example, we were able to solve the remote-boot problem by utilising EXTOLL's SMFU feature, which allows the PCIe protocol to be transparently forwarded via EXTOLL from the Booster Interface (BI) to the KNCs in the Booster Nodes. In this way the BI's Xeon processor can act as the host processor for up to 16 KNCs in the adjacent chassis of the Booster. By this means we were able to spare an additional processor in each Booster Node and thus increase power efficiency significantly.

This Cluster-Booster communication you just touched upon is quite particular to the DEEP architecture. Can you explain in more detail how it works?

Norbert: The specific challenge of the Cluster-Booster communication comes from the fact that the DEEP architecture is heterogeneous in a twofold way: besides the heterogeneous processors (Intel Xeon in the Cluster, Intel Xeon Phi in the Booster), the interconnect technology also differs between the Cluster and the Booster. While the Cluster relies on an InfiniBand interconnect, the Booster fabric employs the EXTOLL technology developed at the University of Heidelberg. Both fabrics meet in the Booster Interface, which has to convert between the communication protocols. We have invested significant effort in optimising this conversion. Ultimately, by using innovative features of the EXTOLL technology, we managed to take the BI's processor almost completely out of the game, doing most of the conversion in the local PCIe fabric. The corresponding software – the Cluster-Booster protocol – then forms yet another plugin of ParaStation MPI, enabling MPI to work across both fabrics, too.

Vicenç: Thanks to this work, all the low-level details involved in the Cluster-Booster communication are encapsulated below the MPI layer. This is a great accomplishment, since it makes it completely transparent for the runtime of the programming model and also for the applications.

Speaking of the programming model: Which one do you use and have you adapted it to the DEEP system?

Vicenç: We rely on the OmpSs programming model developed at the Barcelona Supercomputing Center. For DEEP we have extended it with flexible offloading features, which make it easier for application developers to try different partitions and use the one that works best for each application. Moreover, our runtime system also supports the OpenMP programming model, which can be used inside the offloaded code to make the most of the Xeon Phi processors. So as you can see, our key focus is to keep a system like the DEEP machine programmable and manageable for application developers.

Can you elaborate a bit more on this DEEP offload model? How does it work and what are the advantages for a DEEP user?

Vicenç: The idea behind the DEEP offload is conceptually very similar to the Intel offload model: it is all about annotating a portion of code and offloading it to another device. However, there are two key differences: firstly, with our approach we can dynamically allocate the set of resources we want to use for the offloaded work. And secondly, our offload is not limited to simple computational kernels: we can offload a kernel of arbitrary complexity that may contain MPI calls, I/O calls, etc. This enables the offload of large portions of an application that can benefit from the high-performance EXTOLL network provided by the DEEP system. Moreover, our approach is fully integrated with our OmpSs/OpenMP runtime, which can be used on both the offloading and the offloaded parts of an application.

Although it now seems obvious and natural, the original design of the offloading was a challenging process: it was quite tricky to choose the right level of abstraction for the offloading primitives. We strongly believe we made the right decision here, as we were able to make sure that the DEEP offload can be used on other systems and applications as well.

From what you’ve described so far: Your work seems to be the essential link between the hardware part of the project and the work done with respect to the applications?

Vicenç: Yes, definitely. There were intensive co-design efforts necessary between system software and hardware. Only this way you can develop an efficient system software stack that can exploit the specific hardware features. But obviously there is also thorough co-design at work between the teams developing the programming model and the application developers to provide a convenient way to exploit the heterogeneous hardware. Not to mention the close co-operation between the two software teams that ensures that the integration of system software and programming model is done as smoothly as possible.

What is the current status of your efforts?

Norbert: From a functional point of view, the implementation of all extensions is complete. We have a well-performing Booster MPI utilising EXTOLL on the KNC platform, we can dynamically extend the resources allocated to an application, spawn additional processes onto these resources, and communicate between the Cluster and the Booster via the so-called Cluster-Booster protocol. Due to the late arrival of the final hardware of the DEEP Booster, we still have some open items with respect to fine-tuning and performance optimisation. But these efforts have started with the availability of a small system based on the production version of the hardware and will be finalised in time once the DEEP Booster is available.

Vicenç: On the offload side, the main features were already implemented a year ago. During the last months we have focused on several optimisations based on the needs of the various applications in the project, for example reverse offload and the early release of dependences to optimise I/O offload. The only feature not yet implemented is the integration with the resource manager that will allow us to dynamically allocate the Booster Nodes.

Final question: Are these software developments limited to the DEEP architecture? Or are there aspects the HPC community or even other fields in ICT could benefit from?

Vicenç: It is definitely useful outside of the project. By using the MPI standard as the interface between system software and programming model, we ensure that the DEEP collective offload and the work done on the applications can be used on other systems. We are actually very proud of that and hope the community finds it useful as well.

Norbert: In fact, we expect that such efforts will become even more important on our path to Exascale. From today's point of view it is very likely that future HPC systems will be heterogeneous. And the question of how to program these systems is still open, even though we cannot expect MPI to go away quickly. The DEEP programming model provides an approach to the challenge of programming a heterogeneous system given an application that is based on MPI. The resulting software infrastructure might even be used on heterogeneous systems that differ significantly from the DEEP systems. Therefore I don't see any kind of limitation here.

Thank you for the interview, Norbert and Vicenç!

Norbert Eicker is Professor for Parallel Hardware and Software Systems at Bergische Universität Wuppertal and head of the research group Cluster Computing at Jülich Supercomputing Centre (JSC). Before joining JSC in 2004, Norbert worked at ParTec from 2001 on the cluster middleware ParaStation. During his career he was involved in several research and development projects, including the ALiCE cluster in Wuppertal, JULI, and JSC's general-purpose supercomputer JuRoPA. Currently he is acting as the chief architect for the DEEP and DEEP-ER projects. Norbert holds a PhD in Theoretical Particle Physics from Wuppertal University.

Dr. Vicenç Beltran is a senior researcher at BSC. He is currently working on distributed programming models for HPC. He received his engineering degree (2004) and Ph.D. (2009) in Computer Science from the Technical University of Catalonia (UPC). His research interests include programming models and domain specific languages for HPC, operating systems and performance analysis and tools. He has worked in a number of EU and industrial projects.