Compute more – Consume less

Axel Auweter, team lead for energy efficiency in the DEEP project, talks about how important his field of research is on the way to building an exascale-ready supercomputing system. He explains why it is not only about power-saving hardware and performance optimisation but to a great extent about monitoring, and why he turned from a computer scientist into a plumber over the course of the project.

Axel Auweter

Energy efficiency is an integral part of the DEEP project. What was the project’s motivation for making this aspect one of the major focus topics?

Supercomputers today have become so powerful and large that they consume massive amounts of electrical energy. If we continued to grow machines as we did in the past, an Exascale-class computer would need its own power plant. This would a) be quite a drain financially and b) lack social consensus – for good reasons: wouldn't it be strange if climatologists considered supercomputers the most valuable resource in fighting climate change while the machines themselves became one of the largest contributors to global warming?

Valid reason! So how do you actually tackle the energy efficiency challenge within the project then?

Improving energy efficiency turns out to be more challenging than you might imagine. It is no longer enough to focus solely on the system hardware. Instead, we have to look at the full chain: from the data centre building and the system hardware to software such as the operating system, and of course to the applications running on the supercomputer. You will only be successful if you follow an all-encompassing approach to energy-efficient HPC. A very important piece in this mosaic is the very fine-grained monitoring system we have come up with.

Let’s talk about the hardware side of things first. What does DEEP do in this respect?

First of all, we build the system using only new components with state-of-the-art technologies. The growing mobile market has really been a key driver towards energy-efficient chip design – which now also benefits HPC research and development.

Secondly, we use a direct liquid cooling solution developed by our partner Eurotech. Cooling with water instead of air is more than 200 times more efficient. This way, we can keep components sufficiently cool even when the cooling water temperature is above 40 °C. And cooling with warm water is possible anywhere on earth, even in summer, without the use of energy-hungry chillers.

Water and computers: Doesn’t that still provoke quite some concern among people in charge of operational safety? Not to mention the effort…

There is no free lunch. Adapting for water-cooling will definitely require you to rethink your building infrastructure. But it’s doable and it’s worth it – even if you sometimes have to swap your computer keyboard for some heavier mechanical tools (smiles).

Water leakages, on the other hand, are not an issue at all – contrary to popular belief. At the Leibniz Supercomputing Centre, we were the first to prove that the technology works and is safe even for large production systems: our SuperMUC system and its predecessor CooLMUC have been in operation for several years now without any problems.

Yet, there was a learning process involved: we have learned a lot about retaining the chemical quality of our cooling water to make liquid cooling even safer in the future. For example, we found that, in addition to our thorough power monitoring, it is absolutely necessary to continuously monitor water properties such as pH and conductivity as well.

Cooling DEEP Cluster and Booster nodes with water

Does that mean you’re a computer scientist, a plumber and a chemist at the same time? Sounds like a lot of fun!

That’s indeed true. Although I didn’t expect that in the beginning, I really like the interdisciplinary nature of the DEEP project. On a side note, Chemistry was one of my favourite subjects in school.
By the way: Most of the knowledge we gather within the project is made public! We really hope that the community can learn from our experiences and that direct water-cooling will prevail in the end.

You’ve also just mentioned the monitoring system. From what you’ve alluded to so far, it seems to be a really sophisticated system. How does it work – can you go into more detail, please?

We strongly believe that centrally collecting monitoring data from all sources is key to understanding how HPC systems work. This is a prerequisite for tuning the system for better energy efficiency. Even though that sounds more than obvious, fine-grained monitoring is not common practice in HPC.

So, the first step was to thoroughly instrument the system with power sensors to give us a better understanding of where and when most of the energy is spent in the system. However, we do not only monitor the HPC system itself; we also collect information from, for example, the cooling system, which is typically the responsibility of the data centre infrastructure teams.

Now you can imagine that the amount of monitoring data we are collecting in DEEP is far bigger than on any other supercomputer. And it gets even bigger: while traditional monitoring solutions typically collect data at most once per minute, our system handles important data, such as power sensor readings, at sub-second rates.

Wow, that sounds like we’re actually talking Big Data here. How do you handle that?

Yes, indeed. Our system collects such large amounts of sensor data that storing it in a single central database would quickly hit a scalability limit. Therefore, we are borrowing solutions such as distributed NoSQL databases and lightweight messaging protocols from the big data community to handle the number of sensors and the pace at which sensor data is acquired.
An additional challenge is that some of the data we are collecting has to be processed in real time to detect and react to safety-critical events, while other data is simply stored for statistical analysis later on.
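The split described here – a real-time path for safety-critical events alongside an archival path for later analysis – can be sketched in a few lines. This is a minimal toy illustration, not DEEP's actual software: the sensor names, limits, and in-memory store are invented for the example, and a production system would use a messaging layer and a distributed NoSQL database in their place.

```python
import statistics

# Hypothetical safety limits per sensor type (illustrative values only).
SAFETY_LIMITS = {"water_ph": (6.5, 9.0), "power_watts": (0.0, 350.0)}

def ingest(readings, store, alerts):
    """Consume (sensor, value) readings: react to out-of-range values in
    real time, and append every reading to the store for later analysis."""
    for sensor, value in readings:
        lo, hi = SAFETY_LIMITS.get(sensor, (float("-inf"), float("inf")))
        if not (lo <= value <= hi):
            alerts.append((sensor, value))          # real-time safety path
        store.setdefault(sensor, []).append(value)  # archival path

store, alerts = {}, []
samples = [("power_watts", 210.0), ("water_ph", 7.2),
           ("power_watts", 390.0),  # exceeds the limit -> triggers an alert
           ("water_ph", 7.4)]
ingest(samples, store, alerts)
print(alerts)                                        # [('power_watts', 390.0)]
print(round(statistics.mean(store["water_ph"]), 2))  # 7.3
```

The key design point is that the alerting path does only a cheap bounds check per reading, so it keeps up with sub-second sampling rates, while heavier statistics run later over the stored data.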

It sounds like a lot of the research you pursue is actually very useful for current HPC data centres. What are your top five tips?

  • Make your colleagues, customers, and suppliers aware of your focus on energy efficiency, e.g. report the energy consumption of each compute job back to your users.
  • If you haven’t done so already, get your data centre infrastructure ready for direct liquid cooling.
  • Implement thorough monitoring, but never trust your sensors until you’ve calibrated them and validated their measurements.
  • The world of building infrastructure automation, in particular, is full of proprietary solutions. With supercomputers becoming ever more tightly integrated with their surrounding building infrastructure, you will certainly come to value open standards and interfaces.
  • Help your users optimize their applications for your systems. Performance optimization is still the best approach to improving the energy-to-solution of applications.
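The first tip – reporting each job's energy consumption back to users – boils down to integrating sampled node power over the job's runtime. A minimal sketch, assuming power samples in watts taken at a fixed interval (the function name and parameters are illustrative, not any site's actual API):

```python
def job_energy_kwh(power_samples_watts, interval_seconds):
    """Estimate a job's energy-to-solution by integrating sampled power
    over time (rectangle rule), converting joules to kilowatt-hours."""
    joules = sum(p * interval_seconds for p in power_samples_watts)
    return joules / 3.6e6  # 1 kWh = 3.6 million joules

# e.g. a job sampled once per second, drawing a constant 300 W for one hour:
print(round(job_energy_kwh([300.0] * 3600, 1.0), 2))  # 0.3
```

Multiplied by the local electricity price, a figure like this makes the cost of a job tangible to users in a way that CPU-hours alone do not.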

Thank you for the interview, Axel!

Axel Auweter joined the high-performance computing division of Leibniz Supercomputing Centre (LRZ) as a research associate in 2010. He is responsible for LRZ's research activities on energy efficient HPC system design and operation. In this role, Axel currently also acts as leader of the energy efficiency team in the DEEP project. His background is in computer architecture, system level programming and operating systems.