By raising chilled water temperatures by 4.4 degrees Celsius, researchers at the Argonne Leadership Computing Facility (ALCF) have reduced power consumption by an estimated 800 kilowatts, saving around US$50,000 per year.
The ALCF operates an IBM Blue Gene/P supercomputer called Intrepid, which launched in June 2008 as the third fastest computer in the world.
According to ALCF researchers, cooling the 557-teraflop machine currently consumes more electricity than the machine itself requires to run.
iTnews sat down with ALCF director Pete Beckman and ALCF project manager Jeff Sims to discuss the facility's innovative power management techniques and their potential for the enterprise.
Please introduce yourselves and your roles at ALCF.
Pete: I am the director of the ALCF and I am also a research scientist in the Math and Computer Science division.
Jeff: I'm a project manager at Argonne, so as projects come up, I help get them organised and follow through till the end. My background is in engineering.
Could you tell us a little about Argonne; why was Intrepid installed and what are its goals?
Pete: The DOE maintains two very large computing facilities that house the largest, fastest computers of their kind for open science. The Argonne facility is designed to provide supercomputer cycles for the most challenging computational problems around the country and around the world.
Every year, people submit proposals asking for time on the machine, and the best proposals are then given time. Anyone in the U.S. who is willing to publish their results in an open way can apply.
We have people from the National Science Foundation, the National Institutes of Health and also collaborators from the U.K. and France who apply and get time on our supercomputer. We have projects that span nanotechnology and biology at a molecular scale, all the way up to scales of galaxies and stars.
Why is power consumption a particular concern?
Jeff: I'd answer that in three different ways. Number one, the DOE strives to be energy-efficient in everything it does.
Number two, next-generation machines will have even greater electrical demands, so we're trying to do everything that we can now to engineer more efficient solutions so we can kind of stay ahead of the wave of more power-hungry machines.
Three, reduced operations cost allows us to procure more hardware. The less we pay on our power bill, the more money is freed up to buy hardware.
I understand that cooling accounts for a significant portion of Intrepid's power consumption. Can you please describe the cooling process and how you plan to reduce its energy requirements?
Jeff: Since the Blue Gene/P has high air-cooling demands -- about 5,000 cubic feet per minute per rack -- conventional computer room air conditioner (CRAC) units aren't really effective in our application.
We've employed a high-volume, building-type air handler system to pressurise the underfloor with about 64°F (17.8°C) air. That 64°F air gets pulled up through the machine, and that's what cools the machine.
The 64°F air is produced by passing the air through cooling coils that chilled water flows through. The production of that chilled water that flows through those cooling coils is the most costly part of the cooling process.
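To put that airflow figure in perspective, here is a back-of-the-envelope sensible-heat calculation using the standard relation for air at normal conditions. The temperature rise across a rack is an assumed value for illustration, not an ALCF measurement.

```python
# Back-of-the-envelope sensible heat removed by the underfloor airflow.
# For air at standard conditions: Q (BTU/hr) ~= 1.08 * CFM * delta_T_F.
CFM_PER_RACK = 5000   # from the interview: about 5,000 cubic feet per minute per rack
DELTA_T_F = 18        # assumed air temperature rise across a rack (illustrative only)

q_btu_per_hr = 1.08 * CFM_PER_RACK * DELTA_T_F
q_kw = q_btu_per_hr / 3412   # 1 kW is roughly 3,412 BTU/hr

print(f"~{q_kw:.0f} kW of heat removed per rack at a {DELTA_T_F}F air temperature rise")
# All of that heat ends up in the chilled-water loop, which is why producing
# the chilled water dominates the cooling cost.
```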
During our fall, winter and spring months in northern Illinois [where ALCF is located], water is cooled by the environment, then we pull that back into the room and that's used in the cooling coils to cool off the air.
This process is called waterside economising, or free cooling. It's not really a new technique, but it's something that nowadays, with green building techniques and sustainability, we have to pay close attention to.
During the warm, summer months, Mother Nature isn't going to cool off the water for you. In that case, we use what's called mechanical cooling, where a centrifugal chiller compresses refrigerant and creates cold water -- kind of like your refrigerator at home.
That compressor is very power hungry, especially when you're talking about 600 to 800 tonnes of cooling in our case. That's the thing that uses a lot of electricity so we're trying to minimise the period where we run those compressors.
What we're trying to do now is determine the warmest temperatures that we can dump into those cooling coils so we can do two things. First, we can maximise the free cooling period.
The second thing is that the warmer temperature allows us to run the chillers less, and less hard. The compressors don't have to work as hard to provide 55°F (12.8°C) water compared to 46°F (7.8°C) water.
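A minimal sketch of the waterside-economising decision described above, assuming a simple approach temperature between outdoor conditions and the cooling tower water. The temperatures and the approach value are illustrative, not ALCF setpoints.

```python
def cooling_mode(outdoor_wet_bulb_f: float,
                 chilled_water_setpoint_f: float,
                 tower_approach_f: float = 7.0) -> str:
    """Pick free (economiser) cooling when the environment can make water
    at or below the setpoint; otherwise fall back to the mechanical chillers.
    The 7F tower approach is an assumed, illustrative value."""
    if outdoor_wet_bulb_f + tower_approach_f <= chilled_water_setpoint_f:
        return "free cooling (waterside economiser)"
    return "mechanical cooling (chiller compressors)"

# Raising the setpoint widens the range of weather in which free cooling works.
for setpoint_f in (46, 54):
    print(setpoint_f, cooling_mode(outdoor_wet_bulb_f=42, chilled_water_setpoint_f=setpoint_f))
```

With a 42°F wet-bulb day, a 46°F setpoint still needs the chillers, while a 54°F setpoint can be met by free cooling alone.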
How are you able to use warmer water to cool the machine?
Jeff: Engineers by their training are conservative people. Using the best guesses that the vendor has, they come up with a design that tells you what chilled water temperature you should run at.
You find out after you put the computer in that your mixture of applications probably doesn't drive the computer to the peaks that were used in the engineering design.
What we're coming to understand now, with our operating history, is how the computer actually runs and what chilled water temperature it really needs to be cooled to.
So far, we've been able to raise the chilled water from its original design temperature of 46°F (7.8°C) up to 54°F (12.2°C).
In ballpark numbers, that's saved about 10 to 15 percent of the electrical demand on the chillers, and it extends the free cooling period by as much as two months each year. Now we can do [free cooling for] at least eight months out of the year with 54°F water.
In Illinois, if you can get your chilled water requirement up to 70°F to 75°F (21.1°C to 23.9°C), you can free cool the entire year. That's an interesting thing that we're talking to IBM about: using warmer temperatures on future machines so we can maximise our free cooling period.
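One way to put numbers on "maximising the free cooling period" is to sweep candidate setpoints against a year of hourly weather data for the site. The file name, column name and approach temperature below are hypothetical placeholders, not ALCF data.

```python
import csv

TOWER_APPROACH_F = 7.0   # assumed approach between wet-bulb and tower water temperature

def free_cooling_hours(hourly_wet_bulb_f, setpoint_f):
    """Count the hours in which the cooling towers alone can meet the setpoint."""
    return sum(1 for wb in hourly_wet_bulb_f if wb + TOWER_APPROACH_F <= setpoint_f)

# Hypothetical hourly weather file for northern Illinois with a 'wet_bulb_f' column.
with open("chicago_hourly_weather.csv") as f:
    wet_bulb = [float(row["wet_bulb_f"]) for row in csv.DictReader(f)]

for setpoint_f in (46, 54, 70, 75):
    hours = free_cooling_hours(wet_bulb, setpoint_f)
    print(f"{setpoint_f}F setpoint: ~{hours / 24 / 30:.1f} months of free cooling per year")
```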
How much power and money is the ALCF saving by using free cooling?
Pete: During the free cooling period, it could be upwards of US$25,000 a month. It's a third of the machine power [that has been saved], so maybe 300 to 400 kilowatts.
But let's say this machine uses on average a megawatt of power. Next generation machines and the generation after that could be using upwards of 40 megawatts of power.
Little tweaks that we're doing right now may not seem to have a huge impact, but we need to learn from what we're doing and optimise this for future systems, because in future systems, doing these little changes will literally be [saving] millions of dollars a year.
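A hedged back-of-the-envelope on that scaling argument. The electricity price and the efficiency gain are assumptions for illustration; the one-third cooling share comes from Pete's figure above.

```python
# Rough scaling of today's chiller tweaks to a hypothetical future 40 MW machine.
FUTURE_MACHINE_MW = 40.0
COOLING_FRACTION = 1 / 3       # "a third of the machine power" goes to cooling, per the interview
EFFICIENCY_GAIN = 0.15         # upper end of the 10 to 15 percent chiller savings quoted above
PRICE_PER_KWH = 0.07           # assumed bulk electricity price, US$/kWh
HOURS_PER_YEAR = 8760

saved_kw = FUTURE_MACHINE_MW * 1000 * COOLING_FRACTION * EFFICIENCY_GAIN
saved_dollars = saved_kw * HOURS_PER_YEAR * PRICE_PER_KWH
print(f"~{saved_kw:.0f} kW saved, roughly US${saved_dollars / 1e6:.1f} million per year")
```

Under those assumptions the cooling-side savings alone are on the order of a million US dollars a year, before counting the other techniques discussed below.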
How will these techniques be implemented in future systems?
Pete: For future machines, there are a lot of ideas being thrown out there about how to optimise cooling of computers, including thermal energy storage tanks, combining heat and power plants, geothermal cooling and ice slurry technologies.
We're working with IBM right now on the design of these computers. The current generation of Blue Gene/P uses air-cooling. The next generation will be water-cooled: we'll bring the water directly into the rack, and there will be pieces of tubing that touch the hot parts of the computer, so we're much, much more efficient because we can pull the heat off directly.
For our next-generation machines, we are working with engineers from the University of Illinois on how to optimise chilled water temperatures for next-generation computing hardware, and how to maximise our free cooling period.
What are some other techniques that you're using to reduce the power consumption of Intrepid right now?
Jeff: One of the things we're doing now is modifying the machine's consumption profile based on the cost of electricity.
When you do something really computationally intensive, the computer actually draws more power. We're trying to understand the applications that need more power, and run those in patterns that will allow us to use less energy.
For example, if we know that certain applications really consume a lot of power, those are the kind of applications we'd want to run in the evening, when the electricity is cheaper.
The Blue Gene/P can vary in power [requirements] by almost a factor of three from idle to running really complex math operations. That's a pretty big difference, so being able to predict those swings is a big area of research for us.
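A minimal sketch of the scheduling idea described here: estimate each job's power draw and steer the hungriest jobs into cheaper tariff windows. The job data, tariff prices and the power threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    est_power_kw: float   # predicted draw while running (idle-to-peak can vary ~3x)
    hours: float

OFF_PEAK_PRICE = 0.04   # US$/kWh overnight, assumed tariff
ON_PEAK_PRICE = 0.09    # US$/kWh daytime, assumed tariff

def schedule(jobs, power_threshold_kw=600):
    """Classify jobs so the most power-hungry ones land in the cheap evening window."""
    plan = []
    for job in jobs:
        off_peak = job.est_power_kw > power_threshold_kw
        price = OFF_PEAK_PRICE if off_peak else ON_PEAK_PRICE
        window = "evening (off-peak)" if off_peak else "daytime"
        plan.append((job.name, window, job.est_power_kw * job.hours * price))
    return plan

jobs = [Job("dense linear algebra run", 900, 12), Job("I/O-heavy analysis", 350, 6)]
for name, window, cost in schedule(jobs):
    print(f"{name}: run {window}, estimated energy cost US${cost:,.0f}")
```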
Are there any of Argonne's energy saving techniques that can be applied to enterprise computing?
Pete: In fact, almost everything we've described here is something that data centres are looking at and beginning to try out.
How many nines of reliability you need does have an impact on the kinds of things you can try and the kinds of things you can do. For example, one of the things we do here to save power is we don't have a battery-backed power supply for the entire machine; we only have one for the disk.
We think of supercomputers as a sort of time machine. What we mean is that a lot of the technology that we develop here for supercomputers is what will be in a high-end server five years from now, and in your desktop several years later.
An example is the move from air-cooling to water-cooling. We already know that our next generation [supercomputing] machine is going to be completely water-cooled, because that makes the most sense from a green perspective.
In the future, we're going to see standard data centres slowly shifting over to water-cooled servers. It seems pretty likely to us that, as we continue to have larger compute requirements, designing these more efficiently will be what the industry does.
Another time-machine concept in the Blue Gene/P architecture is that by going with a slightly slower clock and increasing the number of processors, we save a lot of power. We're already starting to see that in data centres now, where choosing the fastest, most powerful processor for each individual node is not the best power envelope.
Choosing one that is slightly slower, but having more of them, uses less electrical power. Data centres like Google, Amazon and Yahoo can take advantage of having lots of servers running at relatively lower power and this is another trend that we expect will continue from our space into the data centre space.
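A rough illustration of that trade-off, using the common approximation that a core's dynamic power scales with frequency times voltage squared and that voltage scales roughly with frequency. The clock ratio is an assumed figure; the model ignores static power and other real-world scaling limits.

```python
def relative_power(freq_ratio: float) -> float:
    """Approximate dynamic power relative to a full-speed baseline.
    P ~ f * V^2, and V is assumed to scale roughly with f, so P ~ f^3.
    This is a simplification that ignores leakage and fixed overheads."""
    return freq_ratio ** 3

# Same aggregate throughput: one full-speed core versus 1/0.7 cores at a 0.7x clock.
fast_config = 1 * relative_power(1.0)
slow_config = (1 / 0.7) * relative_power(0.7)
print(f"fewer, faster cores: {fast_config:.2f}x power; "
      f"more, slower cores: {slow_config:.2f}x power")
# Roughly 0.49x: more, slower processors can deliver the same work for about
# half the dynamic power, which is the design point described above.
```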
Jeff: The free cooling thing is clearly dependent on geography, but the concept of optimising chilled water temperature based on your actual demand is something that all data centres should be doing.
It's really an integration of IT professionals with engineers; it's understanding how the applications are running and how hard they are pushing the machine.