It wasn’t that long ago that a rack full of equipment in a data center would draw 10,000 watts of power. That’s equivalent to ten 1,000-watt hair dryers running continuously.
If you grew up in a large family or have seen the film “Cheaper by the Dozen,” you know how challenging it can be to power all those dryers without tripping a breaker, and to find ways to deal with the heat.
Recent developments, time-tested solutions
Recently, NVIDIA announced the NVL72 dense GPU rack with an expected power draw of 120,000 to 160,000 watts. That’s between 120 and 160 hair dryers in a space about the size of a linen closet.
How long before the house burns down if the owner doesn’t deal with the heat?
The IT industry has had solutions for this for a very long time:
- IBM mainframes have required liquid cooling at various times in their history.
- Gaming computers regularly feature liquid-cooled CPUs and GPUs for reduced noise and lower operating temperatures.
- Submerging equipment in a non-conductive fluid has also been tested and deployed in several situations.
As you learn more about these efforts, you find that each solution has its own specific benefits and challenges. You either need to understand them yourself or work with a partner that has experience with the various types of cooling options.
Air cooling: Best practices, distribution, and costs
Starting with air cooling, there is a practical limit of approximately 30,000 watts (30 kW) that can be cooled in a single rack, and reaching even that requires careful attention to airflow in the data center.
Having sufficient airflow from vented tiles or cold-air supply ducts is critical. Further, the rack must be carefully configured so that air from the hot side of the equipment doesn’t recirculate to the front of the equipment.
These are “best practices” in data center design. And even with the best designs, your racks are limited to 30 kW, which allows only TWO of the latest eight-GPU servers to be installed and powered per rack.
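The arithmetic behind that claim is straightforward. Here is a minimal sketch; the roughly 10–12 kW draw per eight-GPU server and the 12 °C air temperature rise are assumed figures based on typical published specs, not numbers from this article.

```python
# Back-of-the-envelope check on air-cooled rack capacity (assumed figures, not vendor specs).
RACK_LIMIT_W = 30_000          # practical air-cooling ceiling per rack (from the text)
SERVER_DRAW_W = 11_000         # assumed draw of a modern eight-GPU server (10-12 kW range)

servers_per_rack = RACK_LIMIT_W // SERVER_DRAW_W
print(f"Eight-GPU servers per air-cooled rack: {servers_per_rack}")      # -> 2

# Airflow needed to carry 30 kW away at a 12 C rise (air: ~1.2 kg/m^3, ~1005 J/kg*K).
AIR_DENSITY = 1.2              # kg/m^3
AIR_CP = 1005.0                # J/(kg*K)
DELTA_T = 12.0                 # assumed cold-aisle to hot-aisle temperature rise, in C

airflow_m3_s = RACK_LIMIT_W / (AIR_DENSITY * AIR_CP * DELTA_T)
print(f"Airflow required: {airflow_m3_s:.1f} m^3/s (~{airflow_m3_s * 2119:.0f} CFM)")
```

Two servers exhaust the thermal budget with room to spare for little else, and a single rack already demands several thousand CFM of properly directed cold air.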
With air cooling, you’ll have to spread your load over a larger area of the data center. As long as you have the power and space, this is workable, but it will likely increase costs for additional racks and power distribution, and it may require longer network cables or a switch from copper to optical cabling to maintain 100Gbps and faster network speeds.
Direct-to-Chip (DTC) cooling: Design, challenges, and component selection
With direct-to-chip or DTC liquid cooling, a server is designed with liquid-cooled heatsinks (cold plates) on the few specific hot chips (CPUs, GPUs), with coolant flowing through them.
Water has excellent heat capacity, which means it can absorb large amounts of heat with small increases in temperature (as anyone who’s watched and waited for a pot to boil can attest). But water can be teeming with life and reacts with just about every material it comes in contact with. It also becomes conductive with even small amounts of dissolved ions.
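As a rough illustration of what that heat capacity buys you, the sketch below estimates the water flow needed to carry away the heat of a single high-power GPU; the 700 W heat load and 10 °C coolant temperature rise are illustrative assumptions, not figures from the article.

```python
# Minimal sketch: water flow needed to remove a chip's heat at a given temperature rise.
CP_WATER = 4186.0        # J/(kg*K), specific heat of water
CHIP_POWER_W = 700.0     # assumed heat load of one high-power GPU
DELTA_T_C = 10.0         # assumed coolant temperature rise across the cold plate

mass_flow_kg_s = CHIP_POWER_W / (CP_WATER * DELTA_T_C)   # from Q = m_dot * cp * dT
litres_per_min = mass_flow_kg_s * 60                     # 1 kg of water is ~1 litre
print(f"Required flow: {litres_per_min:.1f} L/min per chip")   # ~1.0 L/min
```

Roughly a litre per minute per chip is a modest flow, which is part of why direct-to-chip cooling scales so much better than pushing thousands of CFM of air through a rack.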
The glycol added to the water in your car radiator prevents freezing and biological growth, but it lowers the heat capacity of the mixture, and the corrosion-inhibitor additives blended with it make the coolant quite conductive. Still, something must be used to prevent growth from clogging the system. If you’ve had to replace a water heater that corroded through without a sacrificial anode or got clogged with mineral deposits, you are aware of the practical problems with water cooling.
Selecting all components of a direct-to-chip cooling solution from a single vendor can help eliminate unexpected interactions between components. Guaranteeing that no liquid ever leaks is nearly impossible, but paying for high-quality components helps. Using a system that operates with negative internal pressure is also worth considering, since any breach tends to draw air in rather than push coolant out.
Immersion cooling: Fluid interactions and server compatibility
With immersion, the most common fluids are various non-conductive oils. These can be mineral oils or lab-engineered hydrocarbons with useful properties. Like water, these fluids can interact with other substances in interesting ways. Plastics tend to become brittle when in contact with these oils.
Thermal paste between heatsinks and chips can also dissolve or change properties when immersed in oil. These oils also tend to have significantly less heat capacity compared to water, so heatsinks that can spread out the heat over a larger contact area may be necessary. If thermal paste contains conductive elements, the oil can transport them to other parts of the system, causing unintended shorts.
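To put that heat-capacity gap in rough numbers, the comparison below uses typical textbook values for water and a generic mineral oil; the densities and specific heats are assumptions, not measurements of any particular immersion fluid.

```python
# Quick comparison of volumetric heat capacity: water vs. a generic mineral oil.
# Property values are typical textbook figures (assumptions), not product specs.
fluids = {
    "water":       {"density_kg_m3": 998.0, "cp_j_kg_k": 4186.0},
    "mineral oil": {"density_kg_m3": 850.0, "cp_j_kg_k": 1900.0},
}

for name, props in fluids.items():
    volumetric_mj = props["density_kg_m3"] * props["cp_j_kg_k"] / 1e6   # MJ/(m^3*K)
    print(f"{name:12s}: {volumetric_mj:.1f} MJ per m^3 per K")

# Water moves roughly 2-3x more heat per litre for the same temperature rise,
# which is why immersion designs lean on larger heatsinks and higher flow rates.
```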
As with water, careful attention to and testing of all components in contact with the oil is critical.
Given that the equipment is entirely submerged in fluid, it is necessary to select servers that are specifically engineered from the start to be immersed and not just retrofitted. This means all of the internal components, cables, and materials must be selected for compatibility with the fluid. The benefit of this effort is 100% heat capture and a largely silent operation. It can also enable operation in some particularly challenging thermal and environmental situations.
Keeping your data center cool
Ultimately, choosing the best cooling solution requires investing time in learning the details of each solution, or working with a company that has experience with all these technologies and will give you straight answers with no hype or agenda.
Penguin Solutions™ has been designing, building, deploying, and managing HPC infrastructure since 1998 and AI infrastructure since 2017, and has deployed and managed more than 75,000 GPUs. Penguin Solutions has achieved over 95% availability for Meta’s massive fleet of GPUs, integrated innovative immersion cooling technology for Shell, and recently deployed Georgia Tech’s NVIDIA-based AI supercomputer. Penguin Solutions provides assured infrastructure for critical, demanding AI workloads through its OriginAI® infrastructure solution. OriginAI simplifies AI deployment and management, maximizes GPU availability and utilization, and delivers predictable performance and optimal ROI through proven, pre-defined AI infrastructure architectures integrated with validated technologies, all backed by Penguin’s intelligent, intuitive cluster management software and expert services. Our AI infrastructure solution experts can help you unlock your AI infrastructure’s full potential!