Nvidia Blackwell and the Future of Data Center Cooling

Zara Avatar

Nvidia has faced scrutiny this month because some servers with a whopping 72 Blackwell processors were overheating. The issue arose because some initial OEM deployments were not properly water-cooled, which Lenovo aggressively identified and mitigated with its Neptune warm water-cooling solutions.

As AI advances, we’ll need denser, more powerful AI processors, which suggests that air cooling in server rooms may become obsolete.

Let’s talk about Blackwell, water cooling, and why Lenovo’s Neptune solution stands out at the moment. We’ll close with my Product of the Week: Microsoft’s Windows 365 Link, which could be the missing link between PCs and terminals that could forever change desktop computing.

Blackwell

Blackwell is Nvidia’s premier, AI-focused GPU. When it was announced, it was so far over what most would have thought practical that it almost seemed more like a pipe dream than a solution. But it works, and there is nothing close to its class right now. However, it is massively dense in terms of technology and generates a lot of heat.

Some argue it is a potential ecological disaster. Don’t get me wrong, it does pull a lot of power and generate a tremendous amount of heat. But it handles so much more work than more conventional parts can that it is relatively economical to run.

It’s like comparing a semi-truck with three trailers to a U-Haul van. Yes, the semi will get comparatively crappy gas mileage, but it will also hold more cargo than 10 U-Haul vans and use a lot less gas than those 10 vans, making it more ecologically friendly. The same is true of Blackwell. It is so far beyond its competition in terms of performance that its relatively high energy use is below what otherwise would be required for a competitive AI server.
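To put rough numbers on the truck analogy, here is a minimal sketch in Python. The mileage and cargo figures are hypothetical values I picked for illustration, not measurements of any real vehicle or processor. The point is only that the fair comparison is fuel burned per unit of cargo delivered, not fuel burned per vehicle.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    name: str
    miles_per_gallon: float  # hypothetical figure, not a measured value
    cargo_units: float       # cargo capacity, counted in "van loads"


def fuel_per_cargo_unit(vehicle: Vehicle, trip_miles: float) -> float:
    """Gallons burned per unit of cargo delivered over one trip."""
    gallons = trip_miles / vehicle.miles_per_gallon
    return gallons / vehicle.cargo_units


# Illustrative numbers only (assumptions, not from any spec sheet).
semi = Vehicle("semi with three trailers", miles_per_gallon=6.0, cargo_units=10.0)
van = Vehicle("U-Haul van", miles_per_gallon=12.0, cargo_units=1.0)

trip = 100.0  # miles
semi_cost = fuel_per_cargo_unit(semi, trip)
van_cost = fuel_per_cargo_unit(van, trip)

print(f"{semi.name}: {semi_cost:.2f} gal per cargo unit")
print(f"{van.name}: {van_cost:.2f} gal per cargo unit")
```

With these made-up inputs, the semi gets half the mileage of the van yet burns about a fifth of the fuel per load. Swap in energy drawn and AI work completed, and you have the same argument for a dense, hot, high-performance processor.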

But Blackwell chips do run hot, and most servers today are air-cooled. So, it shouldn’t be surprising that some Blackwell servers were configured with air cooling and those with 72 or more Blackwell processors on a rack overheated. While 72 Blackwells in a rack is unusual today, as AI advances, it will become more common, given Nvidia is currently the king of AI.

You can only go so far with air-cooled technology in terms of performance before you have to move to liquid cooling. While Nvidia did respond to this issue with a water-cooled rack specification that Dell is now using, Lenovo was way ahead of the curve with its Neptune water-cooling solution.

Lenovo Neptune

Lenovo was the first to realize this, mainly because it is currently the market leader in its class in terms of water cooling — a technology initially acquired from IBM, which has been doing water cooling for decades.

What is important with water cooling isn’t just the technology but the knowledge of how to deploy it safely. Mixing water and high-amperage electronics can be a disaster if you don’t know what you’re doing. As a result of the IBM server acquisition, Lenovo has decades of water cooling experience that it calls Neptune.

Given Nvidia has specified a water-cooled rack, what makes Neptune better? The answer is experience. Most of those who will use the Nvidia-specified solution, including Nvidia, don’t often deploy water-cooled solutions. As a result, particularly with these high-end Blackwell implementations, they’ll essentially be learning on the job.

Water and electricity don’t mix. Not only can a leak fry an expensive part or even an entire rack, but if a person is present, it can fry them, too, if the breakers don’t trip. In a raised-floor environment, unless it has been designed with leaks in mind, terrible things can happen.

I observed this myself decades ago when I was at IBM, where it turned out the water-cooling system for our massive (for the time) data center had never been stress-tested for a sudden stop. The site lost a transformer, which shut off the water-cooling system. The pipes burst, and the data center became a dangerous swimming pool. Most of the hardware, costing hundreds of millions of dollars, was lost, and the building was flooded, doing additional damage.

Through experiences like this, IBM became the leading OEM for safe water cooling, and Lenovo acquired that knowledge and experience when it bought the IBM x86 server group. Now, Lenovo, along with IBM, knows how to do water cooling better than most, which means that you can rest assured that a Lenovo Blackwell server won’t overheat or suddenly begin to leak.

Plus, Lenovo’s expertise is in warm water cooling, a far safer and far less expensive way to cool servers than cold water cooling, which requires huge, inefficient evaporators or chillers.

Implementing this technology is no trivial task. Unlike automobiles or PCs that are water-cooled, servers have to have hot swapping capabilities, which means you need exceptional and highly tested drip-free connections, aggressive alerting, preventive maintenance schedules based on past knowledge of components, and technicians experienced with working with this level of water-cooling tech.
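To give a feel for what "aggressive alerting" might look like at its simplest, here is a minimal sketch in Python. Everything in it is an assumption on my part: the field names, the thresholds, and the rack IDs are hypothetical, and it does not represent how Neptune or any other vendor's monitoring actually works.

```python
from dataclasses import dataclass


@dataclass
class LoopReading:
    rack_id: str
    inlet_temp_c: float    # coolant temperature entering the rack
    flow_l_per_min: float  # coolant flow rate
    leak_detected: bool    # state of a drip/leak sensor


# Hypothetical limits chosen for illustration only.
MAX_INLET_TEMP_C = 45.0
MIN_FLOW_L_PER_MIN = 20.0


def check_loop(reading: LoopReading) -> list[str]:
    """Return alert messages for one reading; an empty list means healthy."""
    alerts = []
    if reading.leak_detected:
        alerts.append(f"{reading.rack_id}: leak sensor tripped")
    if reading.inlet_temp_c > MAX_INLET_TEMP_C:
        alerts.append(f"{reading.rack_id}: inlet temperature high")
    if reading.flow_l_per_min < MIN_FLOW_L_PER_MIN:
        alerts.append(f"{reading.rack_id}: coolant flow low")
    return alerts


healthy = LoopReading("rack-01", inlet_temp_c=40.0, flow_l_per_min=30.0, leak_detected=False)
stalled = LoopReading("rack-02", inlet_temp_c=48.0, flow_l_per_min=0.0, leak_detected=False)

print(check_loop(healthy))
print(check_loop(stalled))
```

A real deployment would need far more than this, including trend analysis, redundant sensors, and automatic shutdown paths. That gap is exactly why operational experience matters.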

Wrapping Up: A Future of Warm-Water-Cooled Data Centers

Blackwell is only the first of these incredibly powerful processors to hit the market. As AI pushes the envelope, Nvidia’s competitors will also have to push into something similar, suggesting all servers may eventually need to be warm-water-cooled.

That positions Lenovo nicely for a water-cooled future regardless of the technology while Lenovo’s competitors try to catch up. One benefit I expect techs to look forward to is the reduction in data center noise. The amount of air you have to push through air-cooled servers is massive and turns today’s data centers into a noise nightmare.

As warm-water cooling moves into the market more aggressively, these data centers will quiet down, making them far more pleasant places to work. That will make many of us who have to work in them very happy.

Tech Product of the Week

Windows 365 Link

Microsoft's Windows 365 Link Cloud PC device: front, side, and back views (Image Credit: Microsoft)

Ever since we replaced terminals with PCs, IT has wanted the terminal experience back. Terminals were like pre-smart TVs in that you didn’t have to do patches or OS upgrades or deal with the “blue screen of death.” If the thing broke, it was pretty easy to fix or was relatively inexpensive to replace. From an IT perspective, terminals were a ton better than PCs.

But on the user side, terminals sucked. You couldn’t run what you wanted to run without getting IT support, and it could take months for IT to respond to a request.

Terminals were connected to aging mainframes that couldn’t run modern applications at the time (they can now). New applications were usually custom-built, but a gap in communication between users and IT frequently led to problems. Users struggled to articulate their needs, and IT often failed to probe for better specifications, resulting in frequently unusable applications.

Well, at Microsoft Ignite last week, Microsoft announced the Windows 365 Link, which may be the closest thing to a perfect wired (there’s no laptop solution yet) terminal with PC-like features and performance.

While we call the class a thin client, Microsoft calls this a Cloud PC. At $349 and the size of a micro-PC, it appears to be the closest thing we’ve seen to a near-perfect PC/terminal blend.

Windows 365 Link will be more reliable, cheaper, more secure, and far smaller than most desktop PCs, making it very attractive for IT. At the same time, it connects to a Cloud PC instance, providing the user with a very PC-like experience.

It only targets enterprise accounts right now, mainly because they have the greatest need and the necessary infrastructure. I see this moving to markets like travel, education, government, manufacturing, and other vertical markets with similar needs. Although it doesn’t yet address mobile users, fully deployed 5G and the coming 6G specification should allow future mobile implementations.

Given Microsoft was one of the companies that launched the PC and made terminals obsolete, it seems both ironic and poetic that Microsoft is now taking the lead in eventually making PCs obsolete. We’ll see if that happens. For now, the Windows 365 Link is my Product of the Week.
