1. Biggest correction: cooling is not “the” biggest AI problem; it is the hidden multiplier
The headline is punchy, but it overstates the case:
“The biggest problem in AI isn’t computing power. It’s keeping the servers from overheating.”
Better:
“AI’s next bottleneck is not just compute. It is heat flux: how to remove extreme waste heat from dense GPU racks without burning power, water, and money on cooling.”
Why this is better: compute supply, grid connection, transformers, land, permitting, water, networking, chip packaging, memory bandwidth, and model efficiency are all constraints. Cooling is among the least discussed because it sits between the chip and the grid. It is not the only bottleneck, but it determines whether chips, power, and land that have already been secured can actually operate as usable infrastructure.
The International Energy Agency estimates data centers used around 415 TWh globally in 2024, about 1.5% of global electricity, and projects data-center electricity consumption could more than double to around 945 TWh by 2030, with AI as a major driver. In the U.S., Lawrence Berkeley National Laboratory reported that data centers consumed about 4.4% of U.S. electricity in 2023 and could reach 6.7% to 12% by 2028.
2. The missing technical distinction: energy use vs. heat removal
Every watt consumed by a server eventually becomes heat. Cooling does not make that heat disappear. Cooling only decides how expensively, how reliably, how water-intensively, and at what temperature that heat gets moved somewhere else.
That means the better conceptual line is:
AI’s energy problem has two layers: the watts used to compute, and the extra watts or water used to move the waste heat those computations create.
This avoids a common mistake in AI-energy discourse: pretending that better cooling eliminates the IT power itself. It usually does not. It can reduce cooling overhead, reduce thermal throttling, lower leakage losses, support higher sustained clocks, enable denser racks, reduce floor space, reduce water use, and improve uptime. But the GPU’s electrical power still becomes heat.
The genius framing:
Compute creates the heat. Cooling determines whether that heat becomes an infrastructure tax.
3. Add the heat path — this is the missing mental model
Most people imagine “cooling a data center” as fans blowing cold air around a room. That is obsolete for AI-scale density.
The post should explain the full thermal chain:
Chip junction → thermal interface material → package → cold plate / boiling surface → coolant → coolant distribution unit → facility loop → chiller / dry cooler / cooling tower → outside air
The weakest link controls the whole system.
The real fight is not “air vs. liquid.” It is thermal resistance across the stack. A brilliant cooling fluid is useless if the thermal interface material is bad. A cold plate is useless if the facility loop cannot reject heat. A perfect rack is useless if the site cannot get water, power, permits, or transformers.
Best line:
AI infrastructure is becoming a chip-to-grid thermodynamic machine. The GPU is only the hottest part of a much larger heat-removal system.
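The "weakest link" point can be made concrete with a series thermal-resistance model. The sketch below assumes a steady-state, one-dimensional heat path from junction to coolant. The stage names and resistance values are illustrative placeholders, not measurements of any real product.

```python
# Series thermal-resistance model of the chip-to-coolant path.
# All resistance values (K/W) are assumed placeholders, not vendor data.

def junction_temperature(power_w, coolant_temp_c, resistances_k_per_w):
    """Steady-state junction temperature for thermal resistances in series."""
    total_resistance = sum(resistances_k_per_w.values())
    return coolant_temp_c + power_w * total_resistance

def dominant_stage(resistances_k_per_w):
    """Name of the stage that contributes the largest temperature rise."""
    return max(resistances_k_per_w, key=resistances_k_per_w.get)

# Hypothetical stack for a 1000 W accelerator with 35 C coolant.
stack = {
    "die_to_package": 0.010,
    "thermal_interface_material": 0.025,
    "cold_plate": 0.015,
}

t_junction = junction_temperature(power_w=1000, coolant_temp_c=35.0,
                                  resistances_k_per_w=stack)
print(f"junction ~ {t_junction:.1f} C, limited by {dominant_stage(stack)}")
```

In this made-up stack the thermal interface material accounts for half of the total temperature rise, so improving the coolant alone would leave most of the problem in place.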
4. Add rack-density numbers
The post needs one hard number to make the urgency real.
For example, HPE’s NVIDIA GB200 NVL72 system page says each rack consumes 132 kW, with 115 kW liquid cooled and 17 kW air cooled. NVIDIA describes the GB200 NVL72 as a rack-scale, liquid-cooled design connecting 36 Grace CPUs and 72 Blackwell GPUs.
That is the “oh wow” moment. Traditional enterprise racks were often in the single-digit to low tens of kilowatts. AI racks are moving into industrial-machine territory.
Better post line:
A modern AI rack can draw power like a small building. At that density, air cooling stops being a comfort system and thermal management becomes industrial engineering.
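A quick energy balance shows what the 115 kW liquid-cooled figure implies for plumbing. This is a rough sketch: it assumes plain water as the coolant and a 10 K temperature rise across the rack. Both are illustrative assumptions, since real loops often use water-glycol mixtures and different temperature rises.

```python
# Coolant flow needed to carry a heat load: Q = m_dot * c_p * delta_T.
# Water properties are approximate; the 10 K rise is an assumed design value.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 1.0

def required_flow_l_per_min(heat_load_w, coolant_delta_t_k):
    """Volumetric water flow (L/min) that absorbs heat_load_w with the given rise."""
    mass_flow_kg_per_s = heat_load_w / (
        WATER_SPECIFIC_HEAT_J_PER_KG_K * coolant_delta_t_k
    )
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0

flow = required_flow_l_per_min(heat_load_w=115_000, coolant_delta_t_k=10.0)
print(f"{flow:.0f} L/min")
```

Under these assumptions a single rack needs on the order of 165 liters of water per minute, which is why the conversation shifts from fans to pumps, manifolds, and coolant distribution units.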
5. Explain the nuclear-reactor connection properly
Do not just say “inspired by nuclear reactors.” That sounds like hype unless you explain the physics.
The correct explanation:
Nuclear engineers have spent decades learning how to move very high heat loads away from reactor fuel safely. One key concept is subcooled boiling. In ordinary saturated boiling, large vapor bubbles form, detach, and rise away. In subcooled boiling, the liquid near the hot surface boils locally, but the surrounding fluid is still cool enough that the bubbles detach quickly and recondense. This refreshes liquid at the hot surface and improves heat transfer.
Ferveret’s own technology page says its Adaptive Phase Cooling is inspired by subcooled boiling, producing smaller bubbles that detach more frequently and recondense in the surrounding liquid, refreshing liquid at the chip surface and improving heat transfer.
The deep engineering line:
The breakthrough is not “boiling liquid.” The breakthrough is controlling the bubble lifecycle before it becomes unstable.
6. Obscure but powerful concept: “critical heat flux”
This is the genius technical concept missing from the post.
In boiling heat transfer, there is a danger zone called critical heat flux. Below it, boiling helps remove heat. Above it, vapor can blanket the hot surface, reducing heat transfer and causing temperature to spike. In nuclear engineering, this transition is known as departure from nucleate boiling (or dryout), and reactors are designed to operate with a deliberate margin below it. In chip cooling, the stakes are different, but the physics rhyme: if the boiling process becomes unstable, the cooling surface can lose contact with liquid exactly when heat removal is most needed.
So the real question for a nuclear-inspired cooling startup is not simply:
“Can it boil?”
It is:
“How much margin does it maintain before critical heat flux, under real AI workload transients, pump faults, coolant aging, and partial blockage?”
That is the engineer’s question.
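The margin question can be expressed as a simple ratio, analogous to the departure-from-nucleate-boiling ratio used in reactor thermal design: critical heat flux divided by the actual local heat flux. The sketch below evaluates that ratio at the worst point of a workload trace. The numbers are hypothetical and do not describe any specific cooling product.

```python
# Critical-heat-flux margin as a ratio: CHF / local heat flux.
# A ratio of 1.0 means the surface is at the boiling crisis; higher is safer.
# All heat-flux values (W/cm^2) below are hypothetical.

def chf_margin_ratio(critical_heat_flux_w_cm2, local_heat_flux_w_cm2):
    """Ratio of critical heat flux to the actual local heat flux."""
    if local_heat_flux_w_cm2 <= 0:
        raise ValueError("local heat flux must be positive")
    return critical_heat_flux_w_cm2 / local_heat_flux_w_cm2

def worst_case_margin(critical_heat_flux_w_cm2, heat_flux_samples_w_cm2):
    """Smallest margin seen across a trace of local heat-flux samples."""
    return chf_margin_ratio(critical_heat_flux_w_cm2,
                            max(heat_flux_samples_w_cm2))

# Hypothetical trace: steady load followed by a workload spike.
margin = worst_case_margin(150.0, [60.0, 75.0, 100.0])
print(f"worst-case CHF margin: {margin:.2f}")
```

A real assessment would also have to lower the critical heat flux itself for degraded conditions such as reduced flow or aged coolant, which this sketch does not model.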
7. Best nuclear analogy: every AI rack becomes a tiny reactor-core problem
Do not say “AI data centers are like nuclear reactors” in a sensational way. Say this:
“The analogy is not radiation. The analogy is heat-flux control. A dense AI rack increasingly resembles a reactor-core thermal problem: enormous heat generation packed into a small volume, where failure is less about average temperature and more about local hotspots, fluid stability, and safety margins.”
That is both accurate and memorable.
8. The strongest one-line thesis
Use this:
“The future of AI may depend less on colder rooms and more on mastering microscopic bubbles at the surface of a chip.”
Or:
“The next AI infrastructure breakthrough may not be a bigger GPU. It may be a better way to boil liquid safely a few millimeters above one.”
9. The post needs to separate four cooling categories
Right now it lumps advanced cooling together. Add a simple taxonomy:
Air cooling: fans, heat sinks, CRAC/CRAH systems, hot/cold aisle containment. Mature, serviceable, but increasingly strained at high rack densities.
Direct-to-chip liquid cooling: coolant flows through cold plates attached to CPUs/GPUs. Strong near-term path for high-density AI, easier to adopt than full immersion.
Single-phase immersion: servers submerged in dielectric liquid that does not boil. Simpler fluid behavior, but large tanks, fluid handling, and serviceability can be difficult.
Two-phase / adaptive phase cooling: coolant changes phase or partially boils near hot surfaces, using latent heat and bubble dynamics to remove more heat with smaller temperature differences.
Then say:
Ferveret’s pitch is that it keeps some of the heat-transfer advantage of two-phase cooling while avoiding some of the complexity of traditional saturated-boiling immersion systems.
That sentence is much more credible than “nuclear cooling will save AI.”
10. Missing metrics: PUE is not enough
The post should not only talk about electricity. It should introduce the right scorecard.
PUE (Power Usage Effectiveness): total facility energy divided by IT energy. Lower is better; 1.0 is the theoretical ideal.
WUE (Water Usage Effectiveness): water used per unit of IT energy, usually in liters per kWh.
CUE (Carbon Usage Effectiveness): carbon emissions from facility energy use per unit of IT energy.
ERF (Energy Reuse Factor): the share of facility energy that is productively reused as heat elsewhere.
Tokens per joule: how much useful AI output is produced per unit of energy.
Tokens per liter: useful AI output per unit of water consumed.
The most important new metric:
Useful intelligence per joule per liter per dollar.
That reframes the whole thing. The industry should not just optimize for more compute. It should optimize for useful output per constrained resource.
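The scorecard above reduces to a few ratios. The sketch below computes them from hypothetical monthly totals. One choice here is an assumption rather than a standard: tokens per joule is computed against total facility energy, so that cooling overhead counts against the result.

```python
# Scorecard ratios from facility totals. Input numbers are hypothetical.

JOULES_PER_KWH = 3.6e6

def pue(total_facility_kwh, it_kwh):
    """Power Usage Effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh

def wue_l_per_kwh(water_liters, it_kwh):
    """Water Usage Effectiveness in liters per kWh of IT energy."""
    return water_liters / it_kwh

def tokens_per_joule(tokens, total_facility_kwh):
    """Tokens per joule of total facility energy (assumed boundary)."""
    return tokens / (total_facility_kwh * JOULES_PER_KWH)

def tokens_per_liter(tokens, water_liters):
    """Tokens per liter of water consumed on site."""
    return tokens / water_liters

# Hypothetical month: 1.0 GWh IT load, 1.1 GWh total, 1.8 million liters.
it_kwh, total_kwh, liters, tokens = 1_000_000, 1_100_000, 1_800_000, 5e12
print(pue(total_kwh, it_kwh), wue_l_per_kwh(liters, it_kwh))
print(tokens_per_joule(tokens, total_kwh), tokens_per_liter(tokens, liters))
```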
11. The current post slightly overclaims this line
“Every major improvement in cooling directly translates into lower electricity demand and lower emissions.”
Not always.
Better:
“Every major cooling improvement can reduce overhead, prevent throttling, improve hardware utilization, and lower water use. But emissions only fall if the saved electricity is not immediately offset by more compute demand, and if the power comes from lower-carbon sources.”
This matters because of the rebound effect. Better cooling may reduce energy per token while also making it economically attractive to deploy many more tokens. Efficiency can lower unit cost and increase total consumption.
The more sophisticated line:
Cooling efficiency reduces the cost of intelligence. Whether it reduces total emissions depends on what we do with the savings.
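The rebound effect is easy to check with arithmetic. The sketch below uses made-up numbers: a 15% reduction in energy per token combined with a 30% increase in tokens served.

```python
# Rebound effect: total energy = energy per token * number of tokens.
# Both percentages in the example are hypothetical.

def total_energy_ratio(energy_per_token_reduction, token_demand_growth):
    """New total energy divided by old total energy."""
    return (1.0 - energy_per_token_reduction) * (1.0 + token_demand_growth)

ratio = total_energy_ratio(energy_per_token_reduction=0.15,
                           token_demand_growth=0.30)
print(f"total energy changes by {(ratio - 1.0) * 100:+.1f}%")
```

With these inputs, total energy use rises by about 10% even though each token became cheaper to produce. Total energy only falls if demand grows by less than roughly 17.6% (1 / 0.85 - 1).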
12. Missing water nuance: “waterless” needs a boundary
If the startup claims “zero water,” the post needs to clarify what that means.
Does it mean:
No water inside the server loop?
No evaporative cooling tower?
No on-site water consumption?
No water in the full electricity supply chain?
No water in chip fabrication?
No water at peak conditions?
No water only in certain climates?
This matters because “waterless cooling” can mean a closed-loop or dry heat-rejection architecture at the site, but the electricity generation behind the data center may still consume water depending on the grid mix. The best question is:
“Zero water where: at the chip, at the facility, or across the full lifecycle?”
That one line makes the post much sharper.
13. The hidden story is not cooling — it is site selection
Better cooling changes where data centers can be built.
If cooling needs less water, AI campuses can move closer to cheap solar, wind, geothermal, stranded power, retired industrial sites, cold climates, or places with available grid capacity. MIT News notes Ferveret’s founders argue water-free cooling could help data centers operate in dry regions with abundant solar energy.
That is a bigger implication than “servers do not melt.”
The deeper implication:
Cooling technology is becoming geography technology.
Whoever solves water-light, power-light, high-density cooling gets to choose better sites. That affects land economics, grid planning, sovereign AI, latency, national security, and local politics.
14. Obscure thought input: heat is the shadow price of intelligence
A very strong conceptual line:
Every AI token has a thermal shadow.
The visible output is text, image, code, video, or reasoning. The invisible output is heat. As models scale, the limiting question becomes:
How much intelligence can we produce before the heat, water, grid, and permitting costs dominate?
That is a more poetic and deeper version of the post.
15. Obscure thought input: the data center is becoming an exergy problem
Energy is conserved, but exergy is useful energy — energy available to do work.
Low-temperature waste heat from servers is hard to reuse. But if liquid cooling can produce warmer, more concentrated heat streams, that heat may become more useful for district heating, industrial preheating, greenhouses, absorption cooling, desalination preheat, or thermal storage.
The subtle point:
Better cooling is not just about removing heat. It is about upgrading waste heat into a usable thermal product.
A powerful line:
The next green data center will not just consume electricity. It will export compute and useful heat.
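The exergy argument can be quantified with the Carnot factor: heat delivered at absolute temperature T, with surroundings at T0, contains at most a fraction 1 - T0/T of work-equivalent energy. The temperatures below are illustrative, and the formula treats the heat as available at a single constant temperature, which is a simplification.

```python
# Exergy content of waste heat via the Carnot factor: 1 - T0 / T.
# Temperatures are illustrative; heat is assumed to be at constant temperature.

KELVIN_OFFSET = 273.15

def exergy_fraction(heat_temp_c, ambient_temp_c):
    """Fraction of a heat stream's energy that is available to do work."""
    heat_k = heat_temp_c + KELVIN_OFFSET
    ambient_k = ambient_temp_c + KELVIN_OFFSET
    return 1.0 - ambient_k / heat_k

air_cooled = exergy_fraction(heat_temp_c=30.0, ambient_temp_c=20.0)
liquid_cooled = exergy_fraction(heat_temp_c=60.0, ambient_temp_c=20.0)
print(f"30 C exhaust: {air_cooled:.1%}, 60 C coolant: {liquid_cooled:.1%}")
```

Under these assumptions, 60 C coolant carries several times more usable exergy per unit of heat than 30 C exhaust air, which is the physical reason warmer liquid loops make heat reuse more practical.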
16. Genius-level solution: use a “thermal SCRAM”
Borrow from nuclear safety culture.
A reactor has emergency shutdown logic. AI racks need a version of that:
Thermal SCRAM: instant workload throttling, power capping, job migration, checkpointing, and coolant-loop isolation when sensors detect bubble instability, pump degradation, leak risk, abnormal pressure, or hotspot formation.
This should not be a human operator decision. It should be automated across the stack:
chip firmware,
GPU driver,
rack controller,
coolant distribution unit,
facility digital twin,
scheduler,
grid interface.
The key idea:
Cooling should not be passive plumbing. It should be part of the AI control plane.
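As a rough illustration of what automated "thermal SCRAM" logic might look like, the sketch below maps a telemetry snapshot to a list of protective actions. The field names, thresholds, and action names are all hypothetical. A real system would live in firmware and rack controllers with proper debouncing and redundancy.

```python
# Toy thermal-SCRAM decision logic. Every name and threshold is hypothetical.
from dataclasses import dataclass

@dataclass
class RackTelemetry:
    hotspot_temp_c: float
    pump_flow_fraction: float  # measured flow divided by nominal flow
    leak_detected: bool

def thermal_scram_actions(telemetry, hotspot_limit_c=95.0,
                          min_flow_fraction=0.7):
    """Return protective actions, most severe condition first."""
    actions = []
    if telemetry.leak_detected:
        actions += ["isolate_coolant_loop", "checkpoint_jobs", "migrate_jobs"]
    if telemetry.pump_flow_fraction < min_flow_fraction:
        actions.append("cap_rack_power")
    if telemetry.hotspot_temp_c >= hotspot_limit_c:
        actions.append("throttle_workload")
    return actions

# Example: degraded pump plus a hotspot, no leak.
print(thermal_scram_actions(RackTelemetry(97.0, 0.5, False)))
```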
17. Genius-level solution: make the scheduler thermal-aware
Today, AI scheduling mostly thinks about GPUs, memory, network topology, cost, and availability. Future schedulers should also think about:
rack inlet temperature,
coolant temperature,
pump efficiency,
local electricity price,
carbon intensity,
water constraints,
thermal headroom,
heat-reuse demand,
weather forecast,
grid congestion,
maintenance risk.
Then jobs can be placed based on thermal economics.
Example:
Train large models at night when ambient temperatures are low and renewable power is available.
Run inference-heavy workloads where latency matters.
Move flexible batch jobs to sites with cooler weather or lower-carbon power.
Throttle racks before cooling systems enter inefficient operating regions.
Send heat-heavy jobs to facilities with district-heating demand.
The brilliant line:
AI should learn where it is thermodynamically cheapest to think.
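A thermal-aware placement rule can be sketched in a few lines: filter out sites without enough cooling headroom, then pick the cheapest once carbon is priced in. The site names, prices, carbon intensities, and the carbon price are invented for illustration. A production scheduler would also weigh latency, data locality, and the other inputs listed above.

```python
# Toy thermal-aware job placement. All site data and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    price_usd_per_kwh: float
    carbon_kg_per_kwh: float
    thermal_headroom_kw: float  # spare heat-rejection capacity

def place_job(job_kw, sites, carbon_price_usd_per_kg=0.05):
    """Pick the cheapest site (energy plus carbon cost) that can reject the heat."""
    feasible = [s for s in sites if s.thermal_headroom_kw >= job_kw]
    if not feasible:
        return None
    return min(
        feasible,
        key=lambda s: s.price_usd_per_kwh
        + carbon_price_usd_per_kg * s.carbon_kg_per_kwh,
    )

sites = [
    Site("desert_solar", 0.04, 0.10, 500.0),
    Site("urban_grid", 0.09, 0.40, 2000.0),
    Site("nordic_hydro", 0.05, 0.02, 150.0),
]
chosen = place_job(300.0, sites)
print(chosen.name if chosen else "no site has enough thermal headroom")
```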
18. Genius-level solution: “tokens per thermal watt”
The AI industry needs a new KPI:
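Tokens per thermal watt removed. (Strictly, a watt is a rate, so this means tokens per second per watt, which is the same as tokens per joule of heat removed.)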
Not just tokens per GPU watt. Not just PUE. Not just utilization. The real metric should account for the full thermal burden of AI output.
A cloud provider could report:
tokens per kWh,
tokens per liter,
tokens per kg CO₂e,
tokens per dollar,
tokens per rack-hour,
tokens per degree of coolant temperature rise.
That would turn sustainability into an operational metric instead of a marketing claim.
19. Genius-level solution: standardize a “rack thermal passport”
Every high-density AI rack should ship with a thermal passport:
maximum heat load,
coolant flow requirements,
allowed inlet temperature,
thermal ramp rate,
failure-mode behavior,
pump ride-through time,
leak-detection requirements,
coolant compatibility,
service procedure,
expected PUE/WUE range,
heat-reuse temperature,
sensor map,
critical heat-flux margin,
warranty conditions.
This would let utilities, insurers, operators, and local governments evaluate AI infrastructure more intelligently.
Best line:
We should not permit megawatt AI rooms using laptop-era thermal disclosure.
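A thermal passport is essentially a structured record that a facility can check itself against. The sketch below models a small subset of the fields listed above and flags mismatches. The field names and the flow and temperature values are hypothetical. Only the 132 kW heat load echoes the rack figure cited earlier.

```python
# Minimal "rack thermal passport" record with a facility compatibility check.
# Field names and example limits are hypothetical, not an existing standard.
from dataclasses import dataclass

@dataclass(frozen=True)
class RackThermalPassport:
    max_heat_load_kw: float
    min_coolant_flow_l_per_min: float
    max_inlet_temp_c: float

def passport_violations(passport, facility_heat_rejection_kw,
                        facility_flow_l_per_min, facility_supply_temp_c):
    """List which passport requirements the facility fails to meet."""
    violations = []
    if facility_heat_rejection_kw < passport.max_heat_load_kw:
        violations.append("heat_rejection_capacity")
    if facility_flow_l_per_min < passport.min_coolant_flow_l_per_min:
        violations.append("coolant_flow")
    if facility_supply_temp_c > passport.max_inlet_temp_c:
        violations.append("inlet_temperature")
    return violations

passport = RackThermalPassport(max_heat_load_kw=132.0,
                               min_coolant_flow_l_per_min=150.0,
                               max_inlet_temp_c=45.0)
print(passport_violations(passport, 200.0, 120.0, 40.0))
```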
20. Genius-level solution: thermal batteries for data centers
Cooling demand often peaks when the grid is stressed and ambient temperatures are high. Add thermal storage:
chilled-water tanks,
phase-change materials,
underground thermal storage,
ice storage where appropriate,
high-temperature liquid loops,
waste-heat storage for district use.
This lets data centers shift cooling loads away from grid peaks. The DOE has explicitly highlighted flexibility, onsite generation, storage, and grid-aware strategies as part of meeting rising data-center demand.
A great post line:
The next AI campus may need batteries for electrons and batteries for heat.
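To get a feel for the scale of a "battery for heat," the sketch below sizes a chilled-water tank from the energy balance E = m * c_p * delta_T. The 1 MW cooling load, two-hour shift, and 8 K usable temperature swing are assumed example values, and tank losses and stratification are ignored.

```python
# Chilled-water storage sizing: stored energy = mass * c_p * delta_T.
# Load, duration, and temperature swing are assumed example values.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_M3 = 1000.0
JOULES_PER_KWH = 3.6e6

def tank_volume_m3(cooling_load_kw, hours, usable_delta_t_k):
    """Water volume needed to absorb the cooling load for the given duration."""
    energy_j = cooling_load_kw * hours * JOULES_PER_KWH
    mass_kg = energy_j / (WATER_SPECIFIC_HEAT_J_PER_KG_K * usable_delta_t_k)
    return mass_kg / WATER_DENSITY_KG_PER_M3

volume = tank_volume_m3(cooling_load_kw=1000.0, hours=2.0, usable_delta_t_k=8.0)
print(f"{volume:.0f} m^3")
```

Under these assumptions, shifting 1 MW of cooling for two hours takes roughly 215 cubic meters of water, which is large but well within the range of ordinary industrial tanks.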
21. Genius-level solution: use AI to cool AI
This can sound cliché, so make it concrete.
AI cooling control should optimize:
fan speed,
pump speed,
valve positions,
coolant setpoints,
rack power caps,
job placement,
weather-dependent heat rejection,
chiller staging,
fault detection,
predictive maintenance,
leak-risk detection,
thermal anomaly detection.
This is not “AI magic.” It is control theory plus sensors plus operations data.
A strong line:
The cooling system should know the workload before the heat arrives.
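"Knowing the workload before the heat arrives" is feed-forward control: the pump setpoint follows the scheduled power, and a feedback term corrects for whatever the prediction missed. The sketch below is a deliberately simple proportional version. The gain, minimum speed, and rated power are hypothetical tuning values.

```python
# Feed-forward plus proportional feedback for pump speed (0.0 to 1.0).
# Gain, minimum speed, and rated power are hypothetical tuning values.

def pump_speed_setpoint(scheduled_power_kw, rated_power_kw,
                        measured_temp_c, target_temp_c,
                        gain_per_k=0.02, min_speed=0.2):
    """Pump speed fraction from scheduled load, corrected by temperature error."""
    feedforward = scheduled_power_kw / rated_power_kw
    feedback = gain_per_k * (measured_temp_c - target_temp_c)
    return min(1.0, max(min_speed, feedforward + feedback))

# A job scheduled at half of rated rack power, coolant currently on target.
print(pump_speed_setpoint(66.0, 132.0, measured_temp_c=45.0, target_temp_c=45.0))
```

The design choice is that the scheduler's job queue, not the temperature sensor, drives most of the response. The sensor only trims the error, so the pump is already ramping when the workload starts.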
22. The real product-market question
The technical question is not just: “Does it remove heat?”
The commercial question is:
Can it retrofit into existing data centers without turning operations into a maintenance nightmare?
Ferveret says its approach uses compact, rack-ready, server-level modules rather than large shared immersion tanks, and argues this improves integration and serviceability. That is important because data-center operators care about uptime, technician workflow, insurance, spare parts, warranties, and mean time to repair as much as raw cooling performance.
A great line:
The winning cooling technology will not be the one with the prettiest heat-transfer curve. It will be the one technicians can service at 3 a.m. without draining a swimming pool of exotic fluid.
23. Add the maintenance and reliability layer
The post should ask:
What happens when a pump fails?
What happens when coolant chemistry changes over five years?
What happens if microbubbles accumulate where they should not?
What happens during rapid workload spikes?
What happens during partial blockage?
Can a single server be removed without disturbing the rack?
Can the coolant damage cables, seals, thermal interface materials, plastics, adhesives, or labels?
What is the leak-detection method?
Is the coolant flammable?
What is the global warming potential?
Is it PFAS-linked?
What is the disposal process?
Can standard data-center technicians work on it?
This is where infrastructure hype becomes real engineering.
24. The coolant chemistry issue is a missing landmine
A lot of advanced cooling depends on specialized fluids. That raises questions about:
PFAS regulation,
global warming potential,
toxicity,
flammability,
dielectric strength,
material compatibility,
evaporation or fluid loss,
supply-chain availability,
disposal,
insurance,
worker exposure,
long-term degradation.
MIT News says Ferveret uses a low-boiling-point liquid with no toxic PFAS “forever chemicals,” and Ferveret’s own page says its fluid is low-GWP and not regulated under PFAS guidelines.
That is a major point to include because many people hear “liquid cooling” and immediately worry about leaks or exotic chemicals.
25. The better audience question
The current ending asks:
“Would you rather see more focus on making AI hardware more efficient, or on radically improving how we cool the massive data centers that run it?”
This is a false choice.
Better:
“Where should the next efficiency breakthrough come from: better models, better chips, better cooling, better scheduling, or better grid integration?”
Even better:
“What should AI optimize for next: more tokens per GPU, more tokens per watt, more tokens per liter, or more tokens per dollar?”
Best:
“If intelligence is becoming an industrial product, what is the right unit of efficiency: FLOPs, tokens, joules, liters, dollars, or carbon?”
26. Add a mini “cooling stack” graphic
The post would be much more shareable with a simple visual:
Level 1: Chip — lower-voltage architectures, chiplets, packaging, memory proximity.
Level 2: Server — thermal interface, cold plates, immersion, phase-change cooling.
Level 3: Rack — coolant distribution, leak detection, redundancy, serviceability.
Level 4: Facility — chillers, dry coolers, heat exchangers, thermal storage.
Level 5: Grid — carbon-aware scheduling, demand response, onsite energy.
Level 6: Society — water rights, permitting, community impact, heat reuse.
Caption:
AI cooling is not one technology. It is a stack.
27. Add a Sankey diagram idea
The best simple diagram:
1 MW into data center → 900 kW IT load → all becomes heat
100 kW cooling / pumps / fans / overhead → also becomes heat
Final output: useful AI tokens, plus 1 MW of heat rejected somewhere
Then compare:
Old facility: PUE 1.4
Modern efficient facility: PUE 1.1
Advanced phase cooling target: PUE near 1.03, according to Ferveret’s reported UCLA testing.
The lesson:
The goal is not to eliminate heat. The goal is to make nearly every watt go to compute before it becomes heat.
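The comparison is simple arithmetic: overhead power equals IT power times (PUE - 1). The sketch below applies that to the 900 kW IT load from the diagram, using the three PUE values listed above. It also shows that 1 MW in with 900 kW of IT load corresponds to a PUE of about 1.11.

```python
# Overhead implied by a PUE value: overhead = IT power * (PUE - 1).
# The 900 kW IT load and the PUE values come from the comparison above.

def overhead_kw(it_load_kw, pue):
    """Non-IT power (cooling, pumps, fans, losses) implied by a PUE value."""
    return it_load_kw * (pue - 1.0)

def implied_pue(total_kw, it_load_kw):
    """PUE implied by total facility power and IT power."""
    return total_kw / it_load_kw

it_load = 900.0
for label, pue in [("old facility", 1.4),
                   ("modern efficient facility", 1.1),
                   ("advanced phase cooling target", 1.03)]:
    print(f"{label}: {overhead_kw(it_load, pue):.0f} kW overhead")

print(f"diagram PUE: {implied_pue(1000.0, it_load):.2f}")
```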
28. The best “missing caveat” for credibility
Add this:
Cooling breakthroughs do not replace model efficiency. The greenest watt is still the watt the model never uses.
This protects the post from engineers who will say, correctly, that algorithmic efficiency, sparsity, quantization, distillation, memory architecture, and inference optimization matter more than cooling in many cases.
A very strong line:
Cooling is the tax code of AI infrastructure. Model efficiency is the income. You need both.
29. Better version of the post
Here is a stronger rewrite:
AI’s next bottleneck may not be raw compute. It may be heat flux.
Every watt that enters an AI server eventually becomes heat. As GPU racks move from traditional data-center densities into 100 kW industrial-machine territory, the question is no longer just “Can we buy enough chips?”
It is: Can we remove the heat without wasting electricity, water, space, and uptime?
A startup called Ferveret, founded by MIT-linked nuclear engineering researchers, is attacking this problem with a cooling method inspired by nuclear reactor heat-transfer physics. Its Adaptive Phase Cooling approach uses subcooled boiling: tiny bubbles form at the hot chip surface, detach rapidly, and recondense in the surrounding liquid, refreshing the surface and pulling heat away more efficiently.
That matters because air cooling is reaching its limits for dense AI workloads. Direct-to-chip liquid cooling is already becoming essential. The next frontier may be controlled two-phase cooling, where the phase change of a fluid removes much more heat with smaller temperature differences.
But the real story is bigger than “servers are overheating.”
AI infrastructure is becoming a chip-to-grid thermodynamic machine. The thermal path now runs from silicon, to coolant, to rack, to facility, to grid, to local water and climate constraints.
The winners in AI infrastructure may not only be the companies with the fastest GPUs. They may be the companies that can turn the highest percentage of every watt into useful tokens before it becomes waste heat.
The right metric is not just FLOPs. It is useful intelligence per joule, per liter, per dollar, and per square foot.
The next AI breakthrough may not be a bigger model. It may be a better way to move heat.
30. More viral but still accurate version
The future of AI may depend on boiling liquid a few millimeters above a GPU.
That sounds strange, but it is where AI infrastructure is heading.
Modern AI racks can consume more than 100 kW. Almost all of that power eventually becomes heat. At that density, fans are no longer enough. Cooling becomes a first-order constraint on how much AI we can deploy, where we can deploy it, and how much water and electricity it costs.
Now companies are borrowing ideas from nuclear engineering, especially subcooled boiling, to remove heat faster and more efficiently from AI chips.
The deeper point: AI is not just a software revolution. It is a thermodynamics problem.
The next frontier is not only better chips. It is better heat removal, better water strategy, better grid integration, better scheduling, and better metrics for useful output per watt.
We are entering the era of thermal intelligence.
31. Best short caption
AI is becoming heat-flux-limited. Every watt that enters a GPU becomes heat, and dense AI racks are pushing air cooling past its limits. The next infrastructure breakthrough may be controlled phase-change cooling, thermal-aware scheduling, and measuring AI not just in FLOPs — but in useful tokens per joule, liter, dollar, and square foot.
32. Best comment under the post
The key distinction: cooling does not eliminate the server’s energy use. It reduces the overhead and constraints of moving waste heat. The real metric should be useful AI output per joule, per liter, and per dollar — not just raw GPU count or PUE.
33. Killer questions to ask the startup
Ask these before accepting the hype:
Performance: What heat flux can the system handle in W/cm², not just kW per rack?
Benchmarking: Was the 15% efficiency improvement tested against the same GPUs, same workload, same ambient conditions, same power caps, and same facility assumptions?
Scope: Does the efficiency gain come from better cooling alone, from reduced throttling, from higher sustained clocks, or from power-control software?
Water: Does “zero water” mean no on-site water, no evaporative cooling, or no water across the full lifecycle?
Reliability: What happens during pump failure, coolant loss, partial blockage, or workload spikes?
Safety margin: What is the critical heat-flux margin under worst-case AI workloads?