Comparing CRAH and Liquid Cooling System Failure Modes


High-density data centre facilities need accurate, dependable cooling to avoid thermal runaway and equipment failure. Two popular approaches, Computer Room Air Handler (CRAH) units and liquid cooling systems, each present distinct operational hazards. Understanding their failure modes is critical for maximising uptime, designing redundancy, and meeting the cooling requirements of AI, HPC, and hyperscale workloads.

Mechanical Complexity and Common CRAH Failures

CRAH systems cool air via chilled water coils and rely heavily on external chilled water plants and airflow design. The most common failure mode in CRAH environments involves airflow disruption, typically due to fan malfunction or filter clogging. These can cause hot spots to form rapidly in high-density racks.

Another common issue is valve or coil failure within the CRAH unit. The system may overcool or undercool certain areas if a control valve sticks or fails to modulate, leading to thermal imbalance. Since CRAH cooling depends on chilled water supplied from centralised plant systems, any upstream chiller or pump failure will directly compromise cooling availability. Additionally, if raised floor plenums or ducts are poorly maintained or obstructed, airflow can be severely restricted even if the CRAH unit itself remains operational.

CRAH units can also be slow to restart after a power outage, especially if building management systems or chillers require manual resets. Although many facilities use backup generators, restoring chilled water circulation can take longer than the room can stay within acceptable thermal thresholds, especially as rack power densities rise.
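To illustrate why restart delays matter at high densities, here is a rough sketch of a ride-through estimate. It assumes a lumped model in which all IT heat goes into the room air alone, ignoring the thermal mass of equipment, walls, and residual chilled water, so it gives a pessimistic lower bound. The function name, parameters, and example figures are hypothetical and not taken from any specific facility.

```python
def ride_through_seconds(it_load_kw, air_volume_m3, allowed_rise_k,
                         air_density=1.2, air_cp=1.005):
    """Seconds until room air warms by allowed_rise_k with no cooling.

    Assumed lumped model: every kW of IT load heats the room air only.
    air_density is in kg/m^3 and air_cp in kJ/(kg*K), typical values
    for air near room temperature.
    """
    if it_load_kw <= 0:
        raise ValueError("it_load_kw must be positive")
    air_mass_kg = air_volume_m3 * air_density
    stored_kj = air_mass_kg * air_cp * allowed_rise_k
    return stored_kj / it_load_kw  # kJ divided by kJ/s gives seconds


def restart_is_too_slow(restart_delay_s, it_load_kw, air_volume_m3,
                        allowed_rise_k):
    """True if cooling returns after the air has exceeded the allowed rise."""
    budget = ride_through_seconds(it_load_kw, air_volume_m3, allowed_rise_k)
    return restart_delay_s > budget


# Hypothetical room: 500 kW of IT load, 1500 m^3 of air, 10 K allowed rise.
print(round(ride_through_seconds(500, 1500, 10), 1))  # about 36 seconds
print(restart_is_too_slow(120, 500, 1500, 10))        # True
```

Real rooms hold out longer than this bound because racks, floors, and pipework absorb heat too, but the inverse relationship with load is the point: doubling the IT load halves the time available.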

Liquid Cooling System Risks and Failure Points

Liquid cooling systems—whether direct-to-chip or rear-door heat exchangers—present a different set of risks. The most obvious is fluid leakage. While rare with proper installation and maintenance, a leak in high-density environments can cause immediate damage to IT equipment or result in costly shutdowns for containment and remediation.

Another failure mode involves pump malfunction. Liquid cooling systems depend on circulation pumps to deliver coolant to precise locations. If a pump fails without a redundant loop in place, affected servers can overheat within seconds. Additionally, air bubbles trapped in the closed loop can impair heat transfer and lead to gradual temperature creep, which often goes undetected until performance thresholds are breached.
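Gradual temperature creep is easier to catch by watching the trend than by waiting for an absolute threshold. The sketch below fits a least-squares line to recent coolant return temperatures and flags a sustained upward slope. The function names and the 0.5 °C per hour limit are illustrative assumptions, not values from any vendor or standard.

```python
def creep_rate_per_hour(samples):
    """Least-squares slope of (time_s, temp_c) samples, in degC per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den * 3600.0  # convert degC per second to degC per hour


def is_creeping(samples, limit_c_per_hour=0.5):
    """True if the fitted trend rises faster than the (assumed) limit."""
    return creep_rate_per_hour(samples) > limit_c_per_hour


# Hypothetical readings every 10 minutes, rising 0.2 degC per reading.
readings = [(i * 600, 30.0 + 0.2 * i) for i in range(7)]
print(is_creeping(readings))  # True: the fitted slope is about 1.2 degC/hour
```

A trend check like this complements fixed alarm thresholds rather than replacing them; a sudden pump failure still needs an immediate flow or pressure alarm.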

Sensor calibration drift is another concern. Since liquid cooling systems often rely on embedded temperature and flow sensors, inaccurate readings can cause the system to overreact or underperform. Without regular recalibration, this may result in inefficient cooling or unnecessary shutdowns triggered by false alarms.
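One way to spot drift between recalibrations is to cross-check the sensors against an independent measurement. The sketch below compares the heat carried away by the coolant, computed from flow and temperature readings as Q = m_dot x cp x delta T, with the electrical power drawn by the IT load. It assumes a water-like coolant and that nearly all IT heat is captured by the liquid loop. The function names and the 15% tolerance are hypothetical; glycol mixtures and partial heat capture would need different constants.

```python
def coolant_heat_kw(flow_lpm, supply_c, return_c,
                    density_kg_per_l=1.0, cp_kj_per_kg_k=4.186):
    """Heat removed by the loop in kW, from flow and temperature readings.

    Defaults assume plain water; adjust for glycol mixtures.
    """
    mass_flow_kg_s = flow_lpm / 60.0 * density_kg_per_l
    return mass_flow_kg_s * cp_kj_per_kg_k * (return_c - supply_c)


def sensors_plausible(flow_lpm, supply_c, return_c, it_power_kw,
                      tolerance=0.15):
    """True if the coolant heat balance matches metered IT power.

    tolerance is an assumed relative error band, not a standard value.
    """
    if it_power_kw <= 0:
        raise ValueError("it_power_kw must be positive")
    measured = coolant_heat_kw(flow_lpm, supply_c, return_c)
    return abs(measured - it_power_kw) / it_power_kw <= tolerance


# Hypothetical loop: 40 kW of metered IT load, 30 degC supply, 40 degC return.
print(sensors_plausible(60, 30.0, 40.0, 40.0))  # True: about 41.9 kW measured
print(sensors_plausible(40, 30.0, 40.0, 40.0))  # False: flow reading looks low
```

A failed check does not say which sensor has drifted, only that the readings are mutually inconsistent and worth inspecting before they trigger a false alarm or mask a real one.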

Moreover, integration challenges with building management systems (BMS) and data centre infrastructure management (DCIM) platforms can lead to blind spots in monitoring. Unlike CRAH systems, which are generally well-supported by legacy BMS infrastructure, liquid cooling systems often require specialised integrations that, if misconfigured, can delay incident response.

Speed and Severity of Failures

CRAH failures typically unfold over a longer timeline, giving operators more time to respond: air temperature rises gradually, and alarms can trigger before IT loads reach critical thresholds. In contrast, liquid cooling failures tend to be more abrupt. Pump loss, blockage, or leakage can cause rapid overheating with little warning, particularly in direct-to-chip setups, where cold plates sit directly on heat-generating components and there is little thermal buffer once coolant flow stops.

Liquid cooling systems, therefore, demand stricter maintenance protocols and closer monitoring. CRAH systems, though older and more space-inefficient, offer greater fault tolerance through air mixing and redundancy at the room level. However, they are less effective in managing the escalating thermal loads from modern CPUs and GPUs without extensive overprovisioning.

Designing for Redundancy and Recovery

Both cooling strategies can be made fault-tolerant through proper design. CRAH-based facilities benefit from N+1 unit configurations, dual chilled water loops, and zone-based cooling separation. Meanwhile, liquid cooling systems require loop redundancy, pressure and leak monitoring systems, and emergency failover to air cooling in case of primary system failure.
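As a simple illustration of the N+1 idea, the check below asks whether the remaining units can still carry the heat load when the largest single unit is out of service. The function name and the capacities are hypothetical, and the sketch assumes unit capacities simply add up, which ignores airflow distribution and zoning effects.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """True if the load is still covered with the largest unit offline."""
    if not unit_capacities_kw:
        return False
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= heat_load_kw


# Hypothetical hall with four 100 kW CRAH units.
units = [100, 100, 100, 100]
print(survives_single_failure(units, 300))  # True: three units cover 300 kW
print(survives_single_failure(units, 320))  # False: no longer N+1 at 320 kW
```

The same check applies to pumps or coolant distribution units in a liquid loop; removing the largest unit rather than an arbitrary one keeps the result conservative when capacities differ.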

Some high-density data centres now adopt a hybrid model—using CRAH for general thermal management and liquid cooling for hotspot mitigation. This approach allows operators to leverage the response buffer of air cooling while maintaining the precision of liquid systems.

Conclusion

In high-density environments, neither CRAH nor liquid cooling systems are immune to failure. CRAH failures develop more slowly and the units are easier to maintain, whereas liquid cooling fails more abruptly but removes heat far more efficiently. The best approach is often to combine both systems carefully, balancing their different risks against the criticality of the workloads they support.

Visit Canatec to upgrade your data centre’s cooling infrastructure for high-density loads.
