Extreme heat is not just a meteorology problem; it’s a public health deadline. When thermometers surge, people suffer—especially the most vulnerable in cities with aging power grids, crowded housing, or limited access to cooling. As climate change nudges heat waves toward longer durations and higher peaks, forecasts become lifelines: they guide hospital preparations, energy management, and evacuation decisions. Yet traditional weather models have struggled to keep pace with the heat, particularly when you look more than a week or two ahead. That’s where a new breed of forecasting enters the picture: AI-powered weather prediction models that learn from data to predict the next days and weeks of weather without being tied to explicit physical equations alone.
In a recent study led by Kelsey E. Ennis at Colorado State University’s Department of Atmospheric Science, researchers put two such AI-based models—GraphCast and Pangu-Weather—head-to-head against NOAA’s established Global Ensemble Forecast System (GEFS) for heat waves across the contiguous United States. The goal wasn’t to replace physics-based forecasting but to test whether learning-based systems can bridge gaps in predicting extreme heat at medium-range and subseasonal-to-seasonal timescales (up to about 20 days). The findings are cautiously hopeful: GraphCast, in particular, showed performance that could rival or even surpass a traditional forecast in many scenarios, especially when it came to larger regions and longer lead times.
What makes this comparison exciting isn’t just the numbers. It’s a glimpse into a future where humans and machines collaborate to anticipate events that shape health, energy demand, and everyday life. The CSU team’s work anchors a broader shift: AI systems catching patterns in sprawling data sets—patterns that emerge from oceans, soils, winds, and city heat islands—while still acknowledging their limits. The study does not claim a revolution in weather prediction. It argues for measured optimism: AI-based forecasts can be a real boost for predictability of extreme heat, if used with care and rigorous verification.
What AI Weather Models Do Differently
Traditional numerical weather prediction (NWP) models run on the laws of physics, solving equations that describe the atmosphere in near-continuous time. It’s a monumental, physics-bound machine. The new AIWP (artificial intelligence-based weather prediction) models take a different path. They learn from past data—thousands of days of atmospheric states, temperatures, winds—and forecast the future by recognizing patterns that human analysts might miss. In practice, GraphCast uses a graph neural network—an approach adept at modeling connected systems, such as grid points networked across the globe—while Pangu-Weather relies on a 3D Earth-specific transformer, a kind of attention-based engine that can weigh how conditions at one location relate to conditions at another over space and time. Both are trained on ERA5, a historical reanalysis dataset, and thus learn from decades of atmospheric behavior.
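To make the graph idea concrete, here is a minimal, hypothetical sketch of one message-passing step over a handful of grid points. The shapes, weights, and update rule are toy illustrations of the general technique, not GraphCast’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 5 grid points (nodes), each carrying a small feature vector
# (e.g., temperature, humidity, wind components at one location).
num_nodes, feat_dim = 5, 4
node_feats = rng.normal(size=(num_nodes, feat_dim))

# Directed edges (sender -> receiver) linking neighboring grid points.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 0)]

# In a real model these weights are learned; random here for illustration.
W_msg = rng.normal(size=(feat_dim, feat_dim))
W_upd = rng.normal(size=(2 * feat_dim, feat_dim))

def message_passing_step(feats):
    """One round of neighbor aggregation: each node collects transformed
    messages from its senders, then updates its own state."""
    incoming = np.zeros_like(feats)
    for sender, receiver in edges:
        incoming[receiver] += np.tanh(feats[sender] @ W_msg)
    # Concatenate own state with aggregated messages, project back down.
    combined = np.concatenate([feats, incoming], axis=1)
    return np.tanh(combined @ W_upd)

node_feats = message_passing_step(node_feats)
print(node_feats.shape)  # (5, 4): same grid, updated features
```

In a trained model the weight matrices are learned and the graph spans the entire globe at multiple resolutions; the sketch only shows the neighbor-aggregation pattern that lets information about one location inform its surroundings.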
These models aren’t merely “black boxes.” They are autoregressive: they predict a state, feed that prediction back in, and predict again, rolling out forecasts day by day. That’s how they extend from hours to days to weeks. In this study, GraphCast and Pangu drew on training data spanning 1979–2017 and were tested on 2018–2023 events, with their surface temperature forecasts compared against ERA5, the gold standard used for verification. The researchers also included NOAA’s UFS GEFS as a traditional benchmark to see where AI could actually add value.
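The rollout loop itself is easy to sketch. In the toy example below, step_model is a hypothetical stand-in for a trained network; the point is only the autoregressive structure, in which each prediction becomes the next input.

```python
import numpy as np

def step_model(state):
    """Stand-in for a trained AIWP model: maps the current atmospheric
    state to the state one time step (e.g., 6 hours) later. Here it just
    applies a mild damping so the example runs end to end."""
    return 0.99 * state + 0.01 * state.mean()

def rollout(initial_state, num_steps):
    """Autoregressive forecast: feed each prediction back in as input."""
    states = [initial_state]
    for _ in range(num_steps):
        states.append(step_model(states[-1]))
    return np.stack(states)

# 6-hourly steps for 20 days = 80 steps on a toy 32x64 temperature grid.
initial = np.random.default_rng(1).normal(loc=290.0, scale=5.0, size=(32, 64))
forecast = rollout(initial, num_steps=80)
print(forecast.shape)  # (81, 32, 64): initial state plus 80 forecast steps
```

The catch, of course, is that errors compound with every step of the loop, which is why skill at 10 to 20 days is so much harder to earn than skill at 2 days.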
The models didn’t just spit out single numbers. The researchers evaluated how well the forecasts captured heat waves defined by regional thresholds—specifically, periods when surface temperatures topped the 95th percentile for at least three consecutive days across large swaths of territory. In other words, this is not about predicting a single hot day, but about how well a forecast can anticipate a broad, dangerous heat event as it evolves. That nuance matters because heat waves are not uniform; they unfold over regions with complex terrain, coastlines, and urban systems.
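As a rough illustration of that event definition, the sketch below flags days where temperature exceeds the local 95th percentile for at least three consecutive days. The data are synthetic, and the study’s exact regional criteria (area coverage, percentile baseline) may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic daily surface temperatures in kelvin: (days, lat, lon).
temps = rng.normal(loc=300.0, scale=4.0, size=(365, 10, 10))

# Local 95th-percentile threshold at each grid point.
threshold = np.percentile(temps, 95, axis=0)
hot = temps > threshold  # boolean: daily exceedance per grid point

def heat_wave_days(hot_series, min_run=3):
    """Mark days belonging to runs of >= min_run consecutive hot days."""
    out = np.zeros_like(hot_series, dtype=bool)
    run_start = None
    for day, is_hot in enumerate(hot_series):
        if is_hot and run_start is None:
            run_start = day
        elif not is_hot and run_start is not None:
            if day - run_start >= min_run:
                out[run_start:day] = True
            run_start = None
    if run_start is not None and len(hot_series) - run_start >= min_run:
        out[run_start:] = True
    return out

# Apply along the time axis at one example grid point.
mask = heat_wave_days(hot[:, 5, 5])
print(int(mask.sum()), "heat-wave days at this grid point")
```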
One more layer of context: the study’s aim was regional and seasonal skill, not a dramatic, one-off win in a single city. With four regions in the contiguous United States and four seasons, the dataset captures a wide variety of heat-wave flavors—from the Pacific Northwest’s many rain-shadowed hills to the Southeast’s flat expanses and near-coast humidity. The complexity of these patterns—topography, soil moisture feedbacks, ocean influences, and urban heat islands—means any forecast system must generalize across a broad canvas.
Results: AI Models Show Promise, But with Biases
When the CSU team dug into the numbers, GraphCast repeatedly stood out as the more skillful AI model. Across the four seasons and four regions, GraphCast tended to produce smaller regional errors than Pangu and, in many instances, outperformed the NOAA GEFS, particularly at longer lead times. In a representative look at a Northwest heat wave in 2019, GraphCast’s forecast errors stayed relatively low for much of the event, while Pangu showed larger swings and a more erratic error pattern. The NOAA GEFS, meanwhile, tracked the heat wave reasonably well in some phases but struggled in others, especially during onset and decay at some locations.
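“Regional error” here boils down to comparing a forecast temperature field against the ERA5 analysis over a region. A common pair of summary numbers, sketched below with synthetic fields, is the area-weighted root-mean-square error and the mean bias (a negative bias means the forecast runs cool); the study’s exact metrics and weighting may differ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-m temperature fields over a region: "truth" (ERA5-like)
# and a forecast that runs slightly cool, as the AI models often did.
era5 = rng.normal(loc=303.0, scale=3.0, size=(40, 60))
forecast = era5 - 0.8 + rng.normal(scale=1.0, size=era5.shape)

# Latitude weighting matters on a real lat-lon grid; cos(lat) is standard.
lats = np.linspace(30.0, 49.0, era5.shape[0])
weights = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(era5)
weights /= weights.sum()

err = forecast - era5
rmse = np.sqrt(np.sum(weights * err**2))
bias = np.sum(weights * err)
print(f"RMSE: {rmse:.2f} K, bias: {bias:+.2f} K")  # bias < 0: cold bias
```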
Two case studies highlighted how the models behave differently as heat waves emerge and peak. The August 2011 Southeast heat wave and the September 2019 Northwest heat wave revealed a nuanced picture: UFS GEFS could capture the broad sweep of heat early on, but GraphCast often matched or exceeded its performance as the event unfolded. Pangu, by contrast, frequently exhibited a stronger bias, particularly a cool bias during the onset and in the early stages of several events. In one Southeast example, GraphCast outshone Pangu across the board and even narrowed the gap with UFS GEFS by late in the event. In the Northwest, GraphCast tended to be more stable, while Pangu’s errors fluctuated more dramatically.
Seasonal patterns reinforced a recurring theme: both GraphCast and Pangu tended to show a cold bias before and during heat waves in most seasons, a telling reminder that AI models still struggle to reproduce certain atmospheric thermodynamics and land-atmosphere feedbacks precisely when heat stress starts to ramp up. There were notable exceptions, though. Winter results showed that Pangu could exhibit a warm bias before heat-wave onset, suggesting different failure modes tied to the season’s unique background state. GraphCast, meanwhile, stayed more consistently conservative, with small cold biases that were often offset by its better performance during the peak heat period.
Beyond averages, the study mapped errors across grids to show where terrain and geography matter. Regions with rugged topography, such as the higher elevations of the Southeast’s mountains or the mountainous parts of the Northwest, posed larger challenges for all models, especially the NOAA GEFS, likely because coarse resolution struggles with complex terrain. GraphCast, with its graph-based approach, tended to do better in some of these areas but still showed biases tied to terrain and local features. The result is a candid portrait: the AI models can reduce some errors, but they are not a silver bullet for every location or season.
When the authors stitched together the full four-season, four-region picture, GraphCast emerged as the most consistently reliable among the AIWP models, especially for longer lead times and broader regional assessments. Pangu, while impressive in its architecture, showed more variability and larger biases in several cases. The study also found that forecasting skill tended to improve in the testing period (2018–2023) relative to the training period (1979–2017), suggesting that the models generalized well to more recent climate and atmospheric behavior. That’s an encouraging sign for real-world applicability, even as it underlines the need for ongoing evaluation as data and climate conditions evolve.
Crucially, the researchers did not treat these AI forecasts as definitive replacements for physical models. Instead, they framed AIWP models as complementary tools that can augment traditional forecasts, especially for the medium-range and S2S time horizons where heat waves pose the greatest risk and where skill has historically lagged. The study’s cautious optimism—AIWP models can catch big patterns and extend useful lead times, but require careful verification and ensemble use—reflects a mature stance on this rapidly advancing technology.
Why This Matters for Real-World Weather and the Climate Future
The most immediate implication of the CSU work is pragmatic: if AIWP models like GraphCast can reliably forecast heat waves with competitive skill several days to a couple of weeks out, emergency managers, power companies, water managers, and health agencies gain a valuable extra margin of safety. An improvement of 2–3°C in forecast accuracy over a large region might translate into hours or even days of additional warning for hospital staffing, cooling-center operations, or grid optimization. In a country already grappling with the stresses of summer heat, those margins can mean lives saved and resources allocated more efficiently.
But there’s a deeper, more philosophical shift at play. AI-based forecasts are emerging as a new kind of partner in weather prediction—one that learns from history, recognizes emergent patterns, and can scale its insights across vast regions at relatively modest computational cost. This is particularly valuable for heat waves, which are driven by a tapestry of drivers—from large-scale circulation patterns to local soil moisture feedbacks and urban heat islands. AIWP models have the potential to pick up those large-scale precursors and long-range tendencies in ways that complement physics-based models, which excel at capturing detailed dynamics but require intense computing.
That said, the study is careful not to overstate the case. The authors acknowledge persistent biases, especially the pre- and during-heat-wave cold biases in several seasons and regions, and they point to terrain and land-surface processes as possible culprits. The butterfly effect—how tiny differences in initial conditions can cascade into very different outcomes—remains a reality that AIWP models must respect. The researchers emphasize rigorous verification, cross-model ensembles, and ongoing comparison with traditional systems to avoid overconfidence. In practice, this means AIWP forecasts will likely live alongside, rather than replace, conventional forecasts, offering a complementary stream of information that forecasters can weigh.
One of the most compelling takeaways is the sense that these AI models are learning to forecast a difficult foe: the subseasonal evolution of extreme heat. If technology can increasingly offer reliable 10- to 20-day predictions, city planners can begin to stage interventions earlier, energy systems can diversify demand planning, and vulnerable communities can be better protected through targeted cooling access and public health messaging. The study’s careful, region-by-region, season-by-season analysis helps demystify where AIWP holds the most promise and where caution remains warranted.
From a scientific perspective, the CSU work also maps a path forward. The authors highlight several avenues for improvement: refining how AI models represent terrain and land-atmosphere coupling; experimenting with higher-resolution inputs to better resolve localized heat features; and integrating more diverse data streams, such as soil moisture and surface energy fluxes. They also underscore the importance of robust verification programs for AI-based forecasts, analogous to how meteorologists continually test physics-based tools against real-world events. This is not a finish line but a milestone on a longer road toward more reliable, interpretable forecasts that can adapt to a changing climate.
Finally, the human element remains essential. The CSU study is as much about building trust as it is about improving metrics. Forecasters, emergency managers, and the public need transparent information about when an AI forecast is likely to be reliable and when it should be treated with extra caution. The researchers’ clear, data-driven verdict—that GraphCast shows real promise for medium-range heat-wave prediction, with Pangu offering lessons in model diversity—gives practitioners a concrete starting point for experiments in operational settings. In this sense, AIWP is not a rival to human expertise but a tool to amplify it.
In a world where heat waves are not just more intense but also more unpredictable, every extra day of warning matters. The CSU study doesn’t promise a world where heat waves are perfectly forecast, but it sketches a future where AI-assisted forecasts could give communities the foresight they need to stay safer, cooler, and more prepared. If we think of forecast skill as weather’s “story arc,” GraphCast is helping us read the middle chapters with a sharper, more confident voice—and that could make all the difference when the next big heat wave arrives.
University behind the work: Colorado State University, Department of Atmospheric Science. Lead author: Kelsey E. Ennis (corresponding author). The research evaluated two AIWP models—GraphCast and Pangu-Weather—against NOAA’s GEFS, across 60 heat-wave events in the continental United States, using ERA5 as truth data and focusing on lead times up to 20 days.
In the end, the study offers a measured forecast about the near future of weather prediction: AI-based models will likely become valuable partners for forecasting extreme heat, particularly at medium to longer leads, but they will need careful integration with traditional models, ongoing validation, and a clear understanding of where they shine best. The heat is on in more ways than one—both in the climate and in the science striving to predict it—and the conversation around AI in weather forecasting is just getting started.