Column
Data Science and AI

Beyond accuracy: what responsible AI looks like in catastrophe modelling



AI is reshaping catastrophe modelling. In this Responsible AI column, Dr Fei Huang explains why accuracy alone is not enough.

At this year's Catastrophe and Reinsurance Symposium (CARS) 2026, I joined Hannah Stringfellow and Dr Jordan Brook for a panel discussion, moderated by Dr Phil Conway, on catastrophe (cat) modelling: where it has come from, where AI is taking it and what that means in practice. One question raised on the panel was: what do we need to worry about even when a model performs well? This column reflects on that question in greater depth.

How is AI changing catastrophe modelling?

For decades, cat models have been built around a chain of physical reasoning: hazard, exposure, vulnerability and financial loss modules. Each link in that chain can be traced back to established science and engineering, and to experts who can defend or contest it. The process is therefore complex but traceable.

AI-empowered cat models bring something different, with AI playing a range of roles depending on the application. For emerging events such as a developing cyclone, AI weather models like Google DeepMind's GraphCast and GenCast, and ECMWF's AIFS, have demonstrated superior skill on a range of standard verification metrics compared with traditional physics-based systems, and do so in minutes rather than hours.¹ For natural perils insurance pricing, AI can contribute in multiple ways, for example:

  • As a synthetic event generator capable of producing large ensembles of realistic extreme events consistent with historical climatology, at a much lower computational cost than traditional methods
  • As an emulator of the physics-based components within cat models, replicating their outputs at a greatly reduced runtime (a minimal sketch appears below)
  • As a downscaling tool, improving the spatial resolution of model inputs and outputs to levels more useful for pricing

At the property level, AI can process satellite and aerial imagery at scale to assess individual building attributes (construction material, vegetation proximity, roof condition) that human underwriters could never evaluate across a portfolio of millions.² Post-event, AI image analysis can distinguish total losses from partial damage within hours of a disaster, enabling faster claims response and more accurate early loss estimates.³ These are real advances in speed, granularity and pattern-recognition capacity.
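
To make the emulator role concrete, here is a minimal sketch in Python, assuming a toy stand-in for an expensive physics-based hazard component: a gradient-boosted model is trained on the component's input-output pairs so it can be queried at a fraction of the runtime. The hazard function, feature names and units are illustrative assumptions, not part of any real cat model.

```python
# Minimal emulator sketch: learn to reproduce the input-output behaviour
# of a (hypothetical) physics-based hazard component. All names, units
# and the toy "physics" below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def physics_hazard_model(X):
    """Stand-in for an expensive physics-based component
    (inputs: central pressure, forward speed, distance to coast)."""
    pressure, speed, distance = X[:, 0], X[:, 1], X[:, 2]
    return (1013.0 - pressure) * 0.8 + speed * 0.3 - distance * 0.05

# Generate training pairs by running the "physics" model offline once.
X = np.column_stack([
    rng.uniform(900, 1010, 5000),   # central pressure (hPa)
    rng.uniform(5, 40, 5000),       # forward speed (km/h)
    rng.uniform(0, 200, 5000),      # distance to coast (km)
])
y = physics_hazard_model(X) + rng.normal(0, 1.0, 5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
emulator = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# The emulator is only trustworthy inside the envelope of its training
# inputs; that is exactly the stability concern discussed later on.
print(f"Held-out MAE: {mean_absolute_error(y_te, emulator.predict(X_te)):.2f}")
```

In production, the training pairs would come from the actual physics-based module, and the emulator's validity envelope would need explicit monitoring rather than a single held-out metric.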

The industry is right to take these advances seriously and benefit from them. But insurance and financial services are built on trust. That trust depends on being honest about what our models can and cannot do, and on assuring customers that decisions affecting them are fair, explainable and robust. Providing that assurance comes with costs.

There is no free lunch. What are the risks?

It is worth examining what AI models give up, alongside the performance and efficiency gains.

Let's start with transparency and explainability. Traditional cat models are, at least in principle, inspectable. You can open the hazard module, examine its assumptions and challenge them on scientific grounds. AI models, however, learn statistical associations from data rather than representing explicit physical relationships in the way physics-based models do, and the patterns they find do not always translate back into interpretable physical accounts.

The operational consequence of lost explainability is one I raised on the panel. Consider an AI vulnerability model estimating damage from a cyclone or flood event. It may appear to predict losses accurately in aggregate, but could be learning correlations between past claims patterns and postcode-level characteristics rather than genuine physical vulnerability. In catastrophe insurance pricing, the stakes are high. How confident are we that an AI pricing model is learning genuine physical risk rather than demographic proxies correlated with postcode? And can we be confident that a prediction does not hinge on local geographic factors absent from the training data? If we cannot answer these questions, we risk both breaching regulations and eroding the consumer trust the insurance industry depends on.

Bias is the second cost and it compounds with scale. Closely related is the question of fairness, meaning whether model outcomes are equitable across communities (conditional on legitimate risk factors) and free from discriminatory proxies. AI models trained on historical claims data inherit whatever patterns are embedded in that data, including patterns that may reflect past underinsurance, potential discriminatory access to coverage, or pricing that is associated with sensitive demographic characteristics. Better algorithms applied to biased data do not correct the bias. What makes this particularly hard to catch is that standard model validation focuses predominantly on aggregated metrics. A model that performs well on average can perform very poorly for specific communities and a portfolio-level metric will not show it.
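
Here is a minimal sketch of the disaggregated check this paragraph calls for, on entirely synthetic data: the same error metric is computed at portfolio level and per community, and a deliberately biased toy "model" shows how the aggregate number hides a badly served group. The regions and loss distribution are illustrative assumptions.

```python
# Disaggregated validation sketch: the same error metric, reported per
# community rather than only in aggregate. Data are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], n, p=[0.6, 0.3, 0.1]),
    "actual_loss": rng.gamma(2.0, 1000.0, n),
})
# Toy model: accurate for regions A and B, badly biased for small region C.
df["predicted_loss"] = df["actual_loss"] * np.where(
    df["region"] == "C", 1.4, rng.normal(1.0, 0.05, n)
)

err = (df["predicted_loss"] - df["actual_loss"]).abs() / df["actual_loss"]
print(f"Portfolio-level MAPE: {err.mean():.1%}")   # looks acceptable
print(err.groupby(df["region"]).mean().round(3))   # region C stands out
```

The portfolio-level figure looks tolerable because region C is only 10% of the book; the per-region breakdown makes the 40% error there impossible to miss.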

Consider a flood pricing model that incorporates variables which can act as demographic proxies; for instance, payment frequency can correlate with socioeconomic status. An AI model might associate such a proxy with elevated flood risk, inflating premiums in lower-income areas even where the physical flood exposure is modest. In climate disaster insurance, the policyholders most exposed to physical risk are often also the most economically vulnerable. We need to make sure that data bias does not translate into compound harms.
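
One way to probe for this kind of leakage, sketched below on synthetic data, is to test how well the pricing features alone can predict a sensitive attribute: if a simple classifier achieves an AUC well above 0.5, the feature set is encoding demographic information. The variable names (payment frequency, income indicator, elevation) are illustrative assumptions.

```python
# Proxy-leakage sketch: if pricing features predict a sensitive attribute
# well above chance, they are acting as proxies for it. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 5000
low_income = rng.binomial(1, 0.3, n)                   # sensitive attribute
pay_monthly = rng.binomial(1, 0.2 + 0.6 * low_income)  # payment-frequency proxy
elevation = rng.normal(10, 3, n)                       # legitimate risk factor

X = np.column_stack([pay_monthly, elevation])
auc = cross_val_score(LogisticRegression(), X, low_income,
                      scoring="roc_auc", cv=5).mean()
print(f"AUC predicting the sensitive attribute from pricing features: {auc:.2f}")
# An AUC well above 0.5 signals demographic leakage and warrants a
# fairness review before the features are used in pricing.
```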

The third cost is stability, which can fail in either direction: models can be too anchored to the past or too unreliable outside it. Cat models, built around structured physical and statistical components, are better equipped than purely data-driven models to handle scenarios outside the historical record. AI models trained on historical data face these challenges more acutely, with no structural anchor to physical reasoning. Recent peer-reviewed research has found that AI weather models tend to underestimate the intensity of record-breaking extremes and that their temperature predictions can reflect climate conditions from 15 to 20 years earlier than the period being forecast, a direct consequence of anchoring to training data rather than physical principles.⁴

In a targeted test, when Category 3 to 5 tropical cyclones were removed from a model's training set, it could not accurately forecast Category 5 storms.⁵ In a world where the climate is actively shifting the distribution of extreme events, a model that struggles with conditions outside its training window is likely to fail precisely when it is needed most. A related question concerns adjustability. If climate projections shift the expected frequency or intensity of events beyond the historical baseline, can the model be recalibrated to reflect that, or is its behaviour locked to the period it was trained on?
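
The cited experiment suggests a stress test any modelling team can run on its own pipeline: withhold the most intense events from training and measure how a purely data-driven model degrades on them. The sketch below does this on synthetic data; the intensity variable, loss function and thresholds are illustrative assumptions.

```python
# Out-of-distribution stress test sketch, in the spirit of the "gray
# swan" experiment cited above. Data and loss function are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
intensity = rng.gamma(3.0, 15.0, 8000)              # e.g. peak wind (m/s)
exposure = rng.uniform(0, 1, 8000)
loss = 0.002 * intensity**2 * (0.5 + exposure) + rng.normal(0, 0.5, 8000)
X = np.column_stack([intensity, exposure])

in_dist = intensity < np.quantile(intensity, 0.95)  # withhold the top 5%
model = RandomForestRegressor(random_state=0).fit(X[in_dist], loss[in_dist])

for name, mask in [("in-distribution (in-sample, for brevity)", in_dist),
                   ("record-breaking tail", ~in_dist)]:
    mae = mean_absolute_error(loss[mask], model.predict(X[mask]))
    print(f"{name}: MAE = {mae:.2f}")
# Tree-based models cannot extrapolate beyond their training range, so
# the tail error is far larger: the same failure mode, in miniature.
```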

What does responsible AI look like in catastrophe modelling?

None of this is an argument against AI in catastrophe modelling. The speed, the granularity and the pattern-recognition capacity are genuine strengths. But using them well is different from using them uncritically. Responsible AI is not a brake on innovation. It is what makes innovation sustainable.

What that looks like in practice starts with expanding model evaluation beyond the accuracy metric. Aggregate accuracy is necessary but not sufficient. A better validation framework asks whether the model performs consistently across different geographic communities and demographic groups, while acknowledging that for rare, high-severity events, this can be difficult and may require benchmarking against physics-based models rather than observed losses alone. It treats accuracy, transparency, stability, and fairness as distinct dimensions. Any one of them can fail while the others hold and each needs to be measured on its own terms. A model that performs well on accuracy but fails on stability is a liability in a changing climate. A model that performs well on accuracy but fails on fairness is a regulatory and ethical problem that can quickly become a reputational crisis.
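
As a minimal illustration of treating these as distinct dimensions, a validation scorecard might gate deployment on each one separately rather than blending them into a single number. The check names, thresholds and metric values below are illustrative placeholders, not an endorsed standard; the values stand in for outputs of checks like those sketched earlier.

```python
# Multi-dimensional validation scorecard sketch: accuracy, fairness and
# stability as separate pass/fail gates. All thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class Check:
    dimension: str
    metric: float
    threshold: float
    higher_is_better: bool = False

    def passes(self) -> bool:
        return (self.metric >= self.threshold if self.higher_is_better
                else self.metric <= self.threshold)

scorecard = [
    Check("accuracy (portfolio MAPE)", metric=0.08, threshold=0.10),
    Check("fairness (worst-group MAPE)", metric=0.40, threshold=0.15),
    Check("stability (tail-to-bulk MAE ratio)", metric=3.2, threshold=2.0),
]
for c in scorecard:
    print(f"{c.dimension:38s} {'PASS' if c.passes() else 'FAIL'}")
# Any single FAIL blocks deployment, even when aggregate accuracy passes.
```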

In addition, our models should do more than transfer risk. They should generate risk awareness and risk mitigation signals that incentivise better behaviour. A model that can credibly say "install fire-resistant materials and your premium falls" gives a policyholder a lever and connects pricing to behaviour in a way that builds community resilience. This is what explainability enables: insurance as a driver of resilience, not just a transfer of risk. Improving the explainability of AI models in these high-stakes domains is, in my view, one of the most important research directions the field can pursue.
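
A sketch of that lever in code: re-score the same property under a counterfactual where a resilience upgrade is applied, and report the premium change. The pricing function, feature names and discount factor are all hypothetical stand-ins for a trained model, not any insurer's actual method.

```python
# Counterfactual "what-if" sketch: the premium saving from a resilience
# upgrade. The pricing function and features are hypothetical.
import pandas as pd

def predict_premium(row: pd.Series) -> float:
    """Stand-in for a trained pricing model."""
    base = 800 + 40 * row["bushfire_risk_score"]
    if row["fire_resistant_roof"]:
        base *= 0.85          # assumed vulnerability reduction
    return base

home = pd.Series({"bushfire_risk_score": 7, "fire_resistant_roof": False})
upgraded = home.copy()
upgraded["fire_resistant_roof"] = True

saving = predict_premium(home) - predict_premium(upgraded)
print(f"Estimated annual saving from a fire-resistant roof: ${saving:.0f}")
```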

How is Australia leading on resilience-based insurance pricing?

In 2024, NRMA Insurance and Suncorp announced premium discounts for households that use the Bushfire Resilience Rating app, a federally funded tool that assesses individual homes against local bushfire risk. The higher the resilience rating a household achieves, the larger the discount. Over 19,000 households have used the app, and more than 6,600 have taken at least four recommended actions, investing an estimated $44 million in home improvements.⁶ We should do more in this space.

The insurance industry has better data, better methods and better tools than ever. To embrace AI's capabilities and enable genuine innovation and productivity gains, we need to expand our thinking and our toolkits beyond accuracy alone, towards metrics that reflect how models are used and what matters to stakeholders. For an industry built on trust, responsible AI adoption is the only way to innovate safely, protect customers and protect the brand.

So, when a model performs well in terms of accuracy, is that enough? No. Accuracy is where we start. Transparency, fairness, and stability are where we need to finish. Getting there is what allows AI to do its best work in catastrophe modelling. Not a blocker for innovation, but what makes innovation sustainable, trustworthy and safe.

References
  1. Lam et al., "Learning Skillful Medium-Range Global Weather Forecasting," Science, 382(6677), 2023.
  2. Naik, G., "Catastrophe Experts Tap AI to Tackle Soaring Insured Losses," Insurance Journal, March 26, 2025. https://www.insurancejournal.com/news/national/2025/03/26/817293.htm
  3. Moody's RMS, "Catastrophe Modeling for a Resilient Future — Powered by AI," February 2026.
  4. Landsberg, J.B. et al., "Forecasting the Future with Yesterday's Climate: Temperature Bias in AI Weather and Climate Models," Geophysical Research Letters, 2026. https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2025GL119740
  5. Sun, Y.Q., Hassanzadeh, P., Zand, M., Chattopadhyay, A., Weare, J., and Abbot, D.S., "Can AI Weather Models Predict Out-of-Distribution Gray Swan Tropical Cyclones?" Proceedings of the National Academy of Sciences, 122(21), 2025.
  6. Resilient Building Council, "Bushfire Resilience Rating App Delivers Insurance Savings," March 2024. https://rbcouncil.org/2024/03/21/government-funded-app-delivers-insurance-savings/
About the author
Dr Fei Huang
Dr Fei Huang is an Associate Professor of Risk and Actuarial Studies at UNSW Business School. Her research sits at the intersection of responsible AI, insurance, and data-driven decision-making, with a focus on fairness, sustainability, and accountability in insurance and retirement income systems. She works with industry and regulators on responsible AI in insurance, bridging research, policy and practice. For more information, visit www.feihuang.org.
