Insurance pricing is an area that is full of potential use cases for large language models (LLMs). AI tools will change how we design, build and validate pricing processes.
One area of particular interest is the application of LLMs in feature engineering. It’s a relatively nascent space despite the low barrier to entry.
Feature engineering, in its common form, is simply adding columns (features) to a dataset to further describe each row (observation). If useful, the features can be incorporated into the pricing model, improving performance. These new features can be derived from existing features, from new information pulled from some external data source, or some combination of the two. LLMs give us many new ways of generating such features, both by incorporating information held within the parameters of the LLM itself and by leveraging unstructured data from external sources.
Given the risk of LLM hallucination, it may initially sound surprising to pursue incorporating LLM outputs directly in modelling. However, the natural validation processes built into model fitting offer significant protection from low-quality outputs.
By offering additional data and improved predictive power to insurance pricing, LLMs can shift risk away from data scarcity and anti-selection towards governance, bias and operational risk.
In this section, I walk through four different ways LLMs can be used to generate new features for an insurance pricing model.
What can an LLM see that a traditional pricing model can't? A property image like this one contains a wealth of observable risk signals, from roof condition to what's parked in the driveway. Not only could we estimate the number of bedrooms from such an image, we could also ask about many other risk-relevant attributes.
To be clear, I am not advocating the use of Google Street View for such purposes, but conceptually it represents an important advance in drawing on new data types.
LLM features can be used across any form of modelling work. In insurance pricing, this usually spans customer behaviour models (conversion and retention) and loss cost models (frequency, severity, burn cost). Incorporating LLMs still takes design and judgment; a natural first step is to brainstorm the true underlying risk factors. In car insurance, for example, true risk factors are things like 'driving ability', 'propensity to take risks' and 'likelihood to speed'. These will help guide the prompting and ensure there are intuitive connections between new LLM features and our knowledge of risk.
So how do these ideas translate into something you’d be able to incorporate into a pricing model?
To create a simple LLM feature, you give the levels of a factor to an LLM and ask a pointed question, ideally with the desired response levels prescribed in your prompt.
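As a minimal sketch of what such a pointed prompt might look like, the snippet below builds one from a list of factor levels and a prescribed set of response levels. The car models, question and allowed responses are illustrative assumptions, and the prompt would be sent to whichever LLM tool or API you use.

```python
# Sketch: building a pointed prompt for LLM feature generation.
# Factor levels, question and allowed responses are all illustrative.

def build_feature_prompt(factor_levels, question, allowed_responses):
    """Ask one pointed question per factor level, prescribing the answers."""
    lines = [
        question,
        f"Answer with exactly one of: {', '.join(allowed_responses)}.",
        "Return one line per item in the format '<item>: <answer>'.",
        "Items:",
    ]
    lines += [f"- {level}" for level in factor_levels]
    return "\n".join(lines)

prompt = build_feature_prompt(
    factor_levels=["Ford Fiesta", "BMW M3", "Volvo V60"],
    question="Rate the insurance risk of each car model.",
    allowed_responses=["Low", "Medium", "High"],
)
print(prompt)
```

Prescribing the response format up front makes the output far easier to parse into a mapping table, and reduces (though does not eliminate) off-schema answers.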
As a small example of feature engineering, see the table of car models below. I asked Co-Pilot to attribute a Risk Score (Low to High), a Boy Racer Likelihood (Very Low to Very High) and a Coolness score (1–10).
I now have, in effect, a mapping table for these LLM-derived features, which I can join onto my modelling data. Note the 'Medium-High' level, which was not an option I gave to the LLM, so some of my own cleaning (or better prompting) is likely beneficial here.
In most cases, you should be creating some form of mapping table. An API is useful where you have thousands of data points, but you should still use the results to build a static mapping table, as this will be far more cost-effective than repeatedly querying the LLM for new observations.
There are some cases where this is not practical, such as where a high proportion of levels are unseen (e.g. address) in a production environment. The impacts on algorithm speed and cost need to be considered here and have the potential to offset any gains in pure predictiveness.
There is a high probability that the LLM is wrong, and features are 'incorrect' in the narrow sense. You can, however, validate these new features’ predictiveness using traditional statistical methods, just as you would validate incorporating any new feature into a model. If the new feature does not improve performance, then it can be deprecated from the model through standard processes.
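As a toy illustration of that validation step, the sketch below checks whether segmenting claim frequency by an LLM-derived risk score reduces holdout error versus a single overall mean. The data is entirely made up for illustration; in practice you would use your standard model-comparison workflow.

```python
# Sketch: validating an LLM-derived feature like any other candidate.
# Toy data: (llm_risk_score, observed claim frequency) per policy.
from statistics import mean

train = [("High", 0.30), ("High", 0.25), ("Low", 0.05), ("Low", 0.10)]
holdout = [("High", 0.28), ("Low", 0.07)]

overall = mean(freq for _, freq in train)          # model without the feature
by_level = {lvl: mean(f for l, f in train if l == lvl)
            for lvl in {l for l, _ in train}}       # model with the feature

mae_without = mean(abs(freq - overall) for _, freq in holdout)
mae_with = mean(abs(freq - by_level[lvl]) for lvl, freq in holdout)

print(f"MAE without feature: {mae_without:.3f}")
print(f"MAE with feature:    {mae_with:.3f}")
```

If the feature does not improve the holdout metric, it is dropped, exactly as with any other candidate feature; whether the LLM's labels are 'factually' right never needs to be settled directly.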
With this in mind, you do not need to spend too much time 'fact-checking' the features themselves. One key thing to check is whether the LLM has abided by your prompt restrictions (see the table above, where the invented category 'Medium-High' reduces the ability to model). We do recognise that even if a new feature is beneficial overall, it can still misclassify an individual row, meaning individual policy prices could become more volatile.
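That schema check is easy to automate. The sketch below, with illustrative data, splits an LLM-derived mapping into entries that match the prescribed levels and off-schema entries (like 'Medium-High') that need cleaning or re-prompting.

```python
# Sketch: checking that LLM outputs abide by the prescribed response levels.
# The allowed set and the sample output are illustrative.

ALLOWED = {"Low", "Medium", "High"}

def validate_levels(mapping, allowed=ALLOWED):
    """Split an LLM-derived mapping into on-schema and off-schema entries."""
    valid = {k: v for k, v in mapping.items() if v in allowed}
    invalid = {k: v for k, v in mapping.items() if v not in allowed}
    return valid, invalid

llm_output = {"Ford Fiesta": "Low", "BMW M3": "High",
              "Audi RS6": "Medium-High"}  # off-schema answer
valid, invalid = validate_levels(llm_output)
print("Needs cleaning or re-prompting:", invalid)
```

Running this check before the mapping table touches any modelling data catches schema drift early, when the fix is a cheap re-prompt rather than a model rebuild.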
While finding a predictive feature may prove relatively straightforward, you need to be extremely careful to avoid discrimination when taking this approach. For example, letting an LLM derive features based on someone's name or other personal information is a recipe for disaster.
Considerations of fairness are further complicated by biases inherent in an LLM. Take our example of car models again and consider generating features that depend on stereotypes (e.g. perceived propensity to speed). One might be tempted to create a feature inferring the 'type of person likely to drive such a car', or to capture a true risk factor like 'propensity to speed' or simply 'driving ability'. It is to be expected that unhelpful stereotypes, or known differences between protected classes, have significantly biased any LLM we use to derive a feature.
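One simple diagnostic for such investigation, sketched below with entirely made-up data, is to compare how often an LLM-derived feature assigns its adverse level across a protected class (assuming appropriately governed access to protected-attribute data for testing). Large gaps suggest the feature may be acting as a proxy and warrant deeper scrutiny; this is a starting point, not a sufficient fairness test.

```python
# Sketch: a simple proxy-bias diagnostic for an LLM-derived feature.
# Toy data: (protected_group, llm_risk_score) per policy - illustrative only.
from collections import defaultdict

rows = [
    ("A", "High"), ("A", "High"), ("A", "Low"),
    ("B", "Low"), ("B", "Low"), ("B", "High"),
]

counts = defaultdict(lambda: {"high": 0, "total": 0})
for group, score in rows:
    counts[group]["total"] += 1
    counts[group]["high"] += (score == "High")

rates = {g: c["high"] / c["total"] for g, c in counts.items()}
print(rates)  # markedly different rates across groups merit investigation
```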
Thorough consideration and investigation of any LLM-derived factors is imperative when determining the ethical and legal implications of using such factors in rating.
These types of advances will have important long-term effects on pricing.
The opportunities flagged in this article suggest that the unstoppable trend of increased pricing sophistication will continue. Pricing segmentation and accuracy will improve and insurance for some segments of the population will become less affordable. Calls for further insurance pricing regulation to address the affordability of higher-risk policies will likely grow louder.
The reliance on bespoke data providers will potentially reduce as internal teams, with the aid of their LLM sidekick, become empowered to access a new wealth of insights previously unavailable.
The collision of LLM outputs with fair pricing considerations and indirect discrimination will become a serious operational risk. The opaqueness of LLM-derived features and their complex inherent biases compound such risks.
And, I hope, GLMs have been extended a lifeline, at least for now.