Our models are often built on the high-quality data that we can see, but ignore valuable information that falls outside our usual data collection. Hugh unpacks this idea in his semi-regular Normal Deviance column.
Recently I was invited to a meeting of the Collaborative Partnership, which aims to improve work participation for people with physical or mental health conditions by taking a cross-system view. There are some good resources there if you are interested in understanding the ways that different supports are given (through workers' compensation, insurance and welfare).
More broadly, the experience reminded me that analytics work is often very good at optimising the 'known universe' and often incredibly poor at providing insight outside that universe. For example, in workers' compensation, a large amount of detailed modelling has been done to understand how injured workers move through the compensation scheme, but there is much less known about what happens once their compensation payments stop. Do they return to work? Or drop out of the workforce and not return? Do they have extended spells on welfare benefits? Or rely on insurance payouts?
This type of knowledge gap occurs in all sorts of analytics contexts. For instance:
Such myopia creates an obvious opportunity to investigate the 'known unknowns'. While these are, by definition, harder to get right, there are some useful ideas that can help.
Often there are ways to attack the bigger question, for example through using targeted surveys or conducting other research. Such evidence will not be as cutting edge as detailed modelling of the 'known' systems, but will enable broader and better questions to be answered. This type of thinking may affect how analytics projects are prioritised.
In government, increasing use of data linkage is improving our understanding of how people move between systems, or potentially fall through the cracks. For corporates, data partnerships can address similar cross-system questions, provided appropriate privacy safeguards and consumer communication are in place. Vertical integration can also give companies a broader view of the customer.
In the banking example given above, the risk of a customer leaving is significantly higher if they have accounts with other providers. A churn model that does not attempt to explore this risk is missing a trick.
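One way to sketch this idea, under loudly stated assumptions, is to treat 'has accounts elsewhere' as an unobserved status, update its probability from a signal the bank can see, and then average the churn rate over that status. Everything below is hypothetical and for illustration only: the function names, the choice of signal (say, regular transfers out to another institution) and every number are assumptions rather than figures from any real portfolio.

```python
def posterior_external(prior: float, p_signal_if_external: float,
                       p_signal_if_not: float) -> float:
    """Bayes' rule: P(customer has external accounts | signal observed)."""
    numerator = p_signal_if_external * prior
    evidence = numerator + p_signal_if_not * (1.0 - prior)
    return numerator / evidence


def expected_churn(p_external: float, churn_if_external: float,
                   churn_if_not: float) -> float:
    """Churn probability averaged over the unknown external-account status."""
    return p_external * churn_if_external + (1.0 - p_external) * churn_if_not


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
prior = 0.30                 # assumed share of customers with accounts elsewhere
p_signal_if_external = 0.60  # assumed chance of seeing the signal if they do
p_signal_if_not = 0.10       # assumed chance of seeing the signal if they do not

p_ext = posterior_external(prior, p_signal_if_external, p_signal_if_not)
churn = expected_churn(p_ext, churn_if_external=0.20, churn_if_not=0.05)
print(f"P(external accounts | signal) = {p_ext:.3f}, expected churn = {churn:.3f}")
```

The point of the sketch is not the particular numbers but the structure: a model that ignores the external status implicitly assumes every customer sits at a single blended churn rate, whereas even a rough estimate of the hidden status separates higher-risk customers from lower-risk ones.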
While the true status of a customer outside known datasets may be unknowable, it can often be inferred probabilistically from what is known. In such cases, it's possible to talk meaningfully about the entire population despite the narrower scope of a dataset. For example, the lives covered by group life insurance are still a subgroup of the general population, so their mortality rates must be consistent with population-wide mortality, which is well understood.
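The subgroup argument can be made concrete with the law of total probability: if a share w of the population is inside the dataset, then the population rate equals w times the observed rate plus (1 - w) times the rate for everyone else. The minimal sketch below rearranges that identity to back out the rate implied for the people the dataset cannot see. The function name and all inputs are hypothetical, and the figures are not real mortality data; in practice such a calculation would normally be done separately by age and sex rather than on one aggregate rate.

```python
def implied_outside_rate(q_population: float, q_observed: float,
                         coverage: float) -> float:
    """Rate implied for the part of the population outside the dataset.

    Rearranges: q_population = coverage * q_observed
                               + (1 - coverage) * q_outside
    """
    if not 0.0 <= coverage < 1.0:
        raise ValueError("coverage must be in [0, 1)")
    q_outside = (q_population - coverage * q_observed) / (1.0 - coverage)
    if not 0.0 <= q_outside <= 1.0:
        raise ValueError("inputs are inconsistent: implied rate is outside [0, 1]")
    return q_outside


# Hypothetical inputs, for illustration only.
q_pop = 0.0030    # assumed population-wide annual mortality rate
q_group = 0.0020  # assumed rate observed in the insured group
share = 0.40      # assumed share of the population that is insured

q_rest = implied_outside_rate(q_pop, q_group, share)
print(f"Implied rate outside the dataset: {q_rest:.5f}")
```

The consistency check inside the function reflects the same idea: if the observed subgroup rate and coverage imply an impossible rate for the remainder, at least one of the inputs must be wrong.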
We may never completely solve cross-system gaps. However, by thinking through how pieces fit together, we can reduce the risk that a model carefully built on main company databases becomes quickly outdated as the external environment evolves.