This case was developed for an insurance company. As part of its efforts to improve customer relationship management, the insurance company investigated opportunities for direct marketing. A more difficult task was to predict churn, understood here as a customer buying back his life insurance. The particular challenge was the handling of time-stamped data. The key idea for solving the discovery task was to create TF/IDF features for changes of policies.
The data for this case was provided by an insurance company. The database excerpt consists of 12 tables with 15 relations between them. The tables contain information about 217,586 policies and 163,745 customers. The contracts belong to five kinds of insurance: life insurance, pension insurance, health insurance, incapacitation insurance, and funds bounded insurance. Every policy consists of 2 components on average. The table of policies has 23 columns and 1,469,978 rows. The table of components has 31 columns and 2,194,825 rows. Three additional tables store details of the components. The policies and components tables are linked indirectly by an additional table. If all records referring to the same policy and component (but with a different status at different times) are counted as one, there are 533,175 components described in the database. Each policy and each component may be changed throughout the period of a policy. For every change of a policy or a component, there is an entry in the policy table with the new values of the features, a unique mutation number, a code representing the reason for the change, and the date of the change. This means that several rows of the policy table describe the history of the same policy. Each policy is changed 6 times on average, each component 4 times on average. Based on the first approaches and on background knowledge, the policy table could be selected as the basis for data mining.
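The history structure described above, with several rows per policy ordered by mutation number, can be illustrated with a minimal sketch. The column layout `(policy_id, mutation_no, reason_code, change_date)`, the sample values, and the function name `group_history` are assumptions made for illustration; they are not the schema of the original database. The sketch also assumes that the first row of a policy is its initial state, so that the number of changes is the number of rows minus one.

```python
from collections import defaultdict

# Hypothetical excerpt of the policy table: one row per state of a policy.
# Layout (assumed): (policy_id, mutation_no, reason_code, change_date)
rows = [
    ("P1", 2, "TARIFF", "1999-03-01"),
    ("P1", 1, "NEW", "1998-01-01"),
    ("P2", 1, "NEW", "1998-06-15"),
    ("P1", 3, "ADDRESS", "2000-07-20"),
]

def group_history(rows):
    """Collect all rows of the same policy and order them by mutation number."""
    histories = defaultdict(list)
    for row in rows:
        histories[row[0]].append(row)
    for versions in histories.values():
        versions.sort(key=lambda r: r[1])  # mutation_no gives the temporal order
    return dict(histories)

histories = group_history(rows)

# Assumption: the first row is the initial state, every later row is one change.
changes_per_policy = {pid: len(v) - 1 for pid, v in histories.items()}
print(changes_per_policy)
```

Grouping by policy first keeps the time-stamped rows of one contract together, which is the precondition for any per-policy frequency feature.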
The goal was to predict churn, i.e. to predict the termination of a policy before its end date. If those groups of customers or policies can be detected where the risk of churn is high, particular marketing actions can be pursued in order to keep the customers. Given this goal, it is not sufficient to model the distribution of churn or its overall likelihood; the policies or customers at risk must actually be identified, so that the insurance salesmen can contact them. In addition, if the insurance company has to re-buy a contract, it needs estimates of the re-buy costs in order to plan its portfolio. The primary challenge lies in mapping the raw data into a feature space that allows an algorithm to learn. The feature space should be smaller than the original space, but should still preserve the distinctions between the two classes, early termination of the contract and continuation.
The key idea was to generate new features for the changes in contracts by analogy with term frequency and inverse document frequency (TF/IDF). It is plausible that very frequent changes of a policy are an effect of the customer not being satisfied with the contract. The chosen excerpt from the raw data (about policies) was therefore transformed into a frequency representation in order to condense the data space in an appropriate way. However, the frequencies of those changes that are common to all contracts were excluded, as they are likely to be related to a common cause, such as a change of law. A measure from information retrieval formulates exactly this: term frequency and inverse document frequency (TF/IDF). Term frequency here describes how often a particular attribute a_i of c_j, the contract or one of its components, has been changed within a policy.
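The following is a minimal sketch of such change-based TF/IDF features, not the implementation used in the case. The text defines only the term frequency. The inverse document frequency is assumed here to be the standard form from information retrieval, log(|C| / df(a_i)), where |C| is the number of contracts and df(a_i) is the number of contracts in which attribute a_i was changed at least once. The function name `change_tfidf`, the attribute names, and the sample values are hypothetical.

```python
import math
from collections import defaultdict

def change_tfidf(history):
    """history maps a policy id to its attribute dicts, ordered by time.

    Returns, per policy, a weight tf * log(|C| / df) for each changed attribute.
    """
    tf = {}
    df = defaultdict(int)
    for policy_id, versions in history.items():
        counts = defaultdict(int)
        # Compare each state with its successor and count changed attributes.
        for prev, curr in zip(versions, versions[1:]):
            for attr, value in curr.items():
                if prev.get(attr) != value:
                    counts[attr] += 1
        tf[policy_id] = dict(counts)
        for attr in counts:
            df[attr] += 1  # attribute changed at least once in this policy
    n = len(history)
    return {
        pid: {attr: c * math.log(n / df[attr]) for attr, c in counts.items()}
        for pid, counts in tf.items()
    }

# Hypothetical data: "tariff" changes once in every policy (a common cause),
# "premium" changes twice, but only in policy A.
history = {
    "A": [{"tariff": 1, "premium": 100},
          {"tariff": 2, "premium": 120},
          {"tariff": 2, "premium": 150}],
    "B": [{"tariff": 1, "premium": 80}, {"tariff": 2, "premium": 80}],
    "C": [{"tariff": 1, "premium": 90}, {"tariff": 2, "premium": 90}],
}

weights = change_tfidf(history)
print(weights)
```

In this toy data the tariff change receives weight zero because it occurs in every contract, which mirrors the exclusion of changes with a common cause, while the premium changes specific to policy A keep a positive weight.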
Hanna Koepcke, University of Dortmund, Computer Science VIII; Email: firstname.lastname@example.org; Web: www-ai.cs.uni-dortmund.de