Dr. Roni Ramon-Gonen
Combining technical indices and content-based indices in cluster analysis
Customer loyalty is critical to the existence of a business, and achieving it is a complex task. To accomplish this, the organization must know its customers, understand their needs, recognize their characteristics, and understand how to act in order to satisfy those needs and raise customer satisfaction and loyalty.
Because customers differ from one another and have diverse needs, the organization cannot treat every customer in the same way. A marketing action that affects one customer will not necessarily affect another.
The assumption is that among the customers there are subgroups with homogeneous needs, so it is necessary to segment the customers and create a profile for each segment. The level of homogeneity among the customers in each segment should be high enough to enable unified marketing methods within the segment. Each target segment must also be large enough to allow profitable operations.
Segmenting customers into groups that clearly reflect business entities from the perspective of the business is not an easy task, and there are many ways to perform it. The customer segmentation process, also called cluster analysis, begins with deciding which attributes in the data set to use for segmenting the customers, continues with choosing an appropriate segmentation method, then with finding a good method to assess the quality of the resulting segmentation, and ends with finding a method to interpret the results.
Each of these stages has its own limitations:
- When performing cluster analysis we do not know in advance how to divide the customers into groups or which category each customer belongs to. Cluster analysis in which the category of each customer is not known in advance is a form of unsupervised learning. Some clustering algorithms recommend a number of clusters, but each algorithm is different and reaches different results.
- There are many algorithms that perform customer segmentation. Each segmentation algorithm leads to different results; even the same algorithm, run with different parameters or with the input data in a different order, can lead to different results, which makes it difficult to select a segmentation method.
- There is a large number of evaluation indices that try to evaluate segmentation quality; each index uses a different evaluation method and gives different results. For the same algorithm, one index may determine that a partition into ten clusters is optimal, while another determines that a division into six clusters is optimal. These evaluation measures are technical: most of them are based on measuring the similarity between points within a cluster compared with the similarities or differences between clusters. Different indices specify different similarity criteria.
- The technical results provided by the standard evaluation indices do not provide the insights that interest the content expert, who is interested in the meaning of the clusters, namely in characterizing the groups according to properties that are significant in his or her domain.
- There are several tools that help analyze the clustering results and the content of each cluster. These tools show the characteristics of each cluster, which features are most prominent in it, and what distinguishes it from other clusters. Each tool presents its interpretation differently, through different reports and different concepts. There are no standards or agreed terms, so different tools yield different interpretations.
Technical indices evaluate the result of a classification or division according to a "technical" target function. For example, in cases where the classification is known, "technical" target functions can be based on the components of the confusion matrix, such as Precision, Recall and others; and in cases where no prior classification is known, "technical" target functions can be based on measures such as Within Class Distance, Between Class Distance and others.
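To make the unsupervised case concrete, the following is a minimal sketch of one possible within-class and between-class distance, assuming Euclidean distance and centroid-based definitions. These definitions, the function names and the toy data are assumptions chosen for illustration; the text does not specify which variants of these measures were used.

```python
from math import dist
from statistics import fmean


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(fmean(axis) for axis in zip(*points))


def within_class_distance(clusters):
    """Mean Euclidean distance from each point to its own cluster centroid."""
    distances = []
    for cluster in clusters:
        center = centroid(cluster)
        distances.extend(dist(point, center) for point in cluster)
    return fmean(distances)


def between_class_distance(clusters):
    """Mean Euclidean distance over all pairs of cluster centroids."""
    centers = [centroid(cluster) for cluster in clusters]
    pair_distances = [
        dist(a, b)
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    ]
    return fmean(pair_distances)


# Hypothetical toy data: two tight, well-separated clusters.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
within = within_class_distance(clusters)
between = between_class_distance(clusters)
print(within, between)
```

A small within-class distance together with a large between-class distance indicates compact, well-separated clusters; many technical indices combine the two quantities in different ways, which is one reason they can disagree.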
These technical measures address neither the nature of the variables nor the distribution of the variables' values; they address only those "technical" functions. The content-based measures discussed in this study, on the other hand, are based on an analysis of the values (content) of the variables.
The content-based index in this paper is based on the salient values of the attributes, a phenomenon that this work calls the Total Saliency Factor (TSF), and on a probability function (reflecting the saliency) taken from the idea of the Bounded Rationality Concept of Kahneman and Tversky (1979). A content-based index helps assess the quality of a segmentation based on the attribute values obtained in each cluster compared with the attribute values of the entire population. It aims to examine whether the cluster contains salient features that distinguish it from the total population.
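The idea of comparing attribute values in a cluster with those in the whole population can be illustrated with a very simple sketch. It is not the TSF formula, which is not spelled out here; it only computes, for each value of one categorical attribute, the gap between its share in the cluster and its share in the population. The attribute, its values and the function names are hypothetical.

```python
from collections import Counter


def value_shares(values):
    """Relative frequency of each distinct value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def share_gaps(cluster_values, population_values):
    """Cluster share minus population share for each attribute value.

    Illustrative only: this is NOT the TSF, just a simple way to see
    which values are over- or under-represented in a cluster.
    """
    population_share = value_shares(population_values)
    cluster_share = value_shares(cluster_values)
    return {
        value: cluster_share.get(value, 0.0) - share
        for value, share in population_share.items()
    }


# Hypothetical attribute "payment method" for 10 customers,
# 4 of whom were assigned to the cluster under inspection.
population = ["card"] * 5 + ["cash"] * 5
cluster = ["card"] * 4
gaps = share_gaps(cluster, population)
print(gaps)
```

A gap far from zero marks a value that is unusually common or unusually rare in the cluster, which is the kind of feature a content expert would look at when giving the cluster a meaning.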
In this work we combined technical indices with content-based indices in order to select, out of a large number of divisions, a limited number of segmentations that are useful to us. Our goal was to reach a limited number of divisions that can be examined individually, and finally to decide which division is optimal for us as a business.
In this work we used a number of segmentation algorithms, examined different evaluation indices for assessing segmentation quality, and compared several methods of interpreting the segmentation results.
In this study we examined data containing customer transactions during the years Y to Y+4. From these data we tried to determine the optimal way to divide customers into groups according to their behavior in the context of their transactions. Our goal was to find a way to divide the population in a manner that reflects the customer loyalty aspect.
Given that there are many tools on the market for cluster analysis, many algorithms for data segmentation, and various methods for evaluating and interpreting the results, we examined a number of tools and a number of algorithms and compared the results. Each tool provides its own evaluation index, so we decided to evaluate the quality of the divisions by combining the content-based evaluation index TSF (Total Saliency Factor) with the technical index NPV (Negative Predictive Value). Our source data do not contain an attribute that represents customer loyalty. Therefore, as stated, we used cluster analysis techniques to segment the data so as to reflect this feature as much as possible. To do this, the data were divided into two groups: data from years Y to Y+3, and data from year Y+4. The data from years Y to Y+3 were examined in 35 different divisions, with between 9 and 15 clusters, using several algorithms. We then evaluated our results against the "future" data from year Y+4, using a combination of the technical index NPV (based on the elements of the confusion matrix) and the content-based index TSF.
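For reference, NPV is the share of negative predictions that turn out to be correct, NPV = TN / (TN + FN). The sketch below computes it from two lists of booleans. How a segmentation built on years Y to Y+3 is turned into a positive or negative prediction, and how the year Y+4 data define the actual outcome, is not described here, so the labels in the example are purely hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for two equal-length sequences of booleans."""
    tp = fp = tn = fn = 0
    for is_positive, said_positive in zip(actual, predicted):
        if said_positive and is_positive:
            tp += 1
        elif said_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def negative_predictive_value(tn, fn):
    """NPV = TN / (TN + FN): share of negative predictions that were correct."""
    if tn + fn == 0:
        raise ValueError("no negative predictions, NPV is undefined")
    return tn / (tn + fn)


# Hypothetical labels for 7 customers.
# predicted: outcome implied by the segmentation (assumed labeling rule).
# actual: outcome observed in the later, held-out year (assumed definition).
actual = [True, True, False, False, False, False, True]
predicted = [True, False, False, False, False, True, True]

tp, fp, tn, fn = confusion_counts(actual, predicted)
npv = negative_predictive_value(tn, fn)
print(tp, fp, tn, fn, npv)
```

Precision, TP / (TP + FP), and Recall, TP / (TP + FN), can be read off the same four counts.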
The study shows that by using content-based indices we can filter the range of results and converge on a limited number of them. The best result was obtained by the MAV algorithm, which builds segments by comparing the divisions produced by different algorithms, followed by the result of the Weka simpleKMeans algorithm with 14 clusters.
We examined the considerations a content expert applies when choosing a segmentation, and saw that the content expert's choice combines good results on all the indices examined.
We examined the issue of understanding and interpreting the division results and giving meaning to a cluster, and saw that there are different ways to define the characteristics of a cluster and to determine its salient features.
We saw that at this stage there are differences between the tools and examined the advantages and disadvantages of each tool.