Customer Segmentation Report for Arvato Financial Services

May 11, 2020

Project Overview

Companies today can readily identify a specific set of customers within a market and work towards achieving their goals. Once customer segments have been created, they can be compared with the general population. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return.

In this project, I analyzed demographic data for customers of a mail-order sales company in Germany. Unsupervised learning techniques were used to reduce dimensionality and identify the demographic features that may contribute to online purchases. A supervised learning model, the Gradient Boosting Classifier, was then fitted to the training data. Parameters were tuned according to the accuracy score. The tuned model was then used to predict on the test data, and the final predictions were submitted to the Kaggle competition.

Problem statement

The problem is that Arvato needs to know which audiences are most likely to become customers in order to achieve its marketing goals. In this project, I therefore aim to predict which individuals are most likely to become customers. I approach the problem in three steps. First, I pre-process the datasets. Second, I use unsupervised learning techniques to create clusters of customers and of the general population, and then identify the differences between them. Finally, I predict whether or not a person will become a customer.

Data Exploration

There are four data files associated with this project:

Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).

Descriptive statistics for the first few attributes of AZDIAS data set

Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).

Descriptive statistics for the first few attributes of CUSTOMERS data set

Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Data Pre-Processing

The following steps were performed in the data pre-processing stage:

Explore the Missing Values.

Drop columns in which more than 65% of the data is missing.

Drop object fields that have too many distinct values.
Drop the outlier columns (in terms of the proportion of values that are missing).
Compute the correlation matrix and identify columns to drop based on a threshold.

With the sparse columns dropped, the focus shifts to converting object/categorical variables to numbers via one-hot encoding.
Re-encode categorical variables by creating dummy variables.
Re-encode some of the mixed-type features and drop others for simplicity.
Replace all remaining missing values with the column median using the Imputer class.
Perform feature scaling using StandardScaler.
Create a cleaning function to be reused later for the customer demographics data.
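The steps above can be sketched as a single cleaning function. This is only a minimal illustration, not the code used in the project: the function name clean_data, the toy DataFrame, and its column names are hypothetical, and it uses SimpleImputer, which replaced the older Imputer class in recent scikit-learn versions. The correlation-based drop and the mixed-type re-encoding are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def clean_data(df, missing_threshold=0.65):
    """Drop sparse columns, one-hot encode, impute medians, and scale.

    Hypothetical helper: the 0.65 default mirrors the 65% rule in the text.
    """
    # Keep only columns whose share of missing values is within the threshold
    missing_share = df.isnull().mean()
    df = df.loc[:, missing_share <= missing_threshold]

    # One-hot encode object/categorical columns as dummy variables
    df = pd.get_dummies(df, dtype=float)

    # Replace remaining missing values with the column median, then standardize
    imputed = SimpleImputer(strategy="median").fit_transform(df)
    scaled = StandardScaler().fit_transform(imputed)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)


# Toy data standing in for the demographics files (made-up columns)
demo = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 31.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "region": ["W", "O", "W", np.nan],
})
cleaned = clean_data(demo)
print(cleaned.columns.tolist())
```

One simplification to keep in mind: this sketch fits the imputer and scaler inside the function, so calling it on the customer data would rescale that data with its own statistics. Reusing the objects fitted on the general population is probably the safer design when the two datasets are compared.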

Customer Segmentation using Unsupervised Machine Learning

Principal Component Analysis (PCA)

PCA was performed because it is one of the most useful techniques in exploratory data analysis: it reduces the number of dimensions while retaining most of the information in the dataset.

With PCA, we want the retained components to capture as much of the variance in the data as possible, so that critical information is not lost when dimensions are reduced. The explained-variance chart shows that at around 220 components the cumulative variance is still high.
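As a rough illustration of how the cumulative explained variance can be inspected, the sketch below fits PCA on synthetic data and finds the smallest number of components that reaches a target share of variance. The random matrix and the 90% target are assumptions made for the example; they are not the project's data or threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled demographics matrix (hypothetical data)
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
target = 0.90  # assumed target, chosen only for illustration
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, round(float(cumulative[n_components - 1]), 3))
```

Plotting cumulative against the component index gives the kind of chart discussed above.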

Using k-means Clustering

Next, I performed k-means clustering. The k-means algorithm groups the dataset into a user-specified number of clusters (k). The elbow method can be used to choose that number. Since the plot below shows no clear elbow, I chose 10 clusters, which seemed the most appropriate number.

Clustering scores for various # of clusters

To decide on the number of clusters, I tried the elbow method: k-means clustering is applied to the dataset, and the average within-cluster distance from each point to its assigned cluster’s centroid is used to decide how many clusters to keep.

Sklearn’s KMeans class is used to perform k-means clustering on the PCA-transformed data.
Then, the average distance from each point to its assigned cluster’s center is computed.
These two steps are repeated for 30 different cluster counts to see how the average distance decreases as the number of clusters increases.
Once the final number of clusters is selected, the KMeans instance is re-fit to perform the clustering operation.
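The procedure above could look roughly like the following sketch. The helper name average_distance_to_centroid and the synthetic blobs are hypothetical, and only six cluster counts are tried here to keep the example fast, whereas the project scanned 30.

```python
import numpy as np
from sklearn.cluster import KMeans


def average_distance_to_centroid(X, k, seed=0):
    """Fit k-means with k clusters and return the mean point-to-centroid distance."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Look up each point's assigned centroid and measure the Euclidean distance
    assigned_centers = model.cluster_centers_[model.labels_]
    return float(np.linalg.norm(X - assigned_centers, axis=1).mean())


rng = np.random.default_rng(0)
# Three well-separated synthetic blobs standing in for PCA-transformed data
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-5, 0, 5)])

scores = {k: average_distance_to_centroid(X, k) for k in range(1, 7)}
for k, score in scores.items():
    print(k, round(score, 3))
```

On data like this the average distance drops sharply up to three clusters and flattens afterwards, which is the "elbow". On real demographic data the curve is often much smoother, as noted above.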

Analyze Clustering Data

I first clustered the data based on the demographics of the general population of Germany and then mapped the customer data for the mail-order sales company onto those demographic clusters. I then compared the two cluster distributions to see where the company’s strongest customer base lies.

Comparison of clusters between general population and customers
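One way to make this comparison concrete is to compute the share of each cluster in both datasets and look at the difference. The function name cluster_share_difference and the tiny label arrays below are hypothetical; in the project, the labels would come from the fitted KMeans model.

```python
import pandas as pd


def cluster_share_difference(population_labels, customer_labels, n_clusters):
    """Return per-cluster shares for both groups and their difference."""
    pop = pd.Series(population_labels).value_counts(normalize=True)
    cust = pd.Series(customer_labels).value_counts(normalize=True)
    table = pd.DataFrame({"population": pop, "customers": cust})
    # Make sure every cluster appears, even if one group has no members in it
    table = table.reindex(range(n_clusters)).fillna(0.0)
    # Positive values mark clusters where customers are overrepresented
    table["difference"] = table["customers"] - table["population"]
    return table.sort_values("difference", ascending=False)


# Made-up cluster assignments for illustration only
population = [0, 0, 1, 1, 2, 2, 2, 2]
customers = [0, 0, 0, 1]
table = cluster_share_difference(population, customers, n_clusters=3)
print(table)
```

Clusters at the top of the table are overrepresented among customers, and clusters at the bottom are underrepresented.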

Metrics

For XGBRegressor, the training score keeps decreasing and the testing score keeps increasing, whereas for RandomForestRegressor the training score remains constant at a high level and the cross-validation score stays low. The best model is therefore XGBRegressor, because RandomForestRegressor overfits the data.

Supervised Learning Model

After applying unsupervised learning methods to the demographics and customer data, it is now time to use supervised learning techniques to predict whether an individual will respond positively to a marketing campaign.

Data files:

Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Imbalanced classes:

The training data for the classification task consisted of 42,962 rows.

The response column has 42,430 negative responses (98.8%) and only 532 positive responses (1.2%). This is an extremely imbalanced dataset, which affects both the training of the classification model and the choice of metrics.
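To see why the imbalance matters for the choice of metric, consider a baseline that predicts "no response" for everyone. The sketch below uses only the class counts quoted above. Comparing against ROC AUC is my own choice for the illustration, not necessarily the metric used in the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts taken from the text: 42,430 negatives and 532 positives
y_true = np.array([0] * 42430 + [1] * 532)

# A trivial baseline that predicts "no response" for every individual
y_pred = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, constant_scores)
print(round(float(accuracy), 4), float(auc))
```

The baseline reaches an accuracy of roughly 98.8% without identifying a single responder, while its ROC AUC is 0.5, the same as random guessing. A ranking-based metric is therefore usually more informative than accuracy on data like this.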

1) Data preparation

First, the dependent variable “RESPONSE” was extracted from Udacity_MAILOUT_052018_TRAIN.csv. The remaining features were then pre-processed using the steps described in the Data Pre-Processing section. The independent variables (X) include the 20 PCA components and the clustering labels. The dependent variable (y) is the RESPONSE column. Finally, the data was split into train (80%) and test (20%) subsets.
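The 80/20 split can be sketched as follows. The feature matrix here is random, hypothetical data with about 1.2% positives, and passing stratify=y is an assumption on my part: it keeps the rare positive class present in both subsets, but the write-up does not say whether the original split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features (e.g. PCA components plus a cluster label)
X = rng.normal(size=(1000, 5))
# Roughly 1.2% positives, mimicking the imbalance of the RESPONSE column
y = np.zeros(1000, dtype=int)
y[:12] = 1

# 80% train, 20% test; stratify=y (an assumption) preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape, int(y_train.sum()), int(y_test.sum()))
```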

2) Classifier evaluation

Several ensemble methods were tested with default parameters to choose the best classifier.
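A comparison of this kind could be set up along the following lines. This is a sketch under several assumptions: the data is synthetic, only the two scikit-learn ensembles are included (XGBoost is omitted to keep the example dependency-free), and ROC AUC with stratified three-fold cross-validation is my choice of evaluation, not necessarily the one used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the prepared training set
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Candidate ensembles with default parameters (only the seed is fixed)
candidates = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the minority class represented in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {
    name: float(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
    for name, model in candidates.items()
}
for name, score in results.items():
    print(name, round(score, 3))
```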

3) Modeling

I used the following modeling methods: AdaBoost, Gradient Boosting, and XGBoost. Gradient Boosting tended to perform better than the other methods, yet the final result was still not impressive. The score in the Kaggle competition was 0.79908. This could be due to the imbalanced data. In addition, some variables are reverse-coded and have not been cleaned thoroughly.

Improvements

To improve the accuracy of the project, we could take the following steps:

Drop some variables by using correlation coefficient matrix to avoid the collinearity issue.
Increase the threshold to drop rows and columns.
Use other algorithms and try more parameters in the GridSearch.
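As an illustration of the last point, a small grid search over a Gradient Boosting Classifier might look like the sketch below. The parameter grid and the synthetic data are made up for the example, and scoring="roc_auc" is an assumption of mine (the write-up states that accuracy was used for tuning).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the prepared training set
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

# A deliberately tiny, hypothetical grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # assumed metric, see the note above
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(float(search.best_score_), 3))
```

Widening the grid, or swapping in another estimator, only requires changing param_grid and the first argument to GridSearchCV.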

There are many other ways to improve this project. For example, the data could be pre-processed differently: choose another threshold for dropping rows and columns, choose different transformations for the columns, apply MinMaxScaler instead of StandardScaler, or impute the data in another way.

The supervised model could also be improved by experimenting with PCA dimensionality reduction. Alternatively, we could select the attributes that differ most between the overrepresented and underrepresented clusters and build the supervised model using only those attributes.

Conclusion

This real-life demographic data provided by Arvato Financial Services gave me the chance to create a customer segmentation, and I was able to identify key features that will help the company identify customers.

Principal Component Analysis (PCA) and k-means clustering were used to create the demographic segmentation report comparing customer data with general population data.

During supervised learning, because the data is highly imbalanced, the StratifiedKFold method was applied to split the data into train and test sets, and the Gradient Boosting Classifier was then used to predict customers’ responses based on demographic data. For more details, click here.

Finally, I would like to thank Udacity and Arvato for this real-world learning opportunity! I have learned a lot and I will continue my learning process and obtain the skills needed to become a data scientist.