The dataset we use here is an e-commerce dataset. It has 8 columns and 541,909 rows. We will keep only the InvoiceNo, Quantity, InvoiceDate, UnitPrice and CustomerID columns at the beginning; the Quantity and UnitPrice columns will also be dropped later, after we use them to calculate the total amount. We try to answer a single question: how do we segment our customers into different clusters, and how do we interpret the result?
In [30]:
# Packages import
import pandas as pd
import numpy as np
from tabulate import tabulate
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,calinski_harabasz_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
In [31]:
# import data
df = pd.read_csv("RFMdata.csv",encoding='latin-1')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Original dataframe
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Original dataframe
 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country
---|---|---|---|---|---|---|---|---
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom
In [32]:
# By looking at the info below, we see there are a lot of missing CustomerID values. Since we analyze customers, the rows with a missing CustomerID
# need to be dropped. There are also some NAs in the Description column, but we ignore them as we will drop that column anyway. We also drop the duplicates.
print(df.isna().sum())
df.dropna(subset=['CustomerID'], inplace=True)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True,inplace=True)
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
In [33]:
# Outliers and abnormal data check.
print('\n*** Describe Quantity & UnitPrice ***')
print(df[['Quantity','UnitPrice']].describe())
# The above description tells us that there is at least one negative number in the Quantity column, which shouldn't happen. Let's check further.
print('\n\t\t\t\t\t\t*** Smallest 5 rows in Quantity ***')
print(df[df['Quantity']<=0].sort_values(by='Quantity').head(5))
print('\n\t\t\t\t\t\t*** Biggest 5 rows in Quantity ***')
print(df[df['Quantity']>0].sort_values(by='Quantity',ascending=False).head(5))
# From the printed info, we can tell that rows 401131 & 401132 look like a mistake that was corrected on the same day. The same holds for rows 37511 & 37516.
df.drop([401131,401132,37511,37516],inplace=True)
# Drop the rows with negative & 0 Quantity
df = df[df.Quantity>0]
*** Describe Quantity & UnitPrice ***
            Quantity      UnitPrice
count  401604.000000  401604.000000
mean       12.183273       3.474064
std       250.283037      69.764035
min    -80995.000000       0.000000
25%         2.000000       1.250000
50%         5.000000       1.950000
75%        12.000000       3.750000
max     80995.000000   38970.000000

						*** Smallest 5 rows in Quantity ***
       InvoiceNo StockCode                          Description  Quantity      InvoiceDate  UnitPrice  CustomerID         Country
401132   C581484     23843          PAPER CRAFT , LITTLE BIRDIE    -80995    12/9/2011 9:27       2.08     16446.0  United Kingdom
37516    C541433     23166       MEDIUM CERAMIC TOP STORAGE JAR    -74215   1/18/2011 10:17       1.04     12346.0  United Kingdom
3041     C536757     84347  ROTATING SILVER ANGELS T-LIGHT HLDR     -9360   12/2/2010 14:23       0.03     15838.0  United Kingdom
110949   C550456     21108   FAIRY CAKE FLANNEL ASSORTED COLOUR     -3114   4/18/2011 13:08       2.10     15749.0  United Kingdom
110948   C550456     21175          GIN + TONIC DIET METAL SIGN     -2000   4/18/2011 13:08       1.85     15749.0  United Kingdom

						*** Biggest 5 rows in Quantity ***
       InvoiceNo StockCode                         Description  Quantity       InvoiceDate  UnitPrice  CustomerID         Country
401131    581483     23843         PAPER CRAFT , LITTLE BIRDIE     80995     12/9/2011 9:15       2.08     16446.0  United Kingdom
37511     541431     23166      MEDIUM CERAMIC TOP STORAGE JAR     74215    1/18/2011 10:01       1.04     12346.0  United Kingdom
374208    578841     84826      ASSTD DESIGN 3D PAPER STICKERS     12540   11/25/2011 15:57       0.00     13256.0  United Kingdom
312064    573008     84077   WORLD WAR 2 GLIDERS ASSTD DESIGNS      4800   10/27/2011 12:26       0.21     12901.0  United Kingdom
145108    554868     22197                SMALL POPCORN HOLDER      4300    5/27/2011 10:52       0.72     13135.0  United Kingdom
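The negative-quantity rows all have an InvoiceNo starting with 'C', which appears to mark cancellations in this dataset. As an optional, hedged sanity check (not part of the original cell), one could count those cancellation rows right before the `df = df[df.Quantity>0]` filter:

# Hedged sketch: count cancellation invoices (InvoiceNo starting with 'C')
# before removing non-positive quantities. This is a sanity check only,
# run against the column names used above.
cancelled = df[df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"{cancelled.shape[0]} cancellation rows, "
      f"{cancelled['InvoiceNo'].nunique()} distinct cancellation invoices")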
In [34]:
# Drop the columns not used in this case
df.drop(columns=['StockCode','Description','Country'],inplace=True)
# Create a new Total column and drop the two columns used to calculate the Total
df['Total'] = df.Quantity * df.UnitPrice
df.drop(columns=['Quantity','UnitPrice'],inplace=True)
# By looking at the data types, we can tell that we might need to change the data types of 'InvoiceDate' and 'CustomerID' columns.
print('\n*** Data types check ***')
print(df.dtypes)
# Change Invoicedate column data type to datetime then get the date part only
df.InvoiceDate = pd.to_datetime(df.InvoiceDate).apply(lambda x: x.date())
# Check if all CustomerIDs are whole numbers. If that is true, change the data type to int.
print('\n*** Int, Float type check ***')
print('Are all the CustomerIDs Int data type? '+str(all(df.CustomerID%1==0)))
df.CustomerID = df.CustomerID.astype('int')
# If customers don't have any transactions within the last 180 days, we classify them as churned customers and exclude them from the analysis.
LastDate = df['InvoiceDate'].max()
dfCut = df[df['InvoiceDate'].apply(lambda x: (LastDate-x).days<180)]
# Check the final dataframe.
# print(tabulate(dfCut.head(2), headers='keys', tablefmt='html', numalign="center"))
*** Data types check ***
InvoiceNo       object
InvoiceDate     object
CustomerID     float64
Total          float64
dtype: object

*** Int, Float type check ***
Are all the CustomerIDs Int data type? True
Final dataframe
 | InvoiceNo | InvoiceDate | CustomerID | Total
---|---|---|---|---
158496 | 556500 | 2011-06-13 00:00:00 | 13468 | 19.9
158497 | 556500 | 2011-06-13 00:00:00 | 13468 | 19.8
In [35]:
# Recency
RFM = dfCut[['InvoiceDate','CustomerID']].groupby('CustomerID',as_index=False)['InvoiceDate'].apply(lambda x: (LastDate-x.max()).days)
# Frequency
Frequency = dfCut[['InvoiceNo','CustomerID']].groupby('CustomerID',as_index=False).size()
# Monetary
Monetary = df[['CustomerID','Total']].groupby('CustomerID',as_index=False).sum()
RFM = RFM.merge(Frequency).merge(Monetary)
RFM.rename(columns={'InvoiceDate':'Recency','size':'Frequency','Total':'Monetary'},inplace=True)
print(RFM.describe())
         CustomerID      Recency    Frequency       Monetary
count   3477.000000  3477.000000  3477.000000    3477.000000
mean   15293.794363    48.365257    68.465056    2347.880930
std     1720.445547    45.318659   175.370319    9454.531044
min    12347.000000     0.000000     1.000000       0.000000
25%    13813.000000    14.000000    15.000000     382.980000
50%    15297.000000    32.000000    35.000000     855.910000
75%    16779.000000    71.000000    75.000000    2001.040000
max    18287.000000   179.000000  5391.000000  280206.020000
K-means gives the best results under the following conditions. The order of the transformations matters: standardization produces negative values, and the log transformation only works on positive data, so we remove the skew first and standardize afterwards (a condensed sketch of this order follows the list):
- The data's distribution is not heavily skewed.
- The data is standardised (i.e. mean of 0 and standard deviation of 1).
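As a compact illustration of that order (an optional sketch, not a replacement for the cells below; it uses np.log1p instead of adding a small constant), the two steps could be chained like this:

# Hedged sketch of the "de-skew first, then standardize" order.
# np.log1p(x) = log(1 + x) tolerates the zero values in Recency and Monetary
# without an additive constant (an assumption, not the approach used below).
import numpy as np                               # already imported above
from sklearn.preprocessing import StandardScaler  # already imported above
deskewed = RFM[['Recency', 'Frequency', 'Monetary']].apply(np.log1p)  # 1) remove skew on positive data
scaled = StandardScaler().fit_transform(deskewed)                      # 2) center to mean 0, scale to std 1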
In [36]:
# Check skewness
RFMColumns = ['Recency','Frequency','Monetary']
RFMLog = RFM[RFMColumns]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in zip(RFMLog.columns.to_list(),axes):
sns.histplot(RFMLog[i], ax=ax, kde=True, stat="density", linewidth=0)
plt.show()
In [37]:
# Remove skewness
RFMLogCopy = RFMLog.copy() # Work on a copy of the dataframe to avoid a SettingWithCopyWarning
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in zip(RFMLogCopy.columns.to_list(),axes):
small_const = 1e-10
RFMLogCopy[i] = pd.Series(np.log(RFMLogCopy[i]+small_const))
sns.histplot(RFMLogCopy[i], ax=ax, kde=True, stat="density", linewidth=0)
plt.show()
In [38]:
# Standardize - Combining centering and scaling using a scaler
scaler = StandardScaler()
RFMSt = pd.DataFrame(scaler.fit_transform(RFMLog))
RFMSt.columns = RFMLog.columns.to_list()
print(RFMSt.describe().round(2))
       Recency  Frequency  Monetary
count  3477.00    3477.00   3477.00
mean     -0.00      -0.00     -0.00
std       1.00       1.00      1.00
min      -1.07      -0.38     -0.25
25%      -0.76      -0.30     -0.21
50%      -0.36      -0.19     -0.16
75%       0.50       0.04     -0.04
max       2.88      30.35     29.39
In [39]:
# Use the elbow method (inertia), silhouette score and Calinski-Harabasz score to determine the optimal K.
Inertia, SHScores, CHScores = [], [], []
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=1)
    pre = model.fit_predict(RFMSt)
    Inertia.append(model.inertia_)
    SHScores.append(silhouette_score(RFMSt, pre))
    CHScores.append(calinski_harabasz_score(RFMSt, pre))
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for name, values, ax in zip(['Inertia', 'SHScores', 'CHScores'], [Inertia, SHScores, CHScores], axes):
    sns.pointplot(x=list(range(2, 11)), y=values, ax=ax)
    ax.set_xlabel('K')
    ax.set_title(name)
plt.show()
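The choice of K=4 below comes from eyeballing the three plots. If a programmatic starting point is wanted, one hedged heuristic (an addition here, not part of the original workflow) is to take the K with the highest silhouette score and then cross-check it against the elbow and Calinski-Harabasz curves:

# Hedged heuristic only: the silhouette-optimal K is a starting point,
# not a substitute for inspecting the three plots above.
best_k = list(range(2, 11))[int(np.argmax(SHScores))]
print(f'K with the highest silhouette score: {best_k}')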
In [40]:
# Use scatterplot3D to check the clusters
model = KMeans(n_clusters=4,random_state=1231).fit(RFMSt)
RFM['Clusters'] = model.labels_
print(RFM.head())
# Build one sub-dataframe per cluster
categoryDfList = [RFM.loc[RFM['Clusters'] == i, RFMColumns] for i in range(4)]
plt.figure(figsize=(15,15))
ax = plt.subplot(121, projection='3d')
colorList = ['y','r','g','b']
for i in range(len(categoryDfList)):
ax.scatter(categoryDfList[i][[RFMColumns[0]]],categoryDfList[i][[RFMColumns[1]]],categoryDfList[i][[RFMColumns[2]]],c=colorList[i])
plt.gca().set(title = '3D scatterplot for K=4 ', xlabel='Recency', ylabel='Frequency',zlabel='Monetary' )
# Format number in thousands using k.
new_ticks = [f'{int(tick/1000)}k' for tick in ax.get_zticks()]
ax.set_zticks(ax.get_zticks()) # Explicitly set the ticks to avoid warning
ax.set_zticklabels(new_ticks)
# Set X,Y,Z Axes box aspect
ax.set_box_aspect([1,1,1.11])
plt.show()
   CustomerID  Recency  Frequency  Monetary  Clusters
0       12347        2         80   4310.00         0
1       12348       75          3   1797.24         2
2       12349       18         73   1757.55         0
3       12352       36         47   2506.04         0
4       12356       22          2   2811.43         0
In [41]:
# Get the statistics for the clusters
ClustersInfo = RFM.groupby('Clusters').agg({'Recency':'mean','Frequency':'mean','Monetary':'mean','CustomerID':'sum'})
print(ClustersInfo)
             Recency    Frequency       Monetary  CustomerID
Clusters
0          25.466327    75.916210    2384.101641    39068949
1           1.500000  4375.250000   70612.247500       59596
2         112.775711    25.803063     892.514980    13971209
3           8.400000   615.600000  195275.744000       76769
From the above statistics, we can briefly describe the different customer groups (a short sketch mapping these cluster numbers to segment names follows the list).
- Customers in cluster 3 are our gold customers: they spent the most, bought quite frequently and made a purchase recently.
- Customers in cluster 1 are our silver customers: they spent quite a lot, bought the most frequently and made a purchase recently.
- Customers in cluster 0 are our casual customers: they spent some money, purchased occasionally and made their last purchase not long ago.
- Customers in cluster 2 are customers about to churn: they spent the least, purchased little and made their last purchase a long time ago.
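To make the segmentation easier to act on, the numeric labels can be attached to segment names. The mapping below is an assumption based only on the interpretation above; cluster numbers depend on the random_state used when fitting, so it must be re-checked whenever the model is refitted:

# Hedged sketch: attach human-readable segment names to each customer.
# The cluster-number-to-name mapping mirrors the interpretation above.
segment_names = {3: 'Gold', 1: 'Silver', 0: 'Casual', 2: 'About to churn'}
RFM['Segment'] = RFM['Clusters'].map(segment_names)
print(RFM['Segment'].value_counts())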