The dataset we use here is an e-commerce dataset. It has 8 columns and 541,909 rows. We will keep only the InvoiceNo, Quantity, InvoiceDate, UnitPrice and CustomerID columns at the beginning; the Quantity and UnitPrice columns will also be dropped later, after we use them to calculate the total amount. We try to answer a single question: how do we segment our customers into different clusters, and how do we interpret the result?
In [30]:
# Packages import
import pandas as pd
import numpy as np
from tabulate import tabulate
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,calinski_harabasz_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
In [31]:
# import data
df = pd.read_csv("RFMdata.csv",encoding='latin-1')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Original dataframe
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Original dataframe
 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country
---|---|---|---|---|---|---|---|---
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom
In [32]:
# By looking at the info below, we see there are a lot of missing CustomerID values. Since we analyze customers, the rows with a missing CustomerID
# need to be dropped. There are also some NAs in the Description column, but we ignore them as we will drop that column anyway. We also drop the duplicates.
print(df.isna().sum())
df.dropna(subset=['CustomerID'], inplace=True)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True,inplace=True)
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
In [33]:
# Outliers and abnormal data check.
print('\n*** Describe Quantity & UnitPrice ***')
print(df[['Quantity','UnitPrice']].describe())
# The above description tells us that there is at least one negative number in the Quantity column, which shouldn't happen. Let's check further.
print('\n\t\t\t\t\t\t*** Smallest 5 rows in Quantity ***')
print(df[df['Quantity']<=0].sort_values(by='Quantity').head(5))
print('\n\t\t\t\t\t\t*** Biggest 5 rows in Quantity ***')
print(df[df['Quantity']>0].sort_values(by='Quantity',ascending=False).head(5))
# From the printed info, we can tell that rows 401131 & 401132 look like a mistake that was corrected on the same day. The same holds for rows 37511 & 37516.
df.drop([401131,401132,37511,37516],inplace=True)
# Drop the rows with negative & 0 Quantity
df = df[df.Quantity>0]
*** Describe Quantity & UnitPrice ***
            Quantity      UnitPrice
count  401604.000000  401604.000000
mean       12.183273       3.474064
std       250.283037      69.764035
min    -80995.000000       0.000000
25%         2.000000       1.250000
50%         5.000000       1.950000
75%        12.000000       3.750000
max     80995.000000   38970.000000

						*** Smallest 5 rows in Quantity ***
       InvoiceNo StockCode                          Description  Quantity      InvoiceDate  UnitPrice  CustomerID         Country
401132   C581484     23843          PAPER CRAFT , LITTLE BIRDIE    -80995    12/9/2011 9:27       2.08     16446.0  United Kingdom
37516    C541433     23166       MEDIUM CERAMIC TOP STORAGE JAR    -74215   1/18/2011 10:17       1.04     12346.0  United Kingdom
3041     C536757     84347  ROTATING SILVER ANGELS T-LIGHT HLDR     -9360   12/2/2010 14:23       0.03     15838.0  United Kingdom
110949   C550456     21108   FAIRY CAKE FLANNEL ASSORTED COLOUR     -3114   4/18/2011 13:08       2.10     15749.0  United Kingdom
110948   C550456     21175          GIN + TONIC DIET METAL SIGN     -2000   4/18/2011 13:08       1.85     15749.0  United Kingdom

						*** Biggest 5 rows in Quantity ***
       InvoiceNo StockCode                         Description  Quantity       InvoiceDate  UnitPrice  CustomerID         Country
401131    581483     23843         PAPER CRAFT , LITTLE BIRDIE     80995     12/9/2011 9:15       2.08     16446.0  United Kingdom
37511     541431     23166      MEDIUM CERAMIC TOP STORAGE JAR     74215    1/18/2011 10:01       1.04     12346.0  United Kingdom
374208    578841     84826      ASSTD DESIGN 3D PAPER STICKERS     12540   11/25/2011 15:57       0.00     13256.0  United Kingdom
312064    573008     84077   WORLD WAR 2 GLIDERS ASSTD DESIGNS      4800   10/27/2011 12:26       0.21     12901.0  United Kingdom
145108    554868     22197                SMALL POPCORN HOLDER      4300    5/27/2011 10:52       0.72     13135.0  United Kingdom
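The negative-quantity rows all have an InvoiceNo starting with 'C', which appears to mark cancellations in this dataset. As an optional, hedged sanity check (not part of the original cell), one could count those cancellation rows right before the `df = df[df.Quantity>0]` filter:

# Hedged sketch: count cancellation invoices (InvoiceNo starting with 'C')
# before removing non-positive quantities. This is a sanity check only,
# run against the column names used above.
cancelled = df[df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"{cancelled.shape[0]} cancellation rows, "
      f"{cancelled['InvoiceNo'].nunique()} distinct cancellation invoices")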
In [34]:
# Drop the columns not used in this case
df.drop(columns=['StockCode','Description','Country'],inplace=True)
# Create a new Total column and drop the two columns used to calculate the Total
df['Total'] = df.Quantity * df.UnitPrice
df.drop(columns=['Quantity','UnitPrice'],inplace=True)
# By looking at the data types, we can tell that we might need to change the data types of 'InvoiceDate' and 'CustomerID' columns.
print('\n*** Data types check ***')
print(df.dtypes)
# Change Invoicedate column data type to datetime then get the date part only
df.InvoiceDate = pd.to_datetime(df.InvoiceDate).apply(lambda x: x.date())
# Check if all CustomerIDs are whole numbers. If that is true, change the data type to int.
print('\n*** Int, Float type check ***')
print('Are all the CustomerIDs Int data type? '+str(all(df.CustomerID%1==0)))
df.CustomerID = df.CustomerID.astype('int')
# If customers don't have any transactions within the last 180 days, we classify them as churned customers and exclude them from the analysis.
LastDate = df['InvoiceDate'].max()
dfCut = df[df['InvoiceDate'].apply(lambda x: (LastDate-x).days<180)]
# Check the final dataframe.
# print(tabulate(dfCut.head(2), headers='keys', tablefmt='html', numalign="center"))
*** Data types check ***
InvoiceNo       object
InvoiceDate     object
CustomerID     float64
Total          float64
dtype: object

*** Int, Float type check ***
Are all the CustomerIDs Int data type? True
Final dataframe
 | InvoiceNo | InvoiceDate | CustomerID | Total
---|---|---|---|---
158496 | 556500 | 2011-06-13 00:00:00 | 13468 | 19.9
158497 | 556500 | 2011-06-13 00:00:00 | 13468 | 19.8
In [35]:
# Recency
RFM = dfCut[['InvoiceDate','CustomerID']].groupby('CustomerID',as_index=False)['InvoiceDate'].apply(lambda x: (LastDate-x.max()).days)
# Frequency
Frequency = dfCut[['InvoiceNo','CustomerID']].groupby('CustomerID',as_index=False).size()
# Monetary
Monetary = df[['CustomerID','Total']].groupby('CustomerID',as_index=False).sum()
RFM = RFM.merge(Frequency).merge(Monetary)
RFM.rename(columns={'InvoiceDate':'Recency','size':'Frequency','Total':'Monetary'},inplace=True)
print(RFM.describe())
         CustomerID      Recency    Frequency       Monetary
count   3477.000000  3477.000000  3477.000000    3477.000000
mean   15293.794363    48.365257    68.465056    2347.880930
std     1720.445547    45.318659   175.370319    9454.531044
min    12347.000000     0.000000     1.000000       0.000000
25%    13813.000000    14.000000    15.000000     382.980000
50%    15297.000000    32.000000    35.000000     855.910000
75%    16779.000000    71.000000    75.000000    2001.040000
max    18287.000000   179.000000  5391.000000  280206.020000
K-means gives the best results under the following conditions. The order of the transformations matters: standardization produces negative values, and the log transformation only works on positive data, so we remove the skew first and standardize afterwards (a condensed sketch of this order follows the list):
- The data's distribution is not heavily skewed.
- The data is standardised (i.e. mean of 0 and standard deviation of 1).
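As a compact illustration of that order (an optional sketch, not a replacement for the cells below; it uses np.log1p instead of adding a small constant), the two steps could be chained like this:

# Hedged sketch of the "de-skew first, then standardize" order.
# np.log1p(x) = log(1 + x) tolerates the zero values in Recency and Monetary
# without an additive constant (an assumption, not the approach used below).
import numpy as np                               # already imported above
from sklearn.preprocessing import StandardScaler  # already imported above
deskewed = RFM[['Recency', 'Frequency', 'Monetary']].apply(np.log1p)  # 1) remove skew on positive data
scaled = StandardScaler().fit_transform(deskewed)                      # 2) center to mean 0, scale to std 1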
In [36]:
# Check skewness
RFMColumns = ['Recency','Frequency','Monetary']
RFMLog = RFM[RFMColumns]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in zip(RFMLog.columns.to_list(),axes):
sns.histplot(RFMLog[i], ax=ax, kde=True, stat="density", linewidth=0)
plt.show()
In [37]:
# Remove skewness
RFMLogCopy = RFMLog.copy() # Work on a copy of the dataframe to avoid a SettingWithCopyWarning
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in zip(RFMLogCopy.columns.to_list(),axes):
small_const = 1e-10
RFMLogCopy[i] = pd.Series(np.log(RFMLogCopy[i]+small_const))
sns.histplot(RFMLogCopy[i], ax=ax, kde=True, stat="density", linewidth=0)
plt.show()
In [38]:
# Standardize - Combining centering and scaling using a scaler
scaler = StandardScaler()
RFMSt = pd.DataFrame(scaler.fit_transform(RFMLog))
RFMSt.columns = RFMLog.columns.to_list()
print(RFMSt.describe().round(2))
       Recency  Frequency  Monetary
count  3477.00    3477.00   3477.00
mean     -0.00      -0.00     -0.00
std       1.00       1.00      1.00
min      -1.07      -0.38     -0.25
25%      -0.76      -0.30     -0.21
50%      -0.36      -0.19     -0.16
75%       0.50       0.04     -0.04
max       2.88      30.35     29.39
In [39]:
# Use the elbow method (inertia), silhouette score and Calinski-Harabasz score to determine the optimal K.
Inertia, SHScores, CHScores = [], [], []
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=1)
    pre = model.fit_predict(RFMSt)
    Inertia.append(model.inertia_)
    SHScores.append(silhouette_score(RFMSt, pre))
    CHScores.append(calinski_harabasz_score(RFMSt, pre))
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for name, values, ax in zip(['Inertia', 'SHScores', 'CHScores'], [Inertia, SHScores, CHScores], axes):
    sns.pointplot(x=list(range(2, 11)), y=values, ax=ax)
    ax.set_xlabel('K')
    ax.set_title(name)
plt.show()
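The choice of K=4 below comes from eyeballing the three plots. If a programmatic starting point is wanted, one hedged heuristic (an addition here, not part of the original workflow) is to take the K with the highest silhouette score and then cross-check it against the elbow and Calinski-Harabasz curves:

# Hedged heuristic only: the silhouette-optimal K is a starting point,
# not a substitute for inspecting the three plots above.
best_k = list(range(2, 11))[int(np.argmax(SHScores))]
print(f'K with the highest silhouette score: {best_k}')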
In [40]:
# Use scatterplot3D to check the clusters
model = KMeans(n_clusters=4,random_state=1231).fit(RFMSt)
RFM['Clusters'] = model.labels_
print(RFM.head())
# Build one sub-dataframe per cluster
categoryDfList = [RFM.loc[RFM['Clusters'] == i, RFMColumns] for i in range(4)]
plt.figure(figsize=(15,15))
ax = plt.subplot(121, projection='3d')
colorList = ['y','r','g','b']
for i in range(len(categoryDfList)):
ax.scatter(categoryDfList[i][[RFMColumns[0]]],categoryDfList[i][[RFMColumns[1]]],categoryDfList[i][[RFMColumns[2]]],c=colorList[i])
plt.gca().set(title = '3D scatterplot for K=4 ', xlabel='Recency', ylabel='Frequency',zlabel='Monetary' )
# Format number in thousands using k.
new_ticks = [f'{int(tick/1000)}k' for tick in ax.get_zticks()]
ax.set_zticks(ax.get_zticks()) # Explicitly set the ticks to avoid warning
ax.set_zticklabels(new_ticks)
# Set X,Y,Z Axes box aspect
ax.set_box_aspect([1,1,1.11])
plt.show()
   CustomerID  Recency  Frequency  Monetary  Clusters
0       12347        2         80   4310.00         0
1       12348       75          3   1797.24         2
2       12349       18         73   1757.55         0
3       12352       36         47   2506.04         0
4       12356       22          2   2811.43         0
In [41]:
# Get the statistics for the clusters
ClustersInfo = RFM.groupby('Clusters').agg({'Recency':'mean','Frequency':'mean','Monetary':'mean','CustomerID':'sum'})
print(ClustersInfo)
             Recency    Frequency       Monetary  CustomerID
Clusters
0          25.466327    75.916210    2384.101641    39068949
1           1.500000  4375.250000   70612.247500       59596
2         112.775711    25.803063     892.514980    13971209
3           8.400000   615.600000  195275.744000       76769
From the above statistics, we can briefly describe the different customer groups (a short sketch mapping these cluster numbers to segment names follows the list).
- Customers in cluster 3 are our gold customers: they spent the most, bought quite frequently and made a purchase recently.
- Customers in cluster 1 are our silver customers: they spent quite a lot, bought the most frequently and made a purchase recently.
- Customers in cluster 0 are our casual customers: they spent some money, purchased occasionally and made their last purchase not long ago.
- Customers in cluster 2 are customers about to churn: they spent the least, purchased little and made their last purchase a long time ago.
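To make the segmentation easier to act on, the numeric labels can be attached to segment names. The mapping below is an assumption based only on the interpretation above; cluster numbers depend on the random_state used when fitting, so it must be re-checked whenever the model is refitted:

# Hedged sketch: attach human-readable segment names to each customer.
# The cluster-number-to-name mapping mirrors the interpretation above.
segment_names = {3: 'Gold', 1: 'Silver', 0: 'Casual', 2: 'About to churn'}
RFM['Segment'] = RFM['Clusters'].map(segment_names)
print(RFM['Segment'].value_counts())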