The dataset used here is an e-commerce dataset with about 20 million rows and 6 columns. user_id is a numeric user ID. item_id is a numeric item ID. behavior_type is a single number representing the type of user activity on the website: 1 = view, 2 = add to favorite, 3 = add to cart, 4 = purchase. user_geohash is hashed location data. item_category is a numeric code for the item category. time holds the date and hour at which each activity happened.
| | user_id | item_id | behavior_type | user_geohash | item_category | time |
|---|---|---|---|---|---|---|
| 0 | 10001082 | 285259775 | 1 | 97lk14c | 4076 | 2014-12-08 18 |
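For reference throughout the analysis, the behavior codes can be decoded with a small mapping; this is an illustrative sketch only (behavior_labels is a hypothetical helper, not part of the original notebook):

# Hypothetical helper: map the numeric behavior codes described above to readable labels.
behavior_labels = {1: 'view', 2: 'add to favorite', 3: 'add to cart', 4: 'purchase'}
# Example (once df is loaded): df['behavior_name'] = df['behavior_type'].map(behavior_labels)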
We try to figure out the following through this case study:
- Trend of daily page views and of the total number of users visiting the website
- What is the best time range in a day to run promotions?
- What is the overall conversion rate, and how does it compare to the industry standard?
- What are the users' activity paths?
- How can we segment the users?
In [15]:
# Packages import
import pandas as pd
import datetime
import kaleido
import plotly.graph_objects as go
import plotly.express as px
import random
from plotly.subplots import make_subplots
from IPython.display import Image
from tabulate import tabulate
In [16]:
# read data file
file_name = 'ECommerce.csv'
df = pd.read_csv(file_name)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# print original dataframe
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Original dataframe
| | user_id | item_id | behavior_type | user_geohash | item_category | time |
|---|---|---|---|---|---|---|
| 0 | 10001082 | 285259775 | 1 | 97lk14c | 4076 | 2014-12-08 18 |
| 1 | 10001082 | 4368907 | 1 | nan | 5503 | 2014-12-12 12 |
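A side note on loading: with roughly 20 million rows it can help to read only the columns needed and give pandas explicit dtypes. A minimal sketch, with assumed dtypes and the hypothetical name df_slim (the notebook itself simply reads the full file):

# Sketch: load only the needed columns with explicit dtypes to reduce memory usage.
# The dtype choices are assumptions, not part of the original notebook.
df_slim = pd.read_csv(file_name,
                      usecols=['user_id', 'behavior_type', 'time'],
                      dtype={'user_id': 'int64', 'behavior_type': 'int8'})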
In [17]:
# Duplicates check: since the time column only has hour-level precision, many rows are legitimate duplicates, so we skip duplicate removal.
# As we only need the user_id, behavior_type and time columns for this analysis, we drop the others.
df.drop(columns=['user_geohash','item_id','item_category'],inplace=True)
# null check. No null values found.
print('\n*** Null check ***')
print(df.isnull().sum())
# Transform the time column. As we want to do daily and hourly analysis, we split it into date and hour columns, then drop the original time column.
# The approach below was used initially. Since the dataset has more than 20 million rows, it takes about 4 minutes to finish this step.
# df['date'] = df['time'].apply(lambda x: x[:10]).apply(lambda x: datetime.datetime.strptime(str(x),"%Y-%m-%d"))
# df['hour'] = df['time'].apply(lambda x:x[11:]).apply(lambda x:int(x))
# The vectorized approach below is used instead for better performance: about 20 seconds of processing time instead of 4 minutes.
df.time = pd.to_datetime(df.time)
df['date'] = df.time.dt.date
df['hour'] = df.time.dt.hour
df.drop(columns=['time'],inplace=True)
# Outliers / abnormal data check. Use describe() to verify that behavior_type, date and hour fall within the expected ranges.
# As shown below, the data all look good because the expected ranges are: 1 <= behavior_type <= 4, 2014-11-18 <= date <= 2014-12-18, 0 <= hour <= 23.
print('\n\t\t*** Outliers check ***')
print((df[['behavior_type','date','hour']].describe(include='all',datetime_is_numeric=True).round(1)))
*** Null check ***

| column | null count |
|---|---|
| user_id | 0 |
| behavior_type | 0 |
| time | 0 |

*** Outliers check ***

| | behavior_type | date | hour |
|---|---|---|---|
| count | 23291027.0 | 23291027 | 23291027.0 |
| unique | NaN | 31 | NaN |
| top | NaN | 2014-12-12 | NaN |
| freq | NaN | 1344980 | NaN |
| mean | 1.1 | NaN | 14.9 |
| std | 0.5 | NaN | 6.4 |
| min | 1.0 | NaN | 0.0 |
| 25% | 1.0 | NaN | 11.0 |
| 50% | 1.0 | NaN | 16.0 |
| 75% | 1.0 | NaN | 20.0 |
| max | 4.0 | NaN | 23.0 |
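The same range checks can also be asserted programmatically instead of read off describe(); a minimal sketch under the ranges stated above (not part of the original notebook):

# Sketch: assert the expected value ranges directly after cleaning.
assert df['behavior_type'].between(1, 4).all()
assert df['hour'].between(0, 23).all()
assert pd.to_datetime(df['date']).between(pd.Timestamp('2014-11-18'), pd.Timestamp('2014-12-18')).all()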
In [18]:
# The cleaning and wrangling are finished. Let's check the final dataframe.
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Final dataframe
| | user_id | behavior_type | date | hour |
|---|---|---|---|---|
| 0 | 10001082 | 1 | 2014-12-08 | 18 |
| 1 | 10001082 | 1 | 2014-12-12 | 12 |
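To check the memory footprint of the cleaned dataframe, a quick sketch (not part of the original notebook):

# Sketch: report the in-memory size of the cleaned dataframe.
df.info(memory_usage='deep')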
In [19]:
# Daily page views and the number of users who visited the website each day
df['date'] = pd.to_datetime(df['date'])
DailyPV = df.groupby(['date']).size().reset_index(name = 'DailyPVcount')
DailyUV = df.groupby(['date','user_id']).size().groupby('date').size().reset_index(name = 'DailyUVcount')
# Hourly page views and the number of users who visited the website in each hour of the day
HourlyPV = df.groupby(['hour']).size().reset_index(name = 'HourlyPVcount')
HourlyUV = df.groupby(['hour','user_id']).size().groupby('hour').size().reset_index(name = 'HourlyUVcount')
# Visualize
figDailyPVcount = px.scatter(DailyPV, x = 'date', y = 'DailyPVcount', trendline = 'ols')
figDailyUVcount = px.scatter(DailyUV, x = 'date', y = 'DailyUVcount', trendline = 'ols')
figHourlyPVcount = px.scatter(HourlyPV, x = 'hour', y = 'HourlyPVcount', trendline = 'ols')
figHourlyUVcount = px.scatter(HourlyUV, x = 'hour', y = 'HourlyUVcount', trendline = 'ols')
combined_fig = make_subplots(rows=2, cols=2, subplot_titles=('Daily Page Views', 'Daily Users','Hourly Page Views', 'Hourly Users'))
# Add each Plotly Express plot as a subplot
for trace in figDailyPVcount.data:
combined_fig.add_trace(trace, row=1, col=1)
for trace in figDailyUVcount.data:
combined_fig.add_trace(trace, row=1, col=2)
for trace in figHourlyPVcount.data:
combined_fig.add_trace(trace, row=2, col=1)
for trace in figHourlyUVcount.data:
combined_fig.add_trace(trace, row=2, col=2)
combined_fig.update_layout(width=1100, title_text='Daily & hourly page views and user quantity',title_x=0.5)
# Static images are rendered here instead of interactive Plotly plots, which would make the notebook much larger.
img_bytes = combined_fig.to_image(format='png')
Image(img_bytes)
# From the plots below, we can tell:
# 1: Daily page views and daily user counts increase steadily but slowly, with a spike on 12/12 when there was a promotion. We need to attract more new daily users.
# 2: The number of users viewing the website is fairly steady from 10 am to 10 pm, and 9-10 pm is the most active period. Promotions are likely to be more effective during this window.
Out[19]:
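The unique-user counts above can also be computed more directly with nunique(); a sketch of an equivalent alternative to the nested groupby (DailyUV_alt / HourlyUV_alt are hypothetical names):

# Sketch: count distinct users per day and per hour with nunique() instead of a nested groupby.
DailyUV_alt = df.groupby('date')['user_id'].nunique().reset_index(name='DailyUVcount')
HourlyUV_alt = df.groupby('hour')['user_id'].nunique().reset_index(name='HourlyUVcount')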
In [20]:
# Overall conversion quantity and rate
fig_conversion_combined = make_subplots(rows=1, cols=2, subplot_titles=('conversion', 'conversion percentage'), column_widths=[0.5, 0.5])
behavior_list = sorted(df.groupby('behavior_type').size().to_list(),reverse=True)
# Calculate percentage and add another funnel plot with percentage
behavior_percentage = []
for i in range(len(behavior_list)-1):
behavior_percentage.append(behavior_list[i+1]/behavior_list[0])
behavior_percentage.insert(0,1)
df1 = pd.DataFrame({'y':['view','add to cart','add to favorite','purchase'], 'x':behavior_list})
df2 = pd.DataFrame({'y':['view','add to cart','add to favorite','purchase'], 'x':behavior_percentage})
plot1 = px.funnel(df1,x='x',y='y')
plot2 = px.funnel(df2,x='x',y='y')
for trace in plot1.data:
fig_conversion_combined.add_trace(trace, row=1, col=1)
for trace in plot2.data:
fig_conversion_combined.add_trace(trace, row=1, col=2)
fig_conversion_combined.update_layout(width=1100, title_text='Total conversion quantity and conversion rate',title_x=0.5)
img_bytes = fig_conversion_combined.to_image(format='png')
Image(img_bytes)
# The conversion rate is only about 1%, which is low compared to a good e-commerce conversion rate of 2.5%-3%. We need to increase it.
Out[20]:
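The ~1% figure can also be pulled out numerically; a minimal sketch using the behavior codes (1 = view, 4 = purchase):

# Sketch: overall view-to-purchase conversion rate.
counts = df['behavior_type'].value_counts()
conversion_rate = counts.loc[4] / counts.loc[1]
print(f'View-to-purchase conversion rate: {conversion_rate:.2%}')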
In [21]:
# Different conversion paths taken by users, visualized with a Sankey diagram
# Groupby user_id and behavior_type.
df_temp = df.groupby(['user_id','behavior_type'],as_index=False).size()
# Using pivot_table to reshape the DataFrame
df_pivot = df_temp.pivot_table(index='user_id', columns='behavior_type', aggfunc=len, fill_value=0)
# Resetting the index to make 'user_id' a column
df_pivot.reset_index(inplace=True)
# Rename the columns
df_pivot.columns = ['user_id','view','fav','cart','purchase']
# Build nodes and links for Sankey diagram
# As 19,999 of the 20,000 active users have at least one view, we start the diagram from View and ignore the other routes.
TotalActivity = len(df_pivot)
TotalView = df_pivot['view'].sum()
V = df_pivot['view'].apply(lambda x:x==1)
C = df_pivot['cart'].apply(lambda x:x==1)
F = df_pivot['fav'].apply(lambda x:x==1)
P = df_pivot['purchase'].apply(lambda x:x==1)
VC = len(df_pivot[V&C&~F]) # View to cart
VF = len(df_pivot[V&~C&F]) # View to favorite
VCF = len(df_pivot[V&C&F]) # View to cart & favorite
VP = len(df_pivot[V&~C&~F&P]) # View to purchase
VE = TotalView-VC-VF-VCF-VP # View to exit
VCP = len(df_pivot[V&C&~F&P]) # View to cart to purchase
VCE = len(df_pivot[V&C&~F&~P]) # View to cart to exit
VFP = len(df_pivot[V&~C&F&P]) # View to favorite to purchase
VFE = len(df_pivot[V&~C&F&~P]) # View to favorite to exit
VCFP = len(df_pivot[V&C&F&P]) # View to cart & favorite to purchase
VCFE = len(df_pivot[V&C&F&~P]) # View to cart & favorite to exit
# Get nodes and links
NodeList = ['View','Add to cart','Add to favorite','Cart & Favorite','Purchase','Exit']
NodeNumList = [0 , 1 , 2 , 3 , 4 , 5]
Nodes=dict(label=NodeList)
link_values = [VC,VF,VCF,VP,VE,VCP,VCE,VFP,VFE,VCFP,VCFE]
# Assign random link colors to make the different paths easier to distinguish
random.seed(192)
link_colors = ['#%02x%02x%02x' % (random.randint(30, 255), random.randint(30, 255), random.randint(30, 255))
for _ in range(len(link_values))]
Links = {'source':[0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3],
'target':[1, 2, 3, 4, 5, 4, 5, 4, 5, 4, 5],
'value': link_values,
'color': link_colors
}
data = go.Sankey(node=Nodes, link=Links)
fig = go.Figure(data)
fig.update_layout(width=1100, title_text='Different conversion paths by users',title_x=0.5)
img_bytes = fig.to_image(format='png')
Image(img_bytes)
Out[21]:
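A quick sanity check on the Sankey construction: the flows leaving each node should add up to the flow entering it. A small sketch using the counts computed above (not part of the original notebook):

# Sketch: verify that the outflows of each node match its inflow.
assert VC + VF + VCF + VP + VE == TotalView   # everything leaving 'View' (true by construction of VE)
assert VCP + VCE == VC                        # everything leaving 'Add to cart'
assert VFP + VFE == VF                        # everything leaving 'Add to favorite'
assert VCFP + VCFE == VCF                     # everything leaving 'Cart & Favorite'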
In [22]:
# RFM analysis. Segment customers into "almost lost, low value, high value, top customers" according to their RFM score. 'M' (monetary) is not available in this dataset, so only R and F are used.
# Recency: for each user, calculate how many days passed between their last purchase and LastDate.
LastDate = df.date.max()
RecencyDF = df[df.behavior_type==4].groupby('user_id',as_index=False).date.apply(lambda x: LastDate-x.sort_values().iloc[-1]).rename(columns={'date':'recency'})
RecencyDF.recency = RecencyDF.recency.apply(lambda x:x.days) # get timedelta column numeric part using days property
# Frequency: calculate how many purchases each user has made
FrequencyDF = df[df.behavior_type==4].groupby('user_id',as_index=False).date.count().rename(columns={'date':'frequency'})
rfm=pd.merge(RecencyDF,FrequencyDF,left_on='user_id',right_on='user_id',how='outer')
Labels = [i for i in range(1,5)]
ReverseLabels = [i for i in range(4,0,-1)]
# More recent purchases and more frequent purchases are both better, so recency quartiles are labeled 4 (most recent) down to 1 (least recent), and frequency quartiles 1 (least frequent) up to 4 (most frequent).
rfm['R_quartile']=pd.qcut(rfm.recency,4,labels=ReverseLabels)
rfm['F_quartile']=pd.qcut(rfm.frequency,4,labels=Labels)
rfm['RFM_Score'] = rfm[['R_quartile','F_quartile']].sum(axis=1)
# Assign different labels to different groups of customers.
CategoryLabels=['Almost Lost Customers', 'Low Value Customers', 'High Value Customers', 'Top Customers']
RFM_Score=pd.qcut(rfm['RFM_Score'],4,labels=CategoryLabels)
rfm['category'] = RFM_Score
# print(tabulate(rfm.head(),headers=rfm.columns, tablefmt='html', numalign="center"))
| | user_id | recency | frequency | R_quartile | F_quartile | RFM_Score | category |
|---|---|---|---|---|---|---|---|
| 0 | 492 | 4 | 5 | 3 | 2 | 5 | Low Value Customers |
| 1 | 3726 | 8 | 4 | 1 | 1 | 2 | Almost Lost Customers |
| 2 | 36465 | 23 | 1 | 1 | 1 | 2 | Almost Lost Customers |
| 3 | 37101 | 7 | 3 | 2 | 1 | 3 | Almost Lost Customers |
| 4 | 38745 | 22 | 4 | 1 | 1 | 2 | Almost Lost Customers |
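To see where the quartile boundaries fall and how the users split across the four segments, a short sketch (not part of the original notebook):

# Sketch: inspect the quartile edges behind the R/F scores and the size of each segment.
_, r_bins = pd.qcut(rfm.recency, 4, labels=ReverseLabels, retbins=True)
_, f_bins = pd.qcut(rfm.frequency, 4, labels=Labels, retbins=True)
print('Recency quartile edges:', r_bins)
print('Frequency quartile edges:', f_bins)
print(rfm['category'].value_counts())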