The dataset used here is an e-commerce dataset with about 20 million rows and 6 columns. user_id is a numeric user ID. item_id is a numeric item ID. behavior_type is a single number representing the type of user activity on the website: 1 = view, 2 = add to favorite, 3 = add to cart, 4 = purchase. user_geohash is hashed location data. item_category is a numeric code for the item category. time holds the date and hour at which each activity happened.
| | user_id | item_id | behavior_type | user_geohash | item_category | time |
|---|---|---|---|---|---|---|
| 0 | 10001082 | 285259775 | 1 | 97lk14c | 4076 | 2014-12-08 18 |
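For reference throughout the analysis, the behavior codes can be decoded with a small mapping; this is an illustrative sketch only (behavior_labels is a hypothetical helper, not part of the original notebook):

# Hypothetical helper: map the numeric behavior codes described above to readable labels.
behavior_labels = {1: 'view', 2: 'add to favorite', 3: 'add to cart', 4: 'purchase'}
# Example (once df is loaded): df['behavior_name'] = df['behavior_type'].map(behavior_labels)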
We try to figure out the following through this case study:
- Trend of daily page views and of the total number of users visiting the website
- What is the best time range in a day to run promotions?
- What is the overall conversion rate, and how does it compare to the industry standard?
- What are the users' activity paths?
- How can we segment the users?
In [15]:
# Packages import
import pandas as pd
import datetime
import kaleido
import plotly.graph_objects as go
import plotly.express as px
import random
from plotly.subplots import make_subplots
from IPython.display import Image
from tabulate import tabulate
In [16]:
# read data file
file_name = 'ECommerce.csv'
df = pd.read_csv(file_name)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# print original dataframe
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Original dataframe
| | user_id | item_id | behavior_type | user_geohash | item_category | time |
|---|---|---|---|---|---|---|
| 0 | 10001082 | 285259775 | 1 | 97lk14c | 4076 | 2014-12-08 18 |
| 1 | 10001082 | 4368907 | 1 | nan | 5503 | 2014-12-12 12 |
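A side note on loading: with roughly 20 million rows it can help to read only the columns needed and give pandas explicit dtypes. A minimal sketch, with assumed dtypes and the hypothetical name df_slim (the notebook itself simply reads the full file):

# Sketch: load only the needed columns with explicit dtypes to reduce memory usage.
# The dtype choices are assumptions, not part of the original notebook.
df_slim = pd.read_csv(file_name,
                      usecols=['user_id', 'behavior_type', 'time'],
                      dtype={'user_id': 'int64', 'behavior_type': 'int8'})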
In [17]:
# Duplicates check: since the time column only has hour-level precision, many rows are legitimate duplicates, so we skip duplicate removal.
# As we only need the user_id, behavior_type and time columns for this analysis, we drop the others.
df.drop(columns=['user_geohash','item_id','item_category'],inplace=True)
# null check. No null values found.
print('\n*** Null check ***')
print(df.isnull().sum())
# Transform the time column. As we want to do daily and hourly analysis, we split it into date and hour columns, then drop the original time column.
# The approach below was used initially. Since the dataset has more than 20 million rows, it takes about 4 minutes to finish this step.
# df['date'] = df['time'].apply(lambda x: x[:10]).apply(lambda x: datetime.datetime.strptime(str(x),"%Y-%m-%d"))
# df['hour'] = df['time'].apply(lambda x:x[11:]).apply(lambda x:int(x))
# The vectorized approach below is used instead for better performance: about 20 seconds of processing time instead of 4 minutes.
df.time = pd.to_datetime(df.time)
df['date'] = df.time.dt.date
df['hour'] = df.time.dt.hour
df.drop(columns=['time'],inplace=True)
# Outliers / abnormal data check. Use describe() to verify that behavior_type, date and hour fall within the expected ranges.
# As shown below, the data all look good because the expected ranges are: 1 <= behavior_type <= 4, 2014-11-18 <= date <= 2014-12-18, 0 <= hour <= 23.
print('\n\t\t*** Outliers check ***')
print((df[['behavior_type','date','hour']].describe(include='all',datetime_is_numeric=True).round(1)))
*** Null check ***

| column | null count |
|---|---|
| user_id | 0 |
| behavior_type | 0 |
| time | 0 |

*** Outliers check ***

| | behavior_type | date | hour |
|---|---|---|---|
| count | 23291027.0 | 23291027 | 23291027.0 |
| unique | NaN | 31 | NaN |
| top | NaN | 2014-12-12 | NaN |
| freq | NaN | 1344980 | NaN |
| mean | 1.1 | NaN | 14.9 |
| std | 0.5 | NaN | 6.4 |
| min | 1.0 | NaN | 0.0 |
| 25% | 1.0 | NaN | 11.0 |
| 50% | 1.0 | NaN | 16.0 |
| 75% | 1.0 | NaN | 20.0 |
| max | 4.0 | NaN | 23.0 |
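The same range checks can also be asserted programmatically instead of read off describe(); a minimal sketch under the ranges stated above (not part of the original notebook):

# Sketch: assert the expected value ranges directly after cleaning.
assert df['behavior_type'].between(1, 4).all()
assert df['hour'].between(0, 23).all()
assert pd.to_datetime(df['date']).between(pd.Timestamp('2014-11-18'), pd.Timestamp('2014-12-18')).all()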
In [18]:
# The cleaning and wrangling are finished. Let's check the final dataframe.
# print(tabulate(df.head(2), headers='keys', tablefmt='html', numalign="center"))
Final dataframe
| | user_id | behavior_type | date | hour |
|---|---|---|---|---|
| 0 | 10001082 | 1 | 2014-12-08 | 18 |
| 1 | 10001082 | 1 | 2014-12-12 | 12 |
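To check the memory footprint of the cleaned dataframe, a quick sketch (not part of the original notebook):

# Sketch: report the in-memory size of the cleaned dataframe.
df.info(memory_usage='deep')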
In [19]:
# Daily page views and the number of users who visited the website each day
df['date'] = pd.to_datetime(df['date'])
DailyPV = df.groupby(['date']).size().reset_index(name = 'DailyPVcount')
DailyUV = df.groupby(['date','user_id']).size().groupby('date').size().reset_index(name = 'DailyUVcount')
# Hourly page views and the number of users who visited the website in each hour of the day
HourlyPV = df.groupby(['hour']).size().reset_index(name = 'HourlyPVcount')
HourlyUV = df.groupby(['hour','user_id']).size().groupby('hour').size().reset_index(name = 'HourlyUVcount')
# Visualize
figDailyPVcount = px.scatter(DailyPV, x = 'date', y = 'DailyPVcount', trendline = 'ols')
figDailyUVcount = px.scatter(DailyUV, x = 'date', y = 'DailyUVcount', trendline = 'ols')
figHourlyPVcount = px.scatter(HourlyPV, x = 'hour', y = 'HourlyPVcount', trendline = 'ols')
figHourlyUVcount = px.scatter(HourlyUV, x = 'hour', y = 'HourlyUVcount', trendline = 'ols')
combined_fig = make_subplots(rows=2, cols=2, subplot_titles=('Daily Page Views', 'Daily Users','Hourly Page Views', 'Hourly Users'))
# Add each Plotly Express plot as a subplot
for trace in figDailyPVcount.data:
combined_fig.add_trace(trace, row=1, col=1)
for trace in figDailyUVcount.data:
combined_fig.add_trace(trace, row=1, col=2)
for trace in figHourlyPVcount.data:
combined_fig.add_trace(trace, row=2, col=1)
for trace in figHourlyUVcount.data:
combined_fig.add_trace(trace, row=2, col=2)
combined_fig.update_layout(width=1100, title_text='Daily & hourly page views and user quantity',title_x=0.5)
# Static images are rendered here instead of interactive Plotly plots, which would make the notebook much larger.
img_bytes = combined_fig.to_image(format='png')
Image(img_bytes)
# From the plots below, we can tell:
# 1: Daily page views and daily user counts increase steadily but slowly, with a spike on 12/12 when there was a promotion. We need to attract more new daily users.
# 2: The number of users viewing the website is fairly steady from 10 am to 10 pm, and 9-10 pm is the most active period. Promotions are likely to be more effective during this window.
Out[19]:
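The unique-user counts above can also be computed more directly with nunique(); a sketch of an equivalent alternative to the nested groupby (DailyUV_alt / HourlyUV_alt are hypothetical names):

# Sketch: count distinct users per day and per hour with nunique() instead of a nested groupby.
DailyUV_alt = df.groupby('date')['user_id'].nunique().reset_index(name='DailyUVcount')
HourlyUV_alt = df.groupby('hour')['user_id'].nunique().reset_index(name='HourlyUVcount')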
In [20]:
# Overall conversion quantity and rate
fig_conversion_combined = make_subplots(rows=1, cols=2, subplot_titles=('conversion', 'conversion percentage'), column_widths=[0.5, 0.5])
behavior_list = sorted(df.groupby('behavior_type').size().to_list(),reverse=True)
# Calculate percentage and add another funnel plot with percentage
behavior_percentage = []
for i in range(len(behavior_list)-1):
behavior_percentage.append(behavior_list[i+1]/behavior_list[0])
behavior_percentage.insert(0,1)
df1 = pd.DataFrame({'y':['view','add to cart','add to favorite','purchase'], 'x':behavior_list})
df2 = pd.DataFrame({'y':['view','add to cart','add to favorite','purchase'], 'x':behavior_percentage})
plot1 = px.funnel(df1,x='x',y='y')
plot2 = px.funnel(df2,x='x',y='y')
for trace in plot1.data:
fig_conversion_combined.add_trace(trace, row=1, col=1)
for trace in plot2.data:
fig_conversion_combined.add_trace(trace, row=1, col=2)
fig_conversion_combined.update_layout(width=1100, title_text='Total conversion quantity and conversion rate',title_x=0.5)
img_bytes = fig_conversion_combined.to_image(format='png')
Image(img_bytes)
# The conversion rate is only about 1%, which is low compared to a good e-commerce conversion rate of 2.5%-3%. We need to increase it.
Out[20]:
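The ~1% figure can also be pulled out numerically; a minimal sketch using the behavior codes (1 = view, 4 = purchase):

# Sketch: overall view-to-purchase conversion rate.
counts = df['behavior_type'].value_counts()
conversion_rate = counts.loc[4] / counts.loc[1]
print(f'View-to-purchase conversion rate: {conversion_rate:.2%}')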
In [21]:
# Different conversion paths taken by users, visualized with a Sankey diagram
# Groupby user_id and behavior_type.
df_temp = df.groupby(['user_id','behavior_type'],as_index=False).size()
# Using pivot_table to reshape the DataFrame
df_pivot = df_temp.pivot_table(index='user_id', columns='behavior_type', aggfunc=len, fill_value=0)
# Resetting the index to make 'user_id' a column
df_pivot.reset_index(inplace=True)
# Rename the columns
df_pivot.columns = ['user_id','view','fav','cart','purchase']
# Build nodes and links for Sankey diagram
# As 19,999 of the 20,000 active users have at least one view, we start the diagram from View and ignore the other routes.
TotalActivity = len(df_pivot)
TotalView = df_pivot['view'].sum()
V = df_pivot['view'].apply(lambda x:x==1)
C = df_pivot['cart'].apply(lambda x:x==1)
F = df_pivot['fav'].apply(lambda x:x==1)
P = df_pivot['purchase'].apply(lambda x:x==1)
VC = len(df_pivot[V&C&~F]) # View to cart
VF = len(df_pivot[V&~C&F]) # View to favorite
VCF = len(df_pivot[V&C&F]) # View to cart & favorite
VP = len(df_pivot[V&~C&~F&P]) # View to purchase
VE = TotalView-VC-VF-VCF-VP # View to exit
VCP = len(df_pivot[V&C&~F&P]) # View to cart to purchase
VCE = len(df_pivot[V&C&~F&~P]) # View to cart to exit
VFP = len(df_pivot[V&~C&F&P]) # View to favorite to purchase
VFE = len(df_pivot[V&~C&F&~P]) # View to favorite to exit
VCFP = len(df_pivot[V&C&F&P]) # View to cart & favorite to purchase
VCFE = len(df_pivot[V&C&F&~P]) # View to cart & favorite to exit
# Get nodes and links
NodeList = ['View','Add to cart','Add to favorite','Cart & Favorite','Purchase','Exit']
NodeNumList = [0 , 1 , 2 , 3 , 4 , 5]
Nodes=dict(label=NodeList)
link_values = [VC,VF,VCF,VP,VE,VCP,VCE,VFP,VFE,VCFP,VCFE]
# Assign random link colors to make the different paths easier to distinguish
random.seed(192)
link_colors = ['#%02x%02x%02x' % (random.randint(30, 255), random.randint(30, 255), random.randint(30, 255))
for _ in range(len(link_values))]
Links = {'source':[0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3],
'target':[1, 2, 3, 4, 5, 4, 5, 4, 5, 4, 5],
'value': link_values,
'color': link_colors
}
data = go.Sankey(node=Nodes, link=Links)
fig = go.Figure(data)
fig.update_layout(width=1100, title_text='Different conversion paths by users',title_x=0.5)
img_bytes = fig.to_image(format='png')
Image(img_bytes)
Out[21]:
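A quick sanity check on the Sankey construction: the flows leaving each node should add up to the flow entering it. A small sketch using the counts computed above (not part of the original notebook):

# Sketch: verify that the outflows of each node match its inflow.
assert VC + VF + VCF + VP + VE == TotalView   # everything leaving 'View' (true by construction of VE)
assert VCP + VCE == VC                        # everything leaving 'Add to cart'
assert VFP + VFE == VF                        # everything leaving 'Add to favorite'
assert VCFP + VCFE == VCF                     # everything leaving 'Cart & Favorite'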
In [22]:
# RFM analysis. Segment customers into "almost lost, low value, high value, top customers" according to their RFM score. 'M' (monetary) is not available in this dataset, so only R and F are used.
# Recency: for each user, calculate how many days passed between their last purchase and LastDate.
LastDate = df.date.max()
RecencyDF = df[df.behavior_type==4].groupby('user_id',as_index=False).date.apply(lambda x: LastDate-x.sort_values().iloc[-1]).rename(columns={'date':'recency'})
RecencyDF.recency = RecencyDF.recency.apply(lambda x:x.days) # get timedelta column numeric part using days property
# Frequency: calculate how many purchases each user has made
FrequencyDF = df[df.behavior_type==4].groupby('user_id',as_index=False).date.count().rename(columns={'date':'frequency'})
rfm=pd.merge(RecencyDF,FrequencyDF,left_on='user_id',right_on='user_id',how='outer')
Labels = [i for i in range(1,5)]
ReverseLabels = [i for i in range(4,0,-1)]
# More recent purchases and more frequent purchases are both better, so recency quartiles are labeled 4 (most recent) down to 1 (least recent), and frequency quartiles 1 (least frequent) up to 4 (most frequent).
rfm['R_quartile']=pd.qcut(rfm.recency,4,labels=ReverseLabels)
rfm['F_quartile']=pd.qcut(rfm.frequency,4,labels=Labels)
rfm['RFM_Score'] = rfm[['R_quartile','F_quartile']].sum(axis=1)
# Assign different labels to different groups of customers.
CategoryLabels=['Almost Lost Customers', 'Low Value Customers', 'High Value Customers', 'Top Customers']
RFM_Score=pd.qcut(rfm['RFM_Score'],4,labels=CategoryLabels)
rfm['category'] = RFM_Score
# print(tabulate(rfm.head(),headers=rfm.columns, tablefmt='html', numalign="center"))
| | user_id | recency | frequency | R_quartile | F_quartile | RFM_Score | category |
|---|---|---|---|---|---|---|---|
| 0 | 492 | 4 | 5 | 3 | 2 | 5 | Low Value Customers |
| 1 | 3726 | 8 | 4 | 1 | 1 | 2 | Almost Lost Customers |
| 2 | 36465 | 23 | 1 | 1 | 1 | 2 | Almost Lost Customers |
| 3 | 37101 | 7 | 3 | 2 | 1 | 3 | Almost Lost Customers |
| 4 | 38745 | 22 | 4 | 1 | 1 | 2 | Almost Lost Customers |
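To see where the quartile boundaries fall and how the users split across the four segments, a short sketch (not part of the original notebook):

# Sketch: inspect the quartile edges behind the R/F scores and the size of each segment.
_, r_bins = pd.qcut(rfm.recency, 4, labels=ReverseLabels, retbins=True)
_, f_bins = pd.qcut(rfm.frequency, 4, labels=Labels, retbins=True)
print('Recency quartile edges:', r_bins)
print('Frequency quartile edges:', f_bins)
print(rfm['category'].value_counts())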