Data Science Capstone Project: Machine Learning using Starbucks Offer data
Introduction
Starbucks sends out various promotional offers to its app users. We have a simulated dataset that mimics customer responses to those offers. In this Capstone project, we will walk through how to extract insights from the dataset.
We will go through the steps shown below to complete this project.
- We begin by defining a problem we are trying to solve.
- We then explore the given data and prepare it for machine learning (ML) algorithms.
- Then, we train various models and use the k-fold cross-validation technique to pick the best model for the problem.
- We use Grid Search to optimize the chosen model with the best performing hyperparameters and do the k-fold cross-validation again to make the model more robust.
- We use the chosen model to make predictions on our test dataset and measure the accuracy.
- We conclude the project by highlighting the feature importance of the ML model we built.
The Problem Definition
We are going to explore the Starbucks simulated marketing promotions data and customers’ responses toward the promotions. We will focus on two promotions, BOGO (Buy one and get one free) and discount offers.
We will use a binary classifier ML algorithm to predict which customers will view and complete those offers.
Performance Metric
We will use the accuracy metric to measure the performance of the classifier model we build because the distribution of the labeled data is balanced. We will discuss this in detail later.
ML model objectives and constraints
- Low latency is not required because we are working with offline data.
- Interpretability is crucial because we want to know the attributes of customers who completed offers.
- Errors are not costly.
Data Exploration
There are three datasets for this project. We read them into three DataFrames: portfolio, profile, and transcript.
1. portfolio: offer ids and metadata about each offer
2. profile: demographic data for each customer
3. transcript: a record of transactions, offers received, offers viewed, and offers completed
Data Prep
There are many categorical data columns. We need to clean and transform them into numerical data columns.
Cleaning the portfolio DataFrame
In the portfolio DataFrame, the offer_type column contains non-ordinal categorical data. We can convert it using the one-hot encoding function get_dummies(). We will get the following three new columns: bogo, discount, and informational.
pd.get_dummies(portfolio['offer_type'])
The channels column is a bit trickier because each row contains multiple values rather than a single value, so we cannot use the get_dummies() function as we did for the offer_type column. Instead, we can manually create four columns, email, mobile, social, and web, one for each possible value. If a row contains one of those words, we set the corresponding column value to 1.
portfolio_clean['email'] = portfolio_clean['channels'].astype(str).str.contains('email').astype(int)
portfolio_clean['mobile'] = portfolio_clean['channels'].astype(str).str.contains('mobile').astype(int)
portfolio_clean['social'] = portfolio_clean['channels'].astype(str).str.contains('social').astype(int)
portfolio_clean['web'] = portfolio_clean['channels'].astype(str).str.contains('web').astype(int)
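As a small illustration of the manual encoding above, the sketch below builds the four indicator columns on a toy DataFrame. The toy rows and offer ids are made up, and the sketch assumes each channels cell holds a Python list; only the column names come from the text.

```python
import pandas as pd

# Toy stand-in for the portfolio data (values are made up for illustration)
toy_portfolio = pd.DataFrame({
    'id': ['offer_a', 'offer_b', 'offer_c'],
    'channels': [['email', 'mobile', 'social'],
                 ['web', 'email'],
                 ['web', 'email', 'mobile', 'social']],
})

# One indicator column per channel: 1 if the channel appears in the row's list
for channel in ['email', 'mobile', 'social', 'web']:
    toy_portfolio[channel] = toy_portfolio['channels'].apply(
        lambda chans, c=channel: int(c in chans))

print(toy_portfolio[['id', 'email', 'mobile', 'social', 'web']])
```

Testing list membership rather than substrings avoids accidental matches if one channel name ever happens to be contained in another.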
Cleaning the transcript DataFrame
There are two categorical columns, event and value. For the event column, we can use the get_dummies() function to perform one-hot encoding.
pd.get_dummies(transcript['event'])
The value column is interesting because each row holds a dictionary. Using the apply() function, we can turn the dictionary's keys into new column names and the dictionary's values into row data. Four additional columns will be created: offer id, amount, offer_id, and reward.
transcript['value'].apply(pd.Series)
The dictionary conversion created four additional columns: offer id, amount, offer_id, and reward. However, the two columns offer id and offer_id hold the same kind of information and should be combined into one column.
# combine the offer_id and offer id columns
transcript['offer_id'] = transcript['offer_id'].fillna(transcript['offer id'])
# drop the offer id column
transcript.drop(columns=['offer id'], inplace=True)
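To make the dictionary expansion and the column merge concrete, here is a minimal sketch on a toy transcript. The rows are made up, and it builds the columns with pd.DataFrame over a list of dictionaries rather than apply(pd.Series); that substitution is an assumption chosen for speed, not the method used in this project.

```python
import pandas as pd

# Toy stand-in for the transcript data (values are made up for illustration)
toy_transcript = pd.DataFrame({
    'event': ['offer received', 'offer completed', 'transaction'],
    'value': [{'offer id': 'offer_a'},
              {'offer_id': 'offer_a', 'reward': 5},
              {'amount': 12.5}],
})

# Expand each dictionary into columns; keys missing from a row become NaN
expanded = pd.DataFrame(toy_transcript['value'].tolist(),
                        index=toy_transcript.index)

# Fill gaps in offer_id from the "offer id" column, then drop the duplicate
expanded['offer_id'] = expanded['offer_id'].fillna(expanded['offer id'])
expanded = expanded.drop(columns=['offer id'])

print(expanded)
```

The transaction row keeps NaN in offer_id, which is exactly the kind of row that gets filtered out in the next step.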
In the combined offer_id column, some rows contain NaN. We will remove those rows because they carry no information about promotional offers. We can also drop the transaction, amount, and reward columns, which are irrelevant to promotional offers.
# filter out transcript rows without an offer id
# remove columns unrelated to offers: transaction, amount, reward
transcript_clean = transcript_clean[~transcript_clean['offer_id'].isnull()].drop(columns=['transaction', 'amount','reward'])
Merge Dataframes
We can merge the cleaned transcript and portfolio DataFrames first.
#: merge transcript and portfolio
transcript_portfolio = pd.merge(transcript_clean, portfolio_clean, how='left', left_on='offer_id', right_on='id')
transcript_portfolio.drop(columns=['id'], inplace=True)
Then, we can merge again with the profile DataFrame. Now we have a single DataFrame that combines all three given datasets.
Label data, Viewed and Completed Offer
Some customers completed offers without actually viewing them first. They happened to make transactions while promotional offers were available, so we need to exclude them. We define our label data as the offers that were viewed and completed, not just completed.
To make the label data column, we will group by the person and offer_id columns and create a new DataFrame with a new column, viewed and completed, indicating customers who actually viewed an offer first and then completed it.
group_person_offer = transcript_portfolio_profile.groupby(['person', 'offer_id'], as_index=False).sum()[['person', 'offer_id', 'offer completed', 'offer received', 'offer viewed']]
# create extra column to compute viewed and completed count
group_person_offer['completed-viewed'] = group_person_offer['offer completed'].astype(int) - group_person_offer['offer viewed'].astype(int)
# new column viewed and completed
group_person_offer['viewed and completed'] = group_person_offer['completed-viewed'].map(lambda x: 1 if x == 0 else 0)
group_person_offer['viewed and completed'] = group_person_offer['viewed and completed'].astype(int) * group_person_offer['offer completed'].astype(int)
# drop the extra column
group_person_offer.drop(columns=['completed-viewed'], inplace=True)
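The labeling rule can be checked on a toy table covering the four possible cases. This is a minimal sketch with made-up counts; the column names mirror the ones used above.

```python
import pandas as pd

# One row per (person, offer) pair with made-up event counts
toy_counts = pd.DataFrame({
    'person': ['p1', 'p2', 'p3', 'p4'],
    'offer completed': [1, 1, 0, 0],
    'offer viewed': [1, 0, 1, 0],
})

# 1 when the completed and viewed counts match, 0 otherwise
counts_match = (toy_counts['offer completed'] - toy_counts['offer viewed']) == 0

# Multiplying by the completed count zeroes out pairs that were never completed
toy_counts['viewed and completed'] = (
    counts_match.astype(int) * toy_counts['offer completed'])

print(toy_counts)
```

Only p1, who both viewed and completed the offer, ends up with a positive label. Note that the rule compares event counts only; it does not check that the view happened before the completion.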
Let's do some analysis to find the completion rates for BOGO and discount offers.
# viewed and completed offers
viewed_completed_offers = group_person_offer.groupby(['offer_type'])[['offer completed', 'offer received', 'offer viewed', 'viewed and completed']].sum()
viewed_completed_offers['viewed and completed rate'] = viewed_completed_offers['viewed and completed']/viewed_completed_offers['offer received']
viewed_completed_offers['completed rate'] = viewed_completed_offers['offer completed']/viewed_completed_offers['offer received']
viewed_completed_offers['completed without viewing rate'] = viewed_completed_offers['completed rate'] - viewed_completed_offers['viewed and completed rate']
We found that 45% of BOGO offers and 48% of discount offers were viewed and completed. Discount offers have a slightly higher chance of completion than BOGO offers.
We can also see that 12% of BOGO offers and 16% of discount offers were completed without being viewed. These are the customers Starbucks doesn't need to send offers to, because they would have purchased whether or not they received an offer.
Prepare data for ML models
We can take the group_person_offer DataFrame created in the previous step and merge it with the previously cleaned DataFrames, profile and portfolio, to prepare a dataset called combined_data. This is the DataFrame we can use to build ML models because it has the label data column, viewed and completed, and other columns for feature data.
# merge all the datasets to have all of the features
combined_data = pd.merge(group_person_offer, profile, how='left', left_on='person', right_on='id')
combined_data = pd.merge(combined_data, portfolio_clean, how='left', left_on='offer_id', right_on='id')
combined_data.drop(columns=['id_x', 'id_y', 'offer_type_y'], inplace=True)
combined_data.rename(columns={'offer_type_x': 'offer_type'}, inplace=True)
Prepare data for BOGO ML models
There are two types of offers, BOGO and discount, that we want to study in this project. Therefore, we will create two datasets, combined_data4Bogo and combined_data4Discount.
# separate BOGO and Discount data
# take a copy so the in-place edits below do not act on a view of combined_data
combined_data4Bogo = combined_data[combined_data['bogo']==1].copy()
combined_data4Bogo.drop(columns=['bogo', 'discount', 'informational', 'person', 'offer_id', 'offer completed', 'offer received', 'offer viewed', 'offer_type'], inplace=True)
BOGO Label Data
We change the value in the viewed and completed column to either 1 or 0. It is the column we will use as label data for our binary classification model.
#: change the label (y variable) to 1 or 0
combined_data4Bogo['viewed and completed'] = np.where(combined_data4Bogo['viewed and completed']>0, 1,0)
Feature Engineering for BOGO feature data
Categorical data, ‘Age’.
In the distribution of the age column, ages are concentrated between 40 and 75. We can encode this column into the following five sub-categories and create a new column, age_cat, using the pd.cut() function. It is an ordinal encoding technique because a higher group number implies a higher age group.
- Less than 20 => 1
- Between 20 and 40 => 2
- Between 40 and 60 => 3
- Between 60 and 80 => 4
- 80+ => 5
combined_data4Bogo['age_cat'] = pd.cut(combined_data4Bogo['age'],
                                       bins=[0,20,40,60,80,np.inf],
                                       labels=[1,2,3,4,5])  # '<20','20-40','40-60','60-80','80+'
combined_data4Bogo['age_cat'].hist()
# remove the age column
combined_data4Bogo.drop(columns=['age'], inplace=True)
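To see how pd.cut() assigns the groups, the following sketch bins a handful of made-up ages with the same bin edges. One detail worth knowing: the bins are right-inclusive by default, so an age of exactly 20 lands in group 1 rather than group 2.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one that sits exactly on a bin edge
toy_ages = pd.Series([18, 20, 35, 55, 70, 95])

# Same edges and labels as the age_cat encoding; intervals are (0, 20], (20, 40], ...
age_cat = pd.cut(toy_ages, bins=[0, 20, 40, 60, 80, np.inf],
                 labels=[1, 2, 3, 4, 5])

print(age_cat.tolist())  # [1, 1, 2, 3, 4, 5]
```

Passing right=False to pd.cut() would switch to left-inclusive intervals if the "less than 20" wording should be taken literally.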
To preserve the above distribution, we will use the stratified sampling technique to separate out test and training data sets later.
Categorical data, ‘Income’.
Similar to the age column, the income column distribution is concentrated between 50k and 75k. We can also see a hard cap on income at 120k. We can create a new ordinal encoded column, income_cat, with the following four sub-categories, as we did for the age column.
1. Less than 45K => 1
2. Between 45K and 75K => 2
3. Between 75K and 100K => 3
4. 100K+ => 4
combined_data4Bogo['income_cat'] = pd.cut(combined_data4Bogo['income'],
                                          bins=[0,45000,75000,100000,np.inf],
                                          labels=[1,2,3,4])  # 'less45k','45k-75k','75k-100k','100k+'
combined_data4Bogo['income_cat'].hist()
# remove the income column
combined_data4Bogo.drop(columns=['income'], inplace=True)
Categorical data, ‘Gender’
We can use the one-hot encoding method for the gender column.
# use one-hot encoding
gender_data = pd.get_dummies(combined_data4Bogo['gender'], prefix='gender')
combined_data4Bogo.drop(columns=['gender'], inplace=True)
combined_data4Bogo = pd.concat([combined_data4Bogo, gender_data], axis=1)
Remove additional feature data columns for BOGO
Create the histogram of the BOGO dataset.
combined_data4Bogo.hist(figsize=(15,10))
plt.show()
We can remove the email and mobile columns because each holds a single value for the entire column and will not add any value to an ML model. We can remove the become_member_on column as well.
combined_data4Bogo.drop(columns=['email','mobile','become_member_on'], inplace=True)
Other categorical data set for BOGO
We can use the one-hot encoding method for the other categorical data columns.
# other categorical data
duration_data = pd.get_dummies(combined_data4Bogo['duration'], prefix='duration')
difficulty_data = pd.get_dummies(combined_data4Bogo['difficulty'], prefix='difficulty')
reward_data = pd.get_dummies(combined_data4Bogo['reward'], prefix='reward')
social_data = pd.get_dummies(combined_data4Bogo['social'], prefix='social')
combined_data4Bogo = pd.concat([combined_data4Bogo, duration_data], axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, difficulty_data], axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, reward_data], axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, social_data], axis='columns')
combined_data4Bogo.drop(['duration','difficulty','reward','social'], axis='columns', inplace=True)
Build Models using the BOGO data
Split Train and Test data using Stratified Sampling
We will split the data into Train and Test datasets first. We use the stratify option to keep a similar distribution for those encoded columns.
train, test = train_test_split(combined_data4Bogo, test_size=0.2, random_state=0, stratify =combined_data4Bogo[['income_cat', 'age_cat','gender_F','gender_M']])
We can check the distribution of those columns before and after the split.
test["income_cat"].value_counts() / len(test)
2 0.500230
3 0.223116
1 0.207491
4 0.069164
combined_data4Bogo["income_cat"].value_counts() / len(combined_data4Bogo)
2 0.499885
3 0.223034
1 0.207638
4 0.069443
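The effect of the stratify option is easy to reproduce on a small made-up dataset. The sketch below stratifies on a single toy income_cat column (the project stratifies on several columns at once) and compares the category shares before and after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20% in category 1, 50% in category 2, 30% in category 3
toy = pd.DataFrame({
    'income_cat': [1] * 20 + [2] * 50 + [3] * 30,
    'feature': range(100),
})

# Stratifying keeps each category's share the same in both splits
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['income_cat'])

print(toy['income_cat'].value_counts(normalize=True).sort_index())
print(test_toy['income_cat'].value_counts(normalize=True).sort_index())
```

With these round numbers the 20-row test set contains exactly 4, 10, and 6 rows of the three categories, matching the 20/50/30 shares of the full data.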
We will change the ordinal encoded columns, age_cat and income_cat, to one-hot encoded columns to have a more meaningful feature importance study for the ML model. We could have applied one-hot encoding in the earlier step instead of ordinal encoding, but then the stratify option would have become more complicated when splitting the data into the training and test datasets.
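The conversion described above is not shown in the write-up, so the following is only a sketch of one way it could be done, using pd.get_dummies() with its columns argument on made-up incomes. The resulting column names follow the income_cat_1 pattern referred to later in the feature importance discussion.

```python
import numpy as np
import pandas as pd

# Made-up incomes, one per income category
toy_features = pd.DataFrame({'income': [30000, 60000, 90000, 110000]})
toy_features['income_cat'] = pd.cut(
    toy_features['income'],
    bins=[0, 45000, 75000, 100000, np.inf],
    labels=[1, 2, 3, 4])

# Replace the ordinal column with one indicator column per category
toy_encoded = pd.get_dummies(toy_features, columns=['income_cat'],
                             prefix='income_cat')

print(toy_encoded.columns.tolist())
```

The same call would be applied to both the training and test sets so that they end up with identical feature columns.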
Separate the label and feature data for the Train and Test datasets.
y_train = train.iloc[:,0]
y_test = test.iloc[:,0]
X_train = train.iloc[:,1:]
X_test = test.iloc[:,1:]
Check the distribution of label data
The Train label data seem balanced; therefore, we can use accuracy as the performance metric for the classifier we will build.
y_train.hist()
K-fold cross-validation to pick classifier models
Use the following classification models.
models = []
#: Decision Tree model
models.append(('CART', DecisionTreeClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('ET', ExtraTreesClassifier()))
Then, use K-fold cross-validation to pick out the best performing classifier.
num_folds = 4
seed = 7
scoring = 'accuracy'
results = []
names = []
for name, model in models:
    #kfold = KFold(n_splits=num_folds, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=5, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
CART: 0.638021 (0.006344)
LR: 0.623427 (0.001539)
KNN: 0.571838 (0.010391)
AB: 0.622451 (0.001639)
GBM: 0.634401 (0.006262)
RF: 0.637618 (0.004874)
ET: 0.638021 (0.006344)
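For readers who want to control the folds explicitly, here is a minimal sketch of the same comparison loop with a seeded splitter, run on synthetic data from make_classification rather than the Starbucks data. It uses StratifiedKFold with shuffle=True, since in recent scikit-learn versions random_state is only accepted when shuffling is enabled; the two models and all parameter values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# Seeded, shuffled folds so every model is scored on the same splits
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

demo_models = [('CART', DecisionTreeClassifier(random_state=7)),
               ('LR', LogisticRegression(max_iter=1000))]

demo_scores = {}
for name, model in demo_models:
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring='accuracy')
    demo_scores[name] = scores
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```

Reusing one splitter object for every model keeps the comparison fair, because differences in the mean score then come from the models and not from different fold assignments.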
Refinement
Although most models, including Gradient Boosting (GBM) and Random Forest (RF), performed similarly, we choose the Decision Tree (CART) as the classifier for this problem.
The CART model is simpler than the ensemble models. It can learn nonlinear relationships between the variables, and it provides feature importances.
Let's use the Grid Search technique to tune the model and find the best hyperparameters for predicting the test dataset.
For the Decision Tree algorithm, there are the following hyperparameters we can tune:
class_weight
criterion
max_depth
max_features
max_leaf_nodes
min_impurity_decrease
min_impurity_split
min_samples_leaf
min_samples_split
min_weight_fraction_leaf
presort
random_state
splitter
We will tune the criterion, max_depth, min_samples_leaf, and min_samples_split hyperparameters to find the best performing Decision Tree classifier model.
param_dict = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 20),
    'min_samples_split': range(2, 10),
    'min_samples_leaf': range(1, 5)
}
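The write-up does not show how the search itself was run, so the sketch below is an assumption: it uses scikit-learn's GridSearchCV on synthetic data with a much smaller grid so that it finishes quickly. It also counts the candidates in the full grid above, which comes to 2 * 19 * 8 * 4 = 1216 combinations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# The full grid from the text, used here only to count the candidates
param_dict = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 20),
    'min_samples_split': range(2, 10),
    'min_samples_leaf': range(1, 5),
}
n_candidates = len(ParameterGrid(param_dict))
print(n_candidates)  # 1216

# Smaller illustrative grid and synthetic data so the sketch runs quickly
small_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid,
                    cv=5, scoring='accuracy')
grid_result = grid.fit(X_demo, y_demo)  # fit() returns the fitted search object

print(grid_result.best_params_, grid_result.best_score_)
```

On the real data, the full param_dict and X_train, y_train would take the place of small_grid and the synthetic arrays, with the cv argument set to the desired number of folds.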
Model evaluation and validation
After using Grid Search, we found that the following hyperparameters give the best result for the training dataset.
grid_result.best_params_
{'criterion': 'gini',
 'max_depth': 10,
 'min_samples_leaf': 1,
 'min_samples_split': 2}
The best performing hyperparameter values for criterion, max_depth, min_samples_leaf, and min_samples_split are gini, 10, 1, and 2, respectively.
Justification
grid_result.best_score_
0.637
The accuracy of the default Decision Tree was 0.638. After Grid Search it was 0.637, slightly lower but very close. The 0.637 value is the score of the best performing model, chosen after fitting 10 folds for each of 1216 candidates in the Grid Search. The model therefore seems robust, with very consistent results.
Predict using the Test dataset
# build the model with the best hyperparameters chosen
model = DecisionTreeClassifier(criterion='gini', max_depth=10, min_samples_leaf=1, min_samples_split=2)
# fit the model to the training data
model.fit(X_train, y_train)
# make predictions on the test data
y_pred = model.predict(X_test)
# check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.6438
Conclusion for the BOGO model
Reflection
The Decision Tree classification model's accuracy on the test data is 64%, which is not that great. However, the result is better than a random model with a baseline accuracy of 50%. The similar accuracy on the training and test data indicates that the model is not overfitted. The model also satisfies the objectives we defined at the beginning: low latency is not required, and the model is interpretable.
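The comparison against a baseline and the train-versus-test check can both be made explicit. The sketch below is illustrative only: it runs on synthetic data, and the use of DummyClassifier with the most_frequent strategy as the baseline is an assumption, not something taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly balanced stand-in for the BOGO data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))

# A large gap between train_acc and test_acc would point to overfitting
print(baseline_acc, train_acc, test_acc)
```

With balanced labels the baseline sits near 50%, which is the reference point used in the paragraph above.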
Let's interpret the model through its feature importances.
feature_importance = pd.Series(data=model.feature_importances_, index=X_train.columns)
feature_importance.sort_values(ascending=False).head(15).plot(kind='bar')
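To build some intuition for what feature_importances_ reports, here is a minimal sketch on made-up data in which the label is a copy of one column. The column names signal, noise_a, and noise_b are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.integers(0, 2, size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
# The label copies the "signal" column, so that feature should dominate
y_demo = X_demo['signal']

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, as done for the BOGO model
importance = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importance.sort_values(ascending=False))
```

The importances sum to 1 and measure how much each feature contributes to the tree's splits. They do not show the direction of an effect, so a high importance for a feature such as gender_F says that the feature is informative, not by itself which way it pushes the prediction.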
In the income category, the most important feature was Less than 45K (income_cat_1), which is the lowest income category available. The second most important feature for BOGO was gender_F, indicating female customers.
In conclusion, we built a classification model to predict which customers would complete BOGO offers if they received them. With 64% accuracy, the model suggests that BOGO offers are more likely to be completed by female customers with relatively lower incomes.
Build Models using the discount data
Following steps similar to those for BOGO, we can build a classifier for the discount offer data.
As with the BOGO data, we picked the Decision Tree (CART) classifier and chose the best performing hyperparameters using Grid Search and K-fold cross-validation to predict the test dataset for the discount offers.
# pick the best model
model = DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_leaf=4, min_samples_split=2)
# fit the model to the training data
model.fit(X_train, y_train)
# make predictions on the test data
y_pred = model.predict(X_test)
# check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.700
Conclusion for the discount model
Reflection
The Decision Tree classification model’s accuracy using the test data is 70% which is better than the BOGO data result.
Let’s examine the feature’s importance for the classification.
The social_1 feature has the highest importance. It indicates that discount offers delivered through social media have a good chance of being completed. In other words, discount promotional offers through social media platforms are more effective than those via non-social media platforms. The second most important feature for discount offers was Less than 45K (income_cat_1). As with BOGO offers, income level is one of the key features for discount offers.
Improvement
We'd like to improve the model's accuracy by engineering better features. We could also collect more customer attributes to give the model more feature variables.
Another improvement would be splitting the Train/Test data based on time-series features. If we had a more evenly distributed time-based feature, such as become_member_on, we could use the time-based splitting technique. However, that column is not evenly distributed, so we ended up dropping it.