Do you prefer BOGO or a discount offer on your Starbucks app?

J Song
Apr 4, 2021

Data Science Capstone Project: Machine Learning using Starbucks Offer data

Introduction

Starbucks sends out various promotional offers to its app users. We have simulated data that mimics customer responses to those offers. As a Capstone project, we will walk through how to extract insights from the dataset.

We will go through the steps shown below to complete this project.

  • We begin by defining the problem we are trying to solve.
  • We then explore the given data and prepare it for machine learning (ML) algorithms.
  • Next, we train various models and use the k-fold cross-validation technique to pick the best model for the problem.
  • We use Grid Search to optimize the chosen model with the best-performing hyperparameters and run k-fold cross-validation again to make the model more robust.
  • We use the chosen model to predict on our test dataset and measure the accuracy.
  • We conclude the project by highlighting the feature importance of the ML model we built.

The Problem Definition

We are going to explore the Starbucks simulated marketing promotion data and customers’ responses to those promotions. We will focus on two offer types: BOGO (buy one, get one free) and discount offers.

We will use a binary classifier ML algorithm to predict which customers will view and complete those offers.

Performance Metric

We will use the accuracy metric to measure the performance of the classifier model we build because the distribution of the labeled data is balanced. We will discuss this in detail later.

ML model objectives and constraints

  • Low latency is not required since we are dealing with offline data.
  • Interpretability is crucial because we want to know the attributes of customers who completed offers.
  • Errors are not costly.

Data Exploration

There are three datasets for this project. We read them into three DataFrames: portfolio, profile, and transcript (a loading sketch follows the list).

  1. portfolio: offer ids and metadata about each offer

portfolio DataFrame

  2. profile: demographic data for each customer

profile DataFrame

  3. transcript: a record of transactions, offers received, offers viewed, and offers completed

transcript DataFrame
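As a reference, the three datasets could be loaded as follows. This is a minimal sketch assuming the data ships as line-delimited JSON files; the file paths below are placeholders.

import pandas as pd

#: assumed file locations for the simulated Starbucks data
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)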

Data Prep

There are many categorical data columns. We need to clean and transform them into numerical data columns.

Cleaning the portfolio DataFrame

In the portfolio DataFrame, the offer_type column contains non-ordinal categorical data. We can convert it with the one-hot encoding function get_dummies(), which gives us three new columns: bogo, discount, and informational.

pd.get_dummies(portfolio['offer_type'])
One-Hot encoding of ‘offer_type’

The channels column is a bit trickier because each row contains multiple values rather than a single value, so we cannot use the get_dummies() function as we did for the offer_type column. Instead, we can create four columns manually to represent the possible values: email, mobile, social, and web. If a row contains any of those words, we set the corresponding column value to 1.

#: portfolio_clean starts as a copy of portfolio (assumed)
portfolio_clean = portfolio.copy()
portfolio_clean['email'] = portfolio_clean['channels'].astype(str).str.contains('email').astype(int)
portfolio_clean['mobile'] = portfolio_clean['channels'].astype(str).str.contains('mobile').astype(int)
portfolio_clean['social'] = portfolio_clean['channels'].astype(str).str.contains('social').astype(int)
portfolio_clean['web'] = portfolio_clean['channels'].astype(str).str.contains('web').astype(int)
portfolio_clean DataFrame after the ‘channels’ column encoding

Cleaning the transcript DataFrame

There are two categorical columns, event and value. For the event column, we can use the get_dummies() function to perform one-hot encoding.

pd.get_dummies(transcript['event'])
One Hot encoding of transcript[‘event’]

The value column is interesting because each row holds a dictionary. We can turn the dictionary keys into new column names and the dictionary values into row data using the apply() function. Four additional columns will be created: offer id, amount, offer_id, and reward.

transcript['value'].apply(pd.Series)
transcript Dataframe after the ‘value’ column encoding

As shown above, the dictionary conversion created four additional columns: offer id, amount, offer_id, and reward. However, the two columns offer id and offer_id should be combined into one column.

#: combine offer_id and offer id columns
transcript['offer_id'].fillna(transcript['offer id'], inplace=True)
#: drop the offer id column
transcript.drop(columns=['offer id'], inplace=True)
transcript Dataframe

In the combined offer_id column, some rows contain NaN. We will remove those rows because they don’t carry any information about promotional offers. We can also drop the transaction, amount, and reward columns, which are irrelevant to promotional offers.

#: Filter out rows without an offer id
#: Remove columns not related to offers: transaction, amount, reward
transcript_clean = transcript_clean[~transcript_clean['offer_id'].isnull()].drop(columns=['transaction', 'amount','reward'])
transcript_clean Dataframe

Merge Dataframes

We can merge the cleaned transcript and portfolio DataFrames first.

#: merge transcript and portfolio 
transcript_portfolio = pd.merge(transcript_clean, portfolio_clean, how='left', left_on='offer_id', right_on='id')
transcript_portfolio.drop(columns=['id'], inplace=True)
transcript and portfolio DataFrames merged

Then, we can merge again with the profile DataFrame, as sketched below. Now we have a DataFrame that combines all three given datasets.
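This merge is not shown in the original snippet; a minimal sketch, assuming the customer id column in the transcript data is person and the result is named transcript_portfolio_profile (the name used in the grouping step later), could look like this:

#: merge with the customer profile data (assumed sketch)
transcript_portfolio_profile = pd.merge(transcript_portfolio, profile, how='left', left_on='person', right_on='id')
transcript_portfolio_profile.drop(columns=['id'], inplace=True)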

profile, transcript, and portfolio DataFrames merged

Label Data: Viewed and Completed Offers

Some customers completed offers without actually viewing them first; they happened to make transactions while promotional offers were active. We need to exclude them, so we define our label as offers that were viewed and then completed, not just completed.

To make the label column, we group by the person and offer_id columns and create a new DataFrame with a new column, viewed and completed, indicating customers who actually viewed an offer first and then completed it.

group_person_offer = transcript_portfolio_profile.groupby(['person', 'offer_id'], as_index=False).sum()[['person', 'offer_id', 'offer completed', 'offer received', 'offer viewed']]
#: create an extra column comparing completed vs. viewed counts
group_person_offer['completed-viewed'] = group_person_offer['offer completed'].astype(int) - group_person_offer['offer viewed'].astype(int)
#: new column: viewed and completed
group_person_offer['viewed and completed'] = group_person_offer['completed-viewed'].map(lambda x: 1 if x==0 else 0)
group_person_offer['viewed and completed'] = group_person_offer['viewed and completed'].astype(int) * group_person_offer['offer completed'].astype(int)
#: drop the extra column
group_person_offer.drop(columns=['completed-viewed'], inplace=True)
group_person_offer Dataframe

Let’s do some analysis to find out the completion rates for BOGO and discount offers.

#: viewed and completed offers per offer type
viewed_completed_offers = group_person_offer.groupby(['offer_type'])['offer completed','offer received','offer viewed','viewed and completed'].sum()
viewed_completed_offers['viewed and completed rate'] = viewed_completed_offers['viewed and completed']/viewed_completed_offers['offer received']
viewed_completed_offers['completed rate'] = viewed_completed_offers['offer completed']/viewed_completed_offers['offer received']
viewed_completed_offers['completed without viewing rate'] = viewed_completed_offers['completed rate'] - viewed_completed_offers['viewed and completed rate']
viewed_completed_offers
offer completion rate for BOGO vs discount offers

We found that BOGO offers have a 45% viewed-and-completed rate, while discount offers have 48%. Discount offers have a slightly higher chance of completion than BOGO offers.

We can also see that 12% (BOGO) and 16% (discount) of customers completed offers without viewing them. These are the customers Starbucks doesn’t need to send offers to, because they would have made purchases whether or not they received an offer.

Prepare data for ML models

We can take the group_person_offer DataFrame created in the previous step and merge it with the previously cleaned profile and portfolio DataFrames to prepare a dataset called combined_data. This is the DataFrame we will use to build ML models, because it has the label column viewed and completed along with the feature columns.

#: merge all the data set to have all of features 
combined_data = pd.merge(group_person_offer, profile, how='left', left_on = 'person', right_on='id')
combined_data = pd.merge(combined_data, portfolio_clean, how='left', left_on='offer_id', right_on='id')
combined_data.drop(columns=['id_x', 'id_y', 'offer_type_y'], inplace=True)
combined_data.rename(columns={'offer_type_x': 'offer_type'}, inplace=True)

Prepare data for BOGO ML models

There are two types of offers, BOGO and discount, that we want to study in this project. Therefore, we will create two datasets, combined_data4Bogo and combined_data4Discount.

#: separate BOGO and Discount data
combined_data4Bogo = combined_data[combined_data['bogo']==1]
combined_data4Bogo.drop(columns=['bogo', 'discount', 'informational', 'person', 'offer_id', 'offer completed', 'offer received', 'offer viewed', 'offer_type'], inplace=True)
combined_data4Bogo DataFrame

BOGO Label Data

We change the value in the viewed and completed column to either 1 or 0. It is the column we will use as label data for our binary classification model.

#: change the label (y variable) to 1 or 0
combined_data4Bogo['viewed and completed'] = np.where(combined_data4Bogo['viewed and completed']>0, 1,0)

Feature Engineering for BOGO feature data

Categorical data, ‘Age’.

combined_data4Bogo['age'] distribution

The age column has the distribution shown above, concentrated between 40 and 75. We can encode this column into the following five sub-categories and create a new column, age_cat, using the pd.cut() function. This is an ordinal encoding technique because a higher group number implies an older age group.

  1. Less than 20 => 1
  2. Between 20 and 40 => 2
  3. Between 40 and 60 => 3
  4. Between 60 and 80 => 4
  5. 80+ => 5
combined_data4Bogo['age_cat'] = pd.cut(combined_data4Bogo['age'],
                                        bins=[0,20,40,60,80,np.inf],
                                        labels=[1,2,3,4,5])
#: 'less than 20','20-40','40-60','60-80','80+'
combined_data4Bogo['age_cat'].hist()
#: remove the age column
combined_data4Bogo.drop(columns=['age'], inplace=True)
age_cat column

To preserve the above distribution, we will use the stratified sampling technique to separate out test and training data sets later.

Categorical data, ‘Income’.

combined_data4Bogo['income'] distribution

Similar to the age column, the income distribution is concentrated between 50k and 75k. We can also see that there is a hard cap at 120k. We can create a new ordinal-encoded column, income_cat, with the following four sub-categories, as we did for the age column.

  1. Less than 45K => 1
  2. Between 45K and 75K => 2
  3. Between 75K and 100K => 3
  4. 100K+ => 4

combined_data4Bogo['income_cat'] = pd.cut(combined_data4Bogo['income'],
                                           bins=[0,45000,75000,100000,np.inf],
                                           labels=[1,2,3,4])
#: 'less than 45k','45k-75k','75k-100k','100k+'
combined_data4Bogo['income_cat'].hist()
#: remove the income column
combined_data4Bogo.drop(columns=['income'], inplace=True)
income_cat distribution

Categorical data, ‘Gender’

We can use one-hot encoding for the gender column.

#: use one hot encoding
gender_data = pd.get_dummies(combined_data4Bogo['gender'], prefix='gender')
combined_data4Bogo.drop(columns=['gender'], inplace=True)
combined_data4Bogo = pd.concat([combined_data4Bogo, gender_data], axis=1)
gender_data DataFrame

Remove additional feature data columns for BOGO

Create the histogram of the BOGO dataset.

combined_data4Bogo.hist(figsize=(15,10))
plt.show()
combined_data4Bogo DataFrame histogram

We can remove the email and mobile columns because each has a single value for the entire column and will not add any value to an ML model. We can remove the become_member_on column as well.

combined_data4Bogo.drop(columns=['email','mobile','become_member_on'], inplace=True)

Other categorical data columns for BOGO

We can use the one-hot encoding method for the other categorical data columns.

#: other categorical data
duration_data = pd.get_dummies(combined_data4Bogo['duration'], prefix='duration')
difficulty_data = pd.get_dummies(combined_data4Bogo['difficulty'], prefix='difficulty')
reward_data = pd.get_dummies(combined_data4Bogo['reward'], prefix='reward')
social_data = pd.get_dummies(combined_data4Bogo['social'], prefix='social')
combined_data4Bogo = pd.concat([combined_data4Bogo, duration_data],axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, difficulty_data],axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, reward_data],axis='columns')
combined_data4Bogo = pd.concat([combined_data4Bogo, social_data],axis='columns')
combined_data4Bogo.drop(['duration','difficulty','reward','social'], axis='columns', inplace=True)

Build Models using the BOGO data

Split Train and Test data using Stratified Sampling

We will split the training and test datasets first. We use the stratify option to keep a similar distribution of the encoded columns in both sets.

from sklearn.model_selection import train_test_split
train, test = train_test_split(combined_data4Bogo, test_size=0.2, random_state=0, stratify=combined_data4Bogo[['income_cat', 'age_cat','gender_F','gender_M']])

We can check the distribution of those columns before and after the split.

test["income_cat"].value_counts() / len(test)2    0.500230
3 0.223116
1 0.207491
4 0.069164
combined_data4Bogo["income_cat"].value_counts() / len(combined_data4Bogo)2 0.499885
3 0.223034
1 0.207638
4 0.069443

We will now change the ordinal-encoded columns age_cat and income_cat to one-hot encoded columns so that the feature importance study of the ML model is more meaningful (a sketch is shown below). Instead of doing ordinal encoding in the earlier step, we could have done one-hot encoding first, but then the stratify option would have become more complicated when splitting the data into training and test datasets.
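This conversion is not shown in the original snippet; a minimal sketch, assuming it is applied to the train and test DataFrames after the split, could look like this:

#: one-hot encode the ordinal categories (assumed sketch); the label column stays in place
train = pd.get_dummies(train, columns=['age_cat', 'income_cat'])
test = pd.get_dummies(test, columns=['age_cat', 'income_cat'])

This produces columns such as age_cat_1 and income_cat_1, which appear later in the feature importance plots.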

Separate the label and feature data for Train and Test data sets.

y_train = train.iloc[:,0]
y_test = test.iloc[:,0]
X_train = train.iloc[:,1:]
X_test = test.iloc[:,1:]

Check the distribution of label data

The training label data seems balanced; therefore, we can use accuracy as the performance metric for the classifier we will build.

y_train.hist()
y_train data set distribution

K-fold cross-validation to pick classifier models

Use the following classification models.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier

models = []
#: candidate classifiers
models.append(('CART', DecisionTreeClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('AB', AdaBoostClassifier()))
models.append(('GBM', GradientBoostingClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('ET', ExtraTreesClassifier()))

Then, use K-fold cross-validation to pick out the best performing classifier.

from sklearn.model_selection import cross_val_score

num_folds = 4
seed = 7
scoring = 'accuracy'
results = []
names = []
for name, model in models:
    #kfold = KFold(n_splits=num_folds, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=5, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
CART: 0.638021 (0.006344)
LR: 0.623427 (0.001539)
KNN: 0.571838 (0.010391)
AB: 0.622451 (0.001639)
GBM: 0.634401 (0.006262)
RF: 0.637618 (0.004874)
ET: 0.638021 (0.006344)
various classifier algorithms for the BOGO data

Refinement

Although most models, including Gradient Boosting (GBM) and Random Forest (RF), performed similarly, we can choose the Decision Tree (CART) as the classifier for this problem. The CART model is simpler than the ensemble models, it can learn nonlinear relationships between the variables, and it provides feature importance.

Let’s use the Grid Search technique to tune the hyperparameters and find the best-performing values for predicting on the test dataset.

For the Decision Tree algorithm, the following hyperparameters can be tuned:

class_weight
criterion
max_depth
max_features
max_leaf_nodes
min_impurity_decrease
min_impurity_split
min_samples_leaf
min_samples_split
min_weight_fraction_leaf
presort
random_state
splitter

We will tune the criterion, max_depth, min_samples_leaf, and min_samples_split hyperparameters to find the best-performing Decision Tree classifier, as sketched after the parameter grid below.

param_dict = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 20),
    'min_samples_split': range(2, 10),
    'min_samples_leaf': range(1, 5)
}
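The grid search call itself is not shown in the original snippet; a minimal sketch, assuming 10-fold cross-validation (consistent with the 1,216 candidates and 10 folds mentioned in the justification below), could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

#: exhaustive search over param_dict with 10-fold cross-validation (assumed setup)
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid=param_dict, cv=10, scoring='accuracy')
grid_result = grid_search.fit(X_train, y_train)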

Model evaluation and validation

After using Grid Search, we found that the following hyperparameters give the best result on the training dataset.

grid_result.best_params_
{'criterion': 'gini',
 'max_depth': 10,
 'min_samples_leaf': 1,
 'min_samples_split': 2}

The best-performing values for criterion, max_depth, min_samples_leaf, and min_samples_split are gini, 10, 1, and 2, respectively.

Justification

grid_result.best_score_
0.637

The accuracy metric using the default Decision Tree was 0.638. The metric after Grid Search was 0.637, which is slightly lower but very close. The 0.637 value was computed with the best-performing model, which was chosen after fitting 10 folds for each of 1,216 candidates during the Grid Search. That said, the model seems quite robust, displaying very consistent results.

Predict using the Test dataset

#: build the model with best hyperparameters chosen
model = DecisionTreeClassifier(criterion='gini', max_depth=10, min_samples_leaf=1, min_samples_split=2)
#: fit the model to the training data
model.fit(X_train, y_train)
#: make predictions on the test data
y_pred = model.predict(X_test)
#: Check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.6438

Conclusion for the BOGO model

Reflection

The Decision Tree classification model’s accuracy on the test data is 64%, which is not great. However, the result is better than a random model with a baseline accuracy of 50%. The similar accuracy metrics between the training and test data indicate that the model is not overfitted. The model also satisfies the objectives we defined at the beginning: no low-latency requirement and interpretability.

Let’s interpret the model through its feature importance.

feature_importance = pd.Series(data=model.feature_importances_, index=X_train.columns)
feature_importance.sort_values(ascending=False).head(15).plot(kind='bar')
BOGO classifier feature importance

For the income category, the most important feature was Less than 45K (income_cat_1), the lowest income category available. The second most important feature for BOGO was gender_F, indicating female customers.

In conclusion, we built a classification model to predict which customers will complete BOGO offers if they receive them. With 64% accuracy, the model suggests that BOGO offers are more likely to be completed by female customers with relatively lower incomes.

Build Models using the discount data

With similar steps as for BOGO, we can build a classifier for the discount offer data; a data preparation sketch and the cross-validation results are shown below.
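The preparation mirrors the BOGO case; a minimal sketch, assuming the same combined_data DataFrame and the same subsequent age, income, gender, and channel encodings, could look like this:

#: keep only discount offers and drop columns not used as features (assumed sketch, mirroring the BOGO prep)
combined_data4Discount = combined_data[combined_data['discount']==1]
combined_data4Discount.drop(columns=['bogo', 'discount', 'informational', 'person', 'offer_id', 'offer completed', 'offer received', 'offer viewed', 'offer_type'], inplace=True)
#: binary label, then the same feature engineering and train/test split as for BOGO
combined_data4Discount['viewed and completed'] = np.where(combined_data4Discount['viewed and completed']>0, 1, 0)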

various classifier algorithms for the discount data

As with the BOGO data, we pick the Decision Tree (CART) classifier and choose the best-performing hyperparameters using Grid Search with k-fold cross-validation to predict on the test dataset for the discount offers.

#: pick the best model
model = DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_leaf=4, min_samples_split=2)
#: fit the model to the training data
model.fit(X_train, y_train)
#: make predictions on the test data
y_pred = model.predict(X_test)
#: Check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.700

Conclusion for the discount model

Reflection

The Decision Tree classification model’s accuracy on the test data is 70%, which is better than the BOGO result.

Let’s examine the feature importance for this classifier.
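The plot can be produced the same way as for BOGO, assuming model and X_train now refer to the discount classifier and its training features:

#: same approach as for the BOGO model (assumed variable names)
feature_importance = pd.Series(data=model.feature_importances_, index=X_train.columns)
feature_importance.sort_values(ascending=False).head(15).plot(kind='bar')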

discount offer classifier feature importance

The social_1 feature has the highest importance. It indicates that discount offers delivered through social media have a good chance of being completed; in other words, discount offers sent via social media platforms are more effective than those sent via other channels. The second most important feature for discount offers was Less than 45K (income_cat_1). As with BOGO offers, income level is one of the key features for discount offers.

Improvement

We would like to improve the model’s accuracy with better features. We could collect more customer attributes to add more feature variables to the model.

Another improvement would be splitting the train/test data based on time-series features. If we had a more evenly distributed time-series feature, such as become_member_on, we could use a time-based splitting technique. But this time-based data was not evenly distributed, so we ended up dropping the column.
