The new challenge.
Let’s say you are a data scientist, and your Digital Marketing team has asked you to explore applying a Machine Learning model to their data. Specifically, there are millions of records of promotional emails that the marketing team has sent out in the past, containing data on products, customers, email details and so on, as well as an indicator of whether or not each email was opened (or a link inside it clicked).
Applying some regression models could definitely help identify some of the potential factors that affect the click rate. However, the reality is that logistic regression analysis has rarely churned out useful, actionable points to integrate into decision making. What the marketing people really want to know is how to ‘tweak’ their campaigns so that email engagement with customers is maximized. Telling them that a male customer has a higher chance of opening an email than a female customer is not going to be very helpful.
And this is where machine learning comes to save the world. Even though no algorithm can ‘learn’ to draft a perfect email that maximizes the chance of opening for every customer, what we need now is just a probability: a number that tells us how likely each customer is to open your very next email. With the probabilities for all your recipients, marketing gets the chance to forecast the fate of the next campaign, to properly differentiate among customers, and even to target them separately through different marketing channels.
What Machine Learning can provide to us.
Prediction-enabled multi-channel marketing
Right before you start the next campaign, run the Machine Learning model to make a prediction for each individual customer, and identify those who are most likely to open your promotional emails (green above), those who are less likely (blue), and those who are highly unlikely (red) to do so.
Once you have segmented the recipients, target them with different strategies. Send emails to the ‘green’ segment right away: they have a much higher chance of responding to your email!
For those in the ‘red’ region, think about which other channels might work for them. For example, you can run another Machine Learning model using data from a different channel (e.g. sales calls), and identify the customers with high probabilities for that channel to target differently.
And the ‘blue’ ones: this is an interesting (and tricky) region. You can either send emails to some of them, or use other channels as well. This region is flexible, and you can be more rigorous by squeezing it to give more space to either ‘red’ or ‘green’, thus increasing the share of either channel.
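To make the three segments concrete, here is a minimal sketch of how predicted probabilities could be mapped to ‘green’, ‘blue’ and ‘red’. The cut-offs 0.50 and 0.12 are the example thresholds used in this article; the customer ids, the probabilities and the function name `segment` are made up for illustration.

```python
from collections import Counter

# Example cut-offs only; tune them to your own campaign.
GREEN_THRESHOLD = 0.50
RED_THRESHOLD = 0.12


def segment(probability, green=GREEN_THRESHOLD, red=RED_THRESHOLD):
    """Map a predicted open probability to a colour segment."""
    if probability >= green:
        return "green"  # email right away
    if probability < red:
        return "red"    # consider another channel
    return "blue"       # flexible middle region


# Hypothetical predictions: customer id -> predicted open probability.
predictions = {"c1": 0.81, "c2": 0.34, "c3": 0.05, "c4": 0.50, "c5": 0.12}

segments = {cid: segment(p) for cid, p in predictions.items()}
counts = Counter(segments.values())

print(segments)
print(counts)
```

Squeezing the ‘blue’ region then simply means moving the two thresholds closer together.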
Estimate total responses to meet your marketing target.
The next step is to estimate the total number of opens for each segment (calculated from your predicted probabilities), and to see whether sending emails only to ‘green’, which gives you around 1008 opens (estimated), will actually meet your target this month.
One thing to note: the thresholds above (0.50 and 0.12) are just examples. You can enlarge the ‘green’ area to get more opens from the top-tier segment simply by lowering the 0.50 threshold, i.e. moving it to the left.
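One simple way to get such an estimate, assuming the predicted probabilities are reasonably well calibrated, is to add them up: the expected number of opens in a segment is the sum of the probabilities of its members. The sketch below uses made-up probabilities and a made-up monthly target to show how lowering the threshold raises the expected total.

```python
def expected_opens(probabilities, threshold):
    """Expected number of opens if we email everyone at or above threshold.

    Each email is treated as an independent yes/no event, so the expected
    count is the sum of the predicted probabilities.
    """
    return sum(p for p in probabilities if p >= threshold)


# Hypothetical predicted open probabilities for eight recipients.
probabilities = [0.875, 0.75, 0.5, 0.5, 0.25, 0.25, 0.125, 0.0625]

at_050 = expected_opens(probabilities, 0.50)  # 'green' only
at_025 = expected_opens(probabilities, 0.25)  # a wider 'green'

monthly_target = 3  # hypothetical target number of opens
print(at_050, at_050 >= monthly_target)
print(at_025, at_025 >= monthly_target)
```

With these toy numbers the 0.50 cut-off falls short of the target while the 0.25 cut-off meets it, at the price of sending more emails.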
Dig deeper into those high / low potential customers.
The marketing team is not satisfied with the probabilities alone. They would like to see more details about the customers in the ‘red’ and ‘green’ regions. Specifically, for the high-potential customers for example, they want to know how they are distributed geographically, and what their titles, departments, ages and so on are.
This can easily be done by mapping the selected customers to the DB and visualizing the details using the software of your choice. For example, the geo-location distribution of customers from the ‘green’ region can be visualized as below:
And you can even fine-tune your customer selection further down to the city level (on top of your selection of the ‘green’ segment).
Recommend other customers to send emails to beyond your selected pool of 20,000 recipients.
Since you have selected only 20,000 customers to send emails to, why not let the machine recommend more beyond your selection? This sounds like a great idea! Simply make predictions for the entire customer pool you have (maybe an entire potential customer pool of 100,000 across the whole country), rank by probability, and select the top 1,000 (or however many you choose) who are outside your selected target of 20,000.
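The ranking step could look roughly like the sketch below. The customer ids and scores are made up, and `recommend_extra` is a hypothetical helper name; in a real campaign the dictionary would hold the whole pool and the set would hold the 20,000 already chosen.

```python
import heapq


def recommend_extra(scores, already_selected, top_n):
    """Return the top_n highest-scoring customers not already selected."""
    candidates = (
        (cid, p) for cid, p in scores.items() if cid not in already_selected
    )
    return heapq.nlargest(top_n, candidates, key=lambda item: item[1])


# Hypothetical scores for the whole customer pool.
scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.6, "e": 0.4}
already_selected = {"a", "b"}  # stands in for the 20,000 chosen recipients

extra = recommend_extra(scores, already_selected, top_n=2)
print(extra)
```

Using `heapq.nlargest` avoids sorting the full pool when only a small top slice is needed.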
Update the model after each campaign.
After each campaign, new data with feedback on each email will be loaded into your database. This becomes part of your new training data for your next prediction. This way, you will be able to constantly update your model, keep up with new trends and hopefully make your next forecast more accurate.
What’s the takeaway?
The philosophy here is to “send the right message to the right customer”. You don’t want to keep spamming your customer base with a mass-email strategy, sending irrelevant emails all the time, because it will decrease their overall email response rate in the long run. (Think about any website you have subscribed to: once you have received a few irrelevant emails from them, do you feel less likely to open the next one? And how do you perceive that brand?)
Of course, the above is just an incomplete list of the things a Machine Learning model can offer your email campaign. Much more can be done using state-of-the-art algorithms and open-source libraries, way beyond what has been discussed. Next, we will talk in more technical detail about how to start with EDA (Exploratory Data Analysis), prepare data and engineer features, validate your models and make predictions.
The Machine Learning (technical) way
EDA is a good starting point
We still need to start with some EDA plots, mainly to help us visualize the data. They will also help us come up with feature engineering ideas when we train our model later on.
“Looks like customers prefer to open domestic emails …”
“This matches our previous discussion! The more emails they receive from us, the less likely they are to open a new one.”
Feature engineering requires extra time and effort. However, all the creativity and hard work invested here will eventually pay off.
We normally start with the basic ones: one-hot or integer encoding for all categorical features, such as city names, customer titles or departments, email domains, product and brand names, or even campaign themes. We convert them either to integers that the machine can understand or to a sparse encoding, depending on the cardinality of the feature.
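As a rough illustration of the two encodings, here is a library-free sketch. The rule used to pick between them (one-hot for low cardinality, integer codes above a cut-off) and the cut-off value itself are assumptions made for this example, not a fixed recipe; all function names are hypothetical.

```python
def build_vocab(values):
    """Give each distinct category an integer code, in order of first appearance."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def integer_encode(values, vocab):
    """Replace each category by its integer code."""
    return [vocab[value] for value in values]


def one_hot_coordinates(values, vocab):
    """Sparse one-hot encoding: the (row, column) positions that hold a 1."""
    return [(row, vocab[value]) for row, value in enumerate(values)]


def encode_feature(values, max_one_hot_cardinality=50):
    """Pick an encoding by cardinality (the cut-off is an assumed example value)."""
    vocab = build_vocab(values)
    if len(vocab) <= max_one_hot_cardinality:
        return "one_hot", one_hot_coordinates(values, vocab)
    return "integer", integer_encode(values, vocab)


titles = ["Dr", "Mr", "Dr", "Ms"]  # made-up customer titles

print(encode_feature(titles))
print(encode_feature(titles, max_one_hot_cardinality=2))
```

In practice a library encoder that outputs a sparse matrix directly would do the same job; the point here is only to show what the two representations contain.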
NLP (natural language processing) features are useful: things like bag of words, tf-idf, keyword statistics, or even topic modelling can help a lot in increasing prediction accuracy, or even add business value. The raw data that can be used here includes, for example, email titles and contents.
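To show what bag of words and tf-idf actually compute, here is a small hand-rolled sketch on made-up email subject lines. It uses the plain textbook weighting tf × log(N / df) with whitespace tokenization; library implementations typically use smoothed variants and better tokenizers, so the exact numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)  # the bag of words for this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical email subject lines.
subjects = ["big summer sale", "summer travel deals", "new product launch"]
weights = tf_idf(subjects)

for subject, w in zip(subjects, weights):
    print(subject, w)
```

A word such as "summer", which appears in two of the three subjects, gets a lower weight than "big", which appears in only one.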
From the time stamps, we can extract ‘day’, ‘month’, ‘week’, ‘hour’, ‘minute’ and so on to feed into our model. This relates to the date and time at which each email arrived in the customer’s mailbox.
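With Python’s standard library, extracting these parts from a time stamp could look like the sketch below. The example date is made up, and the day-of-week feature is an extra that is not in the list above.

```python
from datetime import datetime


def datetime_features(sent_at):
    """Split an email's arrival time into simple numeric features."""
    _, iso_week, iso_weekday = sent_at.isocalendar()
    return {
        "month": sent_at.month,
        "day": sent_at.day,
        "week": iso_week,        # ISO week number of the year
        "weekday": iso_weekday,  # 1 = Monday ... 7 = Sunday
        "hour": sent_at.hour,
        "minute": sent_at.minute,
    }


# Hypothetical arrival time: 15 March 2017, 09:30.
features = datetime_features(datetime(2017, 3, 15, 9, 30))
print(features)
```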
And most importantly, the ‘customer behaviour’ features, such as the number of products each customer has purchased previously, the number of days they have been subscribed, the number of emails they have received from us, how frequently they receive them, what percentage of emails they clicked on, and so on.
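Two of these behaviour features, the number of emails received so far and the share of them that were opened, can be sketched as follows. The email log and the field names are made up. One design choice worth pointing out: each row only uses emails sent before it, so that the outcome we are trying to predict does not leak into its own features.

```python
from collections import defaultdict


def add_history_features(emails):
    """Add per-customer history features, using only strictly earlier emails.

    emails: list of dicts with 'customer', 'sent_at' (sortable) and 'opened' (0/1).
    Returns new rows in chronological order.
    """
    received = defaultdict(int)
    opened = defaultdict(int)
    rows = []
    for email in sorted(emails, key=lambda e: e["sent_at"]):
        cid = email["customer"]
        n_prior = received[cid]
        rows.append({
            **email,
            "prior_emails": n_prior,
            "prior_open_rate": opened[cid] / n_prior if n_prior else 0.0,
        })
        # Update the running counts only after the row has been built.
        received[cid] += 1
        opened[cid] += email["opened"]
    return rows


# Hypothetical email log.
log = [
    {"customer": "c1", "sent_at": "2017-01-05", "opened": 1},
    {"customer": "c1", "sent_at": "2017-02-05", "opened": 0},
    {"customer": "c2", "sent_at": "2017-02-05", "opened": 0},
    {"customer": "c1", "sent_at": "2017-03-05", "opened": 1},
]

rows = add_history_features(log)
for row in rows:
    print(row)
```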
In summary, the features mentioned above fall into three main categories: product, customer and email related features. However, the ideas discussed serve only as a starting point and are far from an exhaustive list. Feature types also vary according to the types of data or tables that can be retrieved from the DB. It is highly advised that, as a data scientist, you sit down with your IT or BI teams and have a heart-to-heart discussion to fully understand what data exists that is relevant for building models, what information is available and unavailable, and how they could help you build the views and tables that would make your life easier.
Platform-wise, with proper sparse matrix handling in Python, your 16GB laptop can easily cope with data sets that have millions of rows and thousands of columns. Once your data set is too large to fit on a single machine, your next choice would be a distributed system such as Spark.
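A quick back-of-the-envelope calculation shows why sparse storage matters. The sketch below compares a dense matrix with a CSR (compressed sparse row) layout, assuming 8-byte values and 4-byte indices; the matrix size and the number of non-zero entries per row are made-up figures.

```python
def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Memory for a dense matrix: every cell is stored."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, n_nonzero, value_bytes=8, index_bytes=4):
    """Memory for a CSR matrix.

    CSR keeps one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra at the end).
    """
    return n_nonzero * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


n_rows, n_cols = 2_000_000, 5_000  # hypothetical data set size
nonzero_per_row = 50               # hypothetical sparsity

dense = dense_bytes(n_rows, n_cols)
sparse = csr_bytes(n_rows, n_rows * nonzero_per_row)

print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
```

Under these assumptions the dense matrix would need about 80 GB, while the sparse one needs a little over 1 GB and fits comfortably on a laptop.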
Modeling and Evaluation
Now you have created your training data: a (sparse) matrix containing millions of rows and thousands (or hundreds) of columns. The next thing to do is to split off a small portion of the data to use as the ‘test’ or ‘validation’ set. The most straightforward way is to split by the send date of each email: select the latest, say, 10,000 emails (if your latest campaign sent out 10,000 emails in total) out of all the data, use them as ‘test’, and use the remaining emails as ‘training’.
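A date-based split of this kind could be sketched as below. The rows and the `sent_at` field name are made up, and `split_latest` is a hypothetical helper; with real data `n_test` would be something like the 10,000 emails of the latest campaign.

```python
def split_latest(rows, n_test, date_key="sent_at"):
    """Hold out the n_test most recently sent rows as the test set."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    ordered = sorted(rows, key=lambda row: row[date_key])
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical emails, deliberately out of order.
emails = [
    {"id": 1, "sent_at": "2017-03-01"},
    {"id": 2, "sent_at": "2017-01-15"},
    {"id": 3, "sent_at": "2017-04-10"},
    {"id": 4, "sent_at": "2017-02-20"},
    {"id": 5, "sent_at": "2017-04-11"},
    {"id": 6, "sent_at": "2017-01-02"},
]

train, test = split_latest(emails, n_test=2)
print([row["id"] for row in train])
print([row["id"] for row in test])
```

Splitting by date rather than at random mimics the real situation, where the model is always asked about a campaign that comes after everything it was trained on.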
A good Machine Learning model is the key to success; eventually, it all comes down to your prediction accuracy. Your first choice should be XGBoost (extreme gradient boosting), or at least an equivalent, as your main workhorse for building the predictive model. XGBoost has the advantages of being accurate, fast and highly scalable. It handles missing values and sparse matrices extremely well. For a binary email classification problem like this, it can easily produce a decent result (as long as your data is reliable). Details about XGBoost can be found here: https://xgboost.readthedocs.io/en/latest/
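As a starting point, a training setup with the xgboost Python package might look like the sketch below. The parameter values are untuned placeholders chosen for illustration, and `train_open_model` is a hypothetical helper name; the feature matrices and labels are assumed to come from the feature engineering and split steps described earlier.

```python
# Illustrative, untuned settings for a binary 'opened or not' model.
PARAMS = {
    "objective": "binary:logistic",  # output is a probability of opening
    "eval_metric": "auc",
    "eta": 0.1,                      # learning rate
    "max_depth": 6,
    "subsample": 0.8,                # row sampling per tree
    "colsample_bytree": 0.8,         # column sampling per tree
}


def train_open_model(X_train, y_train, X_valid, y_valid, num_rounds=500):
    """Train a booster and stop early when validation AUC stops improving.

    X_* may be dense arrays or scipy sparse matrices; y_* are 0/1 labels.
    """
    import xgboost as xgb  # imported here so PARAMS can be inspected without it

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    return xgb.train(
        PARAMS,
        dtrain,
        num_boost_round=num_rounds,
        evals=[(dvalid, "valid")],
        early_stopping_rounds=20,
    )
```

Calling `predict` on the returned booster with a `DMatrix` of new recipients then yields one open probability per row.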
As for the evaluation metric: if you have far fewer opened emails than unopened ones (which is normally the case), i.e. the data set is highly skewed, you need to think about using the AUC score or an equivalent metric. AUC normally ranges from 0.5 (random guess) to 1.0 (perfect prediction) and, unlike plain accuracy, its random-guess baseline does not shift when the classes are imbalanced.
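One way to read the AUC score is as the probability that a randomly chosen opened email gets a higher predicted score than a randomly chosen unopened one. The sketch below computes it directly from that definition on made-up labels and scores. It compares every pair, so it is only meant for illustration; on real data you would more likely use a library routine such as scikit-learn’s `roc_auc_score`.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical, skewed example: four unopened emails, two opened ones.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]

auc = auc_score(labels, scores)
print(auc)
```

Here seven of the eight opened/unopened pairs are ordered correctly, which gives an AUC of 0.875.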
Alternatively, if you find AUC too difficult to explain (or find it hard to justify whether an AUC score of 0.8 means your model is good enough), use a simple 2 x 2 confusion matrix.
A confusion matrix is easier to explain to users, but on the other hand, finding the proper threshold to convert probabilities into actual ‘open’ or ‘unopen’ labels will take nontrivial time and effort.
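The effect of the threshold is easy to see on a toy example. The sketch below builds the four cells of the confusion matrix for two different cut-offs; the labels, probabilities and threshold values are all made up.

```python
def confusion_matrix(labels, probabilities, threshold):
    """Return (tp, fp, fn, tn) after cutting the probabilities at threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probabilities):
        predicted_open = p >= threshold
        if predicted_open and y == 1:
            tp += 1  # predicted open, actually opened
        elif predicted_open:
            fp += 1  # predicted open, actually unopened
        elif y == 1:
            fn += 1  # predicted unopen, actually opened
        else:
            tn += 1  # predicted unopen, actually unopened
    return tp, fp, fn, tn


# Hypothetical test set: 1 = opened, 0 = unopened.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]

results = {t: confusion_matrix(labels, probabilities, t) for t in (0.5, 0.15)}
for threshold, (tp, fp, fn, tn) in results.items():
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Lowering the cut-off from 0.5 to 0.15 catches the one opened email that was missed, but triples the number of false alarms. Which trade-off is right depends on what a wasted email costs compared with a missed open.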
As of today, we haven’t really taken the ‘Artificial’ part out of AI. This means feature engineering, model tuning and validation are still part of our job as Data Scientists. However, as we have seen, machine learning has already enabled us to do what we couldn’t before: forecast. With a cutting-edge algorithm and carefully crafted features from the data, we could easily reach a prediction accuracy of, say, an AUC score of 0.9 or above.
This bit of ‘Intelligence’ will help the marketing team not only ‘send the right message to the right person’, but also deepen their understanding of their customers’ behaviors, properly segment their customer base, and eventually enable data-driven decision making.