In my previous post, I demonstrated how to use an autoencoder for credit card fraud detection and achieved an AUC score of 0.94. This time, I will explore another model – the Restricted Boltzmann Machine (RBM) – along with its detailed implementation and results in TensorFlow.
RBM was one of the earliest models introduced in the world of deep learning. There have been many successful use cases of RBM in areas such as dimensionality reduction, classification, collaborative filtering, feature learning as well as anomaly detection.
In this tutorial, we will again train in an unsupervised way – without feeding labels to the model – and will achieve a slightly better result than the autoencoder!
The whole post is divided into three parts below:
- Introduction to RBM
- Implementation in TensorFlow
- Results and interpretation
All code can be found on GitHub.
Overview of the data set
I will be using the exact same credit card data set here, so please refer to my previous post if you would like to learn more about it. You can also download the data from Kaggle here.
RBM – a quick introduction
There are many good online resources that offer either brief or in-depth explanation of RBM:
- https://www.youtube.com/watch?v=FJ0z3Ubagt4 & https://www.youtube.com/watch?v=p4Vh_zMw-HQ – the best YouTube explanations of RBM, in my opinion.
- https://deeplearning4j.org/restrictedboltzmannmachine – a very intuitive and easy-to-understand introduction to RBM.
- http://deeplearning.net/tutorial/rbm.html – theory plus a Theano implementation.
Basically, an RBM is a network consisting of two layers – the visible layer and the hidden layer. There are symmetric connections between every pair of nodes across the visible and hidden layers, and no connections within each layer.
In most cases, both the hidden and visible layers are binary-valued. There are also extensions with a Gaussian visible layer and a Bernoulli hidden layer. The latter will be our setup for fraud detection, since our input data will be normalized to mean 0 and standard deviation 1.
When signals propagate from the visible to the hidden layer, the input (i.e. the data sample) is multiplied by the matrix W, added to the hidden bias vector b, and finally passed through the sigmoid function to be squashed between 0 and 1 – these values are the probabilities of each hidden neuron being on. However, it is very important to keep the hidden states binary (0 or 1 for each), rather than using the probabilities themselves. Only during the last update of the Gibbs sampling should we use the probabilities for the hidden layer, which we will talk about later on.
During the backward pass, or reconstruction, the hidden layer activations become the input: they are multiplied by the transpose of the same matrix W, added to the visible biases, and then either passed through the sigmoid function (for a Bernoulli visible layer) or sampled from a multivariate Gaussian distribution (for a Gaussian visible layer).
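The two passes can be sketched in NumPy as below. The dimensions, random weights, and the Gaussian-visible choice are toy assumptions for illustration, not the actual trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions, for illustration only.
n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.01, size=(n_hidden, n_visible))  # weight matrix
b = np.zeros(n_hidden)                               # hidden bias
c = np.zeros(n_visible)                              # visible bias

x = rng.normal(size=n_visible)  # a standardized data sample

# Visible -> hidden: probabilities, then sampled binary states.
h_prob = sigmoid(W @ x + b)
h_state = (rng.uniform(size=n_hidden) < h_prob).astype(float)

# Hidden -> visible (Gaussian case): sample around the mean c + W.T @ h.
x_recon = rng.normal(loc=c + W.T @ h_state, scale=1.0)
```

Note that the forward pass samples hard 0/1 hidden states, while the backward pass here draws from a Gaussian with identity covariance, as described above.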
Intuitively, we can understand that during training the model adjusts its weights so that its reconstruction distribution q best approximates the training data distribution p.
We first define the so-called energy function E(x, h), as well as the joint probability p(x, h), for any pair of visible and hidden layers:

E(x, h) = −cᵀx − bᵀh − hᵀWx

p(x, h) = e^(−E(x, h)) / Z

In the above equations, x is the visible layer, h is the hidden layer activations, W, b and c are the weight matrix, hidden bias and visible bias, respectively, and Z is the normalizing constant (the partition function). (For a Gaussian visible layer, the −cᵀx term is replaced by ‖x − c‖²/2.)
So, for any pair of x and h, we can compute E(x, h), which is a scalar. The higher the energy, the lower the joint probability p(x, h).
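To make the energy function concrete, here is a minimal sketch with tiny hand-checkable values (the matrix and vectors are made up for illustration):

```python
import numpy as np

def energy(x, h, W, b, c):
    # E(x, h) = -c.x - b.h - h^T W x  -> a single scalar
    return -(c @ x) - (b @ h) - (h @ W @ x)

# Tiny example: 2 visible units, 2 hidden units, identity weights, zero biases.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # shape (n_hidden, n_visible)
b = np.zeros(2)
c = np.zeros(2)
x = np.array([1.0, 2.0])
h = np.array([1.0, 0.0])

print(energy(x, h, W, b, c))  # -> -1.0, since only h_1 * W_11 * x_1 contributes
```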
1) How to detect fraud with RBM?
From the energy function, we can derive the equation below for the so-called free energy of x:

F(x) = −cᵀx − Σⱼ log(1 + exp(bⱼ + Wⱼ·x))

where Wⱼ is the j-th row of W. (For a Gaussian visible layer, the first term becomes ‖x − c‖²/2.)
This free energy F(x), also a scalar, is exactly what we need for the test data: we will use its distribution to detect anomalies. The higher the free energy, the higher the chance of x being a fraud.
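As a minimal sketch (random weights and samples are stand-ins for the trained model and real data), free energy can be computed and used to rank samples like this:

```python
import numpy as np

def free_energy(x, W, b, c):
    # F(x) = -c.x - sum_j log(1 + exp(b_j + W_j . x))
    # (for a Gaussian visible layer, the first term becomes ||x - c||^2 / 2)
    return -(c @ x) - np.sum(np.log1p(np.exp(b + W @ x)))

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(4, 6))   # (n_hidden, n_visible), illustrative
b, c = np.zeros(4), np.zeros(6)

samples = rng.normal(size=(10, 6))
scores = np.array([free_energy(x, W, b, c) for x in samples])
suspects = np.argsort(scores)[::-1][:3]  # the three highest-energy samples
```

Ranking by `scores` in descending order surfaces the most anomalous samples first, which is exactly how we will score the validation set later.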
2) How to update model parameters?
To update our parameters W, b and c, we use the equations below combined with SGD (α is the learning rate):

W ← W + α ( h(x(t)) x(t)ᵀ − h(x̃) x̃ᵀ )
b ← b + α ( h(x(t)) − h(x̃) )
c ← c + α ( x(t) − x̃ )

The first (positive) term, h(x(t)) x(t)ᵀ, is easy to calculate: x(t) is just the training sample, and h(x(t)) is simply:

h(x) = sigmoid(b + Wx)

So the outcome is simply an outer product of two vectors, which gives a result of the same shape as W.
Okay, that's easy! But how do we calculate the second (negative) term, h(x̃) x̃ᵀ? The answer is to use Gibbs sampling:
We start with a training sample x(t), and sample each hⱼ by first calculating:

p(hⱼ = 1 | x) = sigmoid(bⱼ + Wⱼ·x)

Then we draw a value from the uniform distribution [0, 1]; if the drawn value is smaller than the probability calculated above, we assign hⱼ = 1, otherwise hⱼ = 0. We do this for each hⱼ.
Next, we sample the visible layer xₖ, using the previously sampled hidden layer as input, with a similar equation:

p(xₖ = 1 | h) = sigmoid(cₖ + (Wᵀh)ₖ)

Here, we directly use the result of the sigmoid, without sampling it into states of 0 and 1, to get the visible layer.
However, we do this step slightly differently if the input data (the visible layer x) follows a Gaussian distribution. In that case, we sample the x vector from a Gaussian with mean μ = c + Wᵀh and an identity covariance matrix. This part is fully implemented in the code, which you can check for verification (in the function sample_visible_from_hidden).
We repeat this for k steps, a procedure referred to as Contrastive Divergence, or CD-k.
After the last step k, we use the sampled visible layer as x̃, together with the last hidden probabilities as h(x̃). Please note that here we use the probabilities, not the sampled 0/1 states, for h(x̃).
In summary, the whole process looks like this:
- Start with training sample x
- Sample h from x
- Sample x from h
- Sample h from x
- … (repeat for k steps) …
- Sample x from h
- Sample h from x (this last h uses the probabilities)
In practice, using k = 1 can give a good result.
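Putting the pieces together, a single CD-1 update might be sketched as below. The dimensions, learning rate, and the Gaussian-visible sampling are illustrative assumptions; the actual class in the repo handles batching, momentum, and regularization on top of this:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, alpha=0.001):
    # Positive phase: hidden probabilities and sampled binary states.
    h0_prob = sigmoid(W @ x + b)
    h0_state = (rng.uniform(size=h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one Gibbs step with a Gaussian visible layer.
    x1 = rng.normal(loc=c + W.T @ h0_state)
    h1_prob = sigmoid(W @ x1 + b)  # the final hidden update uses probabilities
    # SGD updates: data term minus model term.
    W += alpha * (np.outer(h0_prob, x) - np.outer(h1_prob, x1))
    b += alpha * (h0_prob - h1_prob)
    c += alpha * (x - x1)
    return W, b, c

n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.01, size=(n_hidden, n_visible))
b, c = np.zeros(n_hidden), np.zeros(n_visible)
x = rng.normal(size=n_visible)  # one standardized training sample
W, b, c = cd1_update(x, W, b, c)
```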
At this point, we have covered the whole model-updating process.
3) How to tune hyper-parameters?
We will stick to a data-driven approach: split the data into training and validation sets, train the model on the training set, and evaluate its performance on the validation set.
Start with a hidden layer of smaller dimension than the input layer (e.g. 5 or 10), set the learning rate to a small value (such as 0.001), and monitor the reconstruction error on the validation set (not the actual error against labels).
The reconstruction error is the mean squared difference between the reconstructed and the actual data x, averaged over the entire mini-batch.
If the reconstruction error stops decreasing, that would be a sign for early-stopping.
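A minimal sketch of this monitoring metric is below, assuming the same toy dimensions and the Gaussian-visible convention used earlier (the real class exposes an equivalent method):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_error(X, W, b, c):
    # Mean squared difference between the batch and its one-step reconstruction.
    H = sigmoid(X @ W.T + b)   # hidden probabilities, one row per sample
    X_recon = c + H @ W        # Gaussian visible: use the mean directly
    return np.mean((X - X_recon) ** 2)

rng = np.random.default_rng(0)
val_x = rng.normal(size=(100, 6))               # stand-in validation batch
W = rng.normal(0, 0.01, size=(4, 6))
b, c = np.zeros(4), np.zeros(6)

err = reconstruction_error(val_x, W, b, c)
```

In a training loop, you would compute `err` after each epoch and stop once it has failed to improve for a few consecutive epochs.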
A comprehensive guide to tuning RBMs can be found in Geoffrey Hinton's notes, which you are encouraged to read.
Coding the RBM
The code was modified from here – an excellent implementation in TensorFlow. I only made a few changes:
- Implemented momentum for faster convergence.
- Added L2 regularization.
- Added methods for retrieving the free energy and the reconstruction error on validation data.
- Simplified the code a bit by removing parts of the tf summaries (originally not compatible with TF versions above 1.1).
- Added a few utilities, such as plotting the training loss.
Basically, the code is an sklearn-style RBM class that you can directly use for training and prediction.
Training and Results
1) Training and validation
We split our data by transaction time into training and validation sets (50/50), and train our model on the training set:
```python
TEST_RATIO = 0.50
df.sort_values('Time', inplace=True)
TRA_INDEX = int((1 - TEST_RATIO) * df.shape[0])
train_x = df.iloc[:TRA_INDEX, 1:-2].values
train_y = df.iloc[:TRA_INDEX, -1].values
test_x = df.iloc[TRA_INDEX:, 1:-2].values
test_y = df.iloc[TRA_INDEX:, -1].values
```
After training the model, we will calculate the free energy of the validation set and visualize the distributions for both fraud and non-fraud samples.
2) Data pre-processing
Since the data are already PCA-transformed, we only need to standardize them with the z-score to get mean 0 and standard deviation 1, as below:
```python
cols_mean = []
cols_std = []
for c in range(train_x.shape[1]):
    cols_mean.append(train_x[:, c].mean())
    cols_std.append(train_x[:, c].std())
    train_x[:, c] = (train_x[:, c] - cols_mean[-1]) / cols_std[-1]
    test_x[:, c] = (test_x[:, c] - cols_mean[-1]) / cols_std[-1]
```
Please note that you need to calculate the statistics on the training set only, as done above, not on the full data set (training and validation combined), to avoid leaking information from the validation set.
After that, we fit the data using a model with a Gaussian visible layer. (This is the Gaussian–Bernoulli RBM, since the hidden layer is still binary-valued.)
3) Visualization of results
It is clearly seen that the fraud data's free energy is much more uniformly distributed than that of the non-fraud data.
If you calculate the AUC score (area under the ROC curve) on the validation set, you will get a score of around 0.96!
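The scoring step might look like this sketch. Since the trained model's outputs are not reproduced here, the labels and free energies below are synthetic stand-ins, so the resulting AUC will not match the 0.96 reported above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 0/1 labels with a handful of frauds, and free
# energies where frauds tend to score higher (as the model should produce).
test_y = np.zeros(2000, dtype=int)
test_y[:3] = 1                           # three synthetic frauds (~0.15%)
free_energies = rng.normal(size=2000)
free_energies[:3] += 4.0                 # frauds get higher free energy

# Higher free energy = more likely fraud, so use it directly as the score.
auc = roc_auc_score(test_y, free_energies)
```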
4) Real time application
To use it as a real-time fraud detector, we need to choose a threshold based on the validation data set. This can be done by trading off the precision and recall curves (e.g. a value of 100 might give a relatively good balance):
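A sketch of this trade-off using sklearn is below; the labels and scores are again synthetic stand-ins, and the balanced-point rule is just one plausible way to pick the threshold:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic validation labels and free-energy scores, for illustration.
val_y = np.zeros(2000, dtype=int)
val_y[:3] = 1
scores = rng.normal(size=2000)
scores[:3] += 4.0

precision, recall, thresholds = precision_recall_curve(val_y, scores)

# Pick the threshold where precision and recall are closest to balanced.
i = int(np.argmin(np.abs(precision[:-1] - recall[:-1])))
threshold = thresholds[i]

# In production, any transaction scoring above the threshold gets flagged.
flagged = scores >= threshold
```

Depending on the business cost of false positives versus missed frauds, you might instead pick a threshold favoring recall over precision.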
5) A further interpretation of the val AUC score of 0.96
There’s another way to intuitively look at the AUC score:
The val data’s fraud percentage is around 0.16%;
For example, if we choose top 500 transactions ranked by model’s free energy predictions, the number of fraud is around 11.82% …
So, precision increases from 0.16% to 11.82% in the top 500 …
Again, please see GitHub for the full code implementation and notebook.
As an exercise, try using the reconstruction error in place of the free energy as the fraud score, and compare the results. This is similar to what I did previously in the autoencoder tutorial.