Credit card fraud detection 2 – using Restricted Boltzmann Machine in TensorFlow

In my previous post, I demonstrated how to use an autoencoder for credit card fraud detection and achieved an AUC score of 0.94. This time, I will be exploring another model – the Restricted Boltzmann Machine (RBM) – along with its detailed implementation and results in TensorFlow.

The RBM was one of the earliest models introduced in the world of deep learning. It has been used successfully in areas such as dimensionality reduction, classification, collaborative filtering, feature learning and anomaly detection.

In this tutorial, we will again train in an unsupervised way – without feeding labels to the model – and will achieve a slightly better result than the autoencoder!

The whole post is divided into three parts below:

  1. Introduction to RBM
  2. Implementation in TensorFlow 
  3. Results and interpretation 

All code can be found on GitHub.

Overview of the data set

I will be using the exact same credit card data here, so please refer to my previous post if you would like to know more about it. You can also download the data from Kaggle here if you want.

RBM – a quick introduction

There are many good online resources that offer either brief or in-depth explanation of RBM:

  1. https://www.youtube.com/watch?v=FJ0z3Ubagt4 & https://www.youtube.com/watch?v=p4Vh_zMw-HQ – The best YouTube explanations of RBM, in my opinion.
  2. https://deeplearning4j.org/restrictedboltzmannmachine – Gives a very intuitive and easy-to-understand explanation of RBM.
  3. http://deeplearning.net/tutorial/rbm.html – Theories + Theano implementation

Basically, an RBM is a network consisting of two layers – the visible layer and the hidden layer. There are symmetric connections between every pair of visible and hidden nodes, and no connections within each layer.

In the majority of cases, both the hidden and visible layers are binary-valued. There are also extensions with a Gaussian visible layer and a Bernoulli hidden layer; the latter will be our case for fraud detection, since our input data will be normalized to mean 0 and standard deviation 1.

[Figure: an RBM's visible and hidden layers with symmetric connections between them]

When signals propagate from the visible to the hidden layer, the input (i.e. the data sample) is multiplied by the matrix W, added to the hidden bias vector b, and finally passed through the sigmoid function so that it is squashed to lie between 0 and 1; these values are also the probabilities of each hidden neuron being on. However, it is very important to keep the hidden states binary (0 or 1 for each neuron), rather than using the probabilities themselves. Only during the last update of the Gibbs sampling should we use the probabilities for the hidden layer, which we will come back to later on.
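To make the forward pass concrete, here is a minimal NumPy sketch of the visible-to-hidden propagation and the binary sampling step (an illustration in the notation above, not the actual class from the repository):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(x, W, b, rng):
    # x: visible vector (n_visible,), W: weights (n_hidden, n_visible), b: hidden bias (n_hidden,)
    h_prob = sigmoid(b + W @ x)                       # P(h_j = 1 | x) for each hidden neuron
    h_state = (rng.uniform(size=h_prob.shape) < h_prob).astype(np.float32)  # keep the states binary
    return h_prob, h_state

# Example: rng = np.random.default_rng(0); h_prob, h_state = sample_hidden(x, W, b, rng)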

During the backward pass, or reconstruction, the hidden layer activation becomes the input: it is multiplied by the same matrix W (transposed), added to the visible biases, and then either passed through the sigmoid function (for a Bernoulli visible layer) or used as the mean of a multivariate Gaussian to sample from (for a Gaussian visible layer), as below:

$$\hat{x} = \sigma(c + W^{T}h) \quad \text{(Bernoulli visible)}, \qquad \hat{x} \sim \mathcal{N}(c + W^{T}h,\, I) \quad \text{(Gaussian visible)}$$

Intuitively, during training the model adjusts its weights so that its reconstruction distribution q approximates the training data distribution p as closely as possible, as below:

[Figure: the training data distribution p and the model's reconstruction distribution q]

We first define the so-called Energy Function E(x, h) as well as the joint probability p(x, h) for any pair of visible and hidden layers, as below:

$$E(x, h) = -h^{T}Wx - c^{T}x - b^{T}h$$

$$p(x, h) = \frac{e^{-E(x, h)}}{Z}$$

where Z is the normalizing constant (partition function).

In the above equations, x is the visible layer, h is the hidden layer activation, and W, b and c are the weight matrix, the hidden bias and the visible bias, respectively.

So basically, for any pair of x and h we can calculate E(x, h), which is a scalar; the higher the energy, the lower the joint probability p(x, h).
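As a quick check of the definition, the energy of any (x, h) pair can be computed directly; a minimal NumPy sketch using the same shapes as before:

import numpy as np

def energy(x, h, W, b, c):
    # E(x, h) = -h^T W x - c^T x - b^T h  -> a single scalar
    # x: (n_visible,), h: (n_hidden,), W: (n_hidden, n_visible), b: hidden bias, c: visible bias
    return -(h @ W @ x) - (c @ x) - (b @ h)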

1) How to detect fraud with RBM?

From the energy function, we can derive the equation below for the so-called Free Energy of x:

$$F(x) = -\log \sum_{h} e^{-E(x, h)} = -c^{T}x - \sum_{j}\log\!\left(1 + e^{\,b_{j} + W_{j}x}\right), \qquad p(x) = \frac{e^{-F(x)}}{Z}$$

This Free Energy, F(x), which is also a scalar, is exactly what we will compute for the test data; its distribution is what we will use to detect anomalies. The higher the free energy of a sample, the higher the chance of it being a fraud.
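A minimal NumPy sketch of this Free Energy score, following the formula above (the repository exposes its own method for this; the version here is only illustrative, and for a Gaussian visible layer the visible-bias term takes a slightly different, quadratic form):

import numpy as np

def free_energy(x, W, b, c):
    # F(x) = -c^T x - sum_j log(1 + exp(b_j + W_j x)); higher F(x) => more likely to be fraud
    return -(c @ x) - np.sum(np.logaddexp(0.0, b + W @ x))  # logaddexp gives a stable log(1 + e^z)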

2) How to update model parameters?

To update our parameters W, b and c, we use the equations below combined with SGD (α is the learning rate):

$$W \leftarrow W + \alpha\left(h(x^{(t)})\,x^{(t)T} - h(\widetilde{x})\,\widetilde{x}^{T}\right)$$

$$b \leftarrow b + \alpha\left(h(x^{(t)}) - h(\widetilde{x})\right)$$

$$c \leftarrow c + \alpha\left(x^{(t)} - \widetilde{x}\right)$$

The left part, h(x^{(t)})x^{(t)T}, is easy to calculate, as x^{(t)} is just the training sample, and h(x^{(t)}) is simply:

$$h(x) = \sigma(b + Wx), \qquad \text{i.e. } h(x)_{j} = p(h_{j} = 1 \mid x)$$

So the outcome is simply an outer product of two vectors, which gives a matrix of the same shape as W.

Okay, that’s easy! But how do we calculate h(\widetilde{x})\widetilde{x}^{T}? The answer is to use Gibbs sampling:

[Figure: the Gibbs sampling chain, starting from the training sample x^{(t)} and alternating between sampling h and x until we reach \widetilde{x}]

We start with a training sample x^{(t)}, and sample each h_j by first calculating:

$$p(h_{j} = 1 \mid x) = \sigma\!\left(b_{j} + W_{j\cdot}\, x\right)$$

Then we draw a value from a uniform distribution over [0, 1]; if the drawn value is smaller than the probability calculated above, we assign 1 to h_j, otherwise 0. We do this for each h_j.

Next, we sample the visible layer x_k, using the previously sampled hidden layer as input, with a similar equation:

$$p(x_{k} = 1 \mid h) = \sigma\!\left(c_{k} + (W^{T}h)_{k}\right)$$

Here, we directly use the outputs of the sigmoid as the visible layer, without sampling them into 0/1 states.

However, we do this step slightly differently if the input data, i.e. the visible layer x, follows a Gaussian distribution. In that case we sample the x vector from a Gaussian with mean μ = c + W^T h and identity covariance matrix. This part is fully implemented in the code, which you can check for verification (see the function sample_visible_from_hidden).
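A minimal NumPy sketch of this visible-sampling step (illustrative only, not the repository's sample_visible_from_hidden):

import numpy as np

def sample_visible_gaussian(h, W, c, rng):
    # Gaussian visible layer: x | h ~ N(c + W^T h, I)
    mean = c + W.T @ h
    return mean + rng.standard_normal(mean.shape)

def sample_visible_bernoulli(h, W, c):
    # Bernoulli visible layer: just return sigmoid(c + W^T h) without sampling to 0/1
    return 1.0 / (1.0 + np.exp(-(c + W.T @ h)))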

We repeat this for k steps, which is referred to as Contrastive Divergence, or CD-k.

After the last step k, we use the sampled visible layer as \widetilde{x}, together with the last hidden probabilities as h(\widetilde{x}). Please note that what we use here are the probabilities, not the sampled 0/1 states, for h(\widetilde{x}).

In summary, the whole process looks like this:

  • Start with a training sample x – x^{(t)}
  • Sample h from x – h(x^{(t)})
  • Sample x from h
  • Sample h from x
  • Sample x from h – \widetilde{x}
  • Sample h from x – h(\widetilde{x})

In practice, using k = 1 can give a good result.
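Putting the pieces together, here is a minimal NumPy sketch of one CD-1 update for a single training sample with a Gaussian visible layer and a Bernoulli hidden layer (a simplified illustration of the procedure above, not the actual TensorFlow implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x_t, W, b, c, alpha, rng):
    # Positive phase: h(x^(t)) as probabilities, plus binary states to drive the chain
    h0_prob = sigmoid(b + W @ x_t)
    h0_state = (rng.uniform(size=h0_prob.shape) < h0_prob).astype(x_t.dtype)

    # Negative phase: sample x~ from N(c + W^T h, I), then recompute the hidden probabilities
    x_tilde = c + W.T @ h0_state + rng.standard_normal(x_t.shape)
    h1_prob = sigmoid(b + W @ x_tilde)   # probabilities, not samples, at the last step

    # Parameter updates; the outer products have the same shape as W
    W += alpha * (np.outer(h0_prob, x_t) - np.outer(h1_prob, x_tilde))
    b += alpha * (h0_prob - h1_prob)
    c += alpha * (x_t - x_tilde)
    return W, b, c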

Up to this point, we have covered the whole process of updating the model.

3) How to tune hyper-parameters?

We will stick to a data-driven approach.

Split the data into training and validation sets, train the model on the training set, and evaluate its performance on the validation set.

Start with a hidden layer of smaller dimension than the input layer (e.g. 5 or 10), set the learning rate to a small value (such as 0.001), and monitor the reconstruction error on the validation set (not the actual error against the labels).

The reconstruction error is basically the mean squared difference between the predicted \widetilde{x} and the actual data x, averaged over the entire mini-batch.

If the reconstruction error stops decreasing, that is a sign to stop training early.
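The reconstruction-error check itself is a one-liner; the early-stopping loop below is only a hedged sketch, since the actual method names in the repository may differ:

import numpy as np

def reconstruction_error(x_batch, x_recon):
    # Mean squared difference between the data and its reconstruction, averaged over the batch
    return np.mean((x_batch - x_recon) ** 2)

# Hypothetical training loop with early stopping (rbm, partial_fit and reconstruct are placeholders):
# best_err, patience = np.inf, 5
# for epoch in range(200):
#     rbm.partial_fit(train_x)
#     err = reconstruction_error(test_x, rbm.reconstruct(test_x))
#     if err < best_err:
#         best_err, patience = err, 5
#     else:
#         patience -= 1
#         if patience == 0:
#             break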

A comprehensive guide to tuning RBMs can be found in Geoffrey Hinton’s practical notes, which you are encouraged to take a look at.

Coding the RBM

The code was modified from here, which is an excellent TensorFlow implementation. I only made a few changes:

  1. Implemented momentum for faster convergence.
  2. Added L2 regularisation.
  3. Added methods for retrieving the Free Energy as well as the Reconstruction Error on validation data.
  4. Simplified the code a bit by removing parts of the tf summaries (the originals were not compatible with TF versions 1.1 and above).
  5. Added a few utilities, such as plotting the training loss.

Basically, the code is an sklearn-style RBM class that you can use directly to train and predict.

Training and Results

1) Training and validation

We split our data by transaction time into training and validation sets with a 50/50 split, and train our model on the training set.

import pandas as pd

# Load the Kaggle credit card data set
df = pd.read_csv('creditcard.csv')

TEST_RATIO = 0.50

# Sort by transaction time: first half for training, second half for validation.
# Features are the PCA components V1-V28; the label is the Class column.
df.sort_values('Time', inplace=True)
TRA_INDEX = int((1 - TEST_RATIO) * df.shape[0])
train_x = df.iloc[:TRA_INDEX, 1:-2].values
train_y = df.iloc[:TRA_INDEX, -1].values

test_x = df.iloc[TRA_INDEX:, 1:-2].values
test_y = df.iloc[TRA_INDEX:, -1].values

After we train the model, we will calculate the Free Energy of the validation set and visualize its distribution for both the fraud and the non-fraud samples.

2) Data pre-processing 

Since the data are already PCA transformed, we will only need to standardize them with z-score to get mean of 0 and standard deviation of 1, as below:

cols_mean = []
cols_std = []
for c in range(train_x.shape[1]):
   cols_mean.append(train_x[:,c].mean())
   cols_std.append(train_x[:,c].std())
   train_x[:, c] = (train_x[:, c] - cols_mean[-1]) / cols_std[-1]
   test_x[:, c] = (test_x[:, c] - cols_mean[-1]) / cols_std[-1]

Please note that you need to calculate the statistics using the training set only, as done above, instead of on the full data set (training and validation combined).

After that, we will fit the data using a model with a Gaussian visible layer. (This is the Gaussian–Bernoulli RBM, since the hidden layer is still binary-valued.)
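For illustration, training might look roughly like the sketch below; the class name, constructor arguments and method names here are hypothetical, so please check the GitHub repository for the real API:

# Hypothetical usage of the sklearn-style RBM class (all names are illustrative only)
# from rbm import GaussianBernoulliRBM
#
# model = GaussianBernoulliRBM(n_visible=train_x.shape[1], n_hidden=10,
#                              learning_rate=0.001, batch_size=64, num_epochs=30)
# model.fit(train_x)                            # unsupervised: no labels are passed in
# free_energies = model.free_energy(test_x)     # one anomaly score per validation sample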

3) Visualization of results

It is clear from the histograms below that the Free Energy of the fraud samples is spread out much more uniformly than that of the non-fraud samples.

[Figure: histogram of Free Energy for non-fraud samples]

[Figure: histogram of Free Energy for fraud samples]

If you calculate the AUC score (Area Under the ROC Curve) on the validation set, you will get a score of around 0.96!
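The AUC itself can be computed with scikit-learn, using each validation sample's Free Energy as its anomaly score (free_energies below stands for whatever the model's Free-Energy method returns):

from sklearn.metrics import roc_auc_score

# test_y: true labels (1 = fraud, 0 = non-fraud); free_energies: one score per validation sample
auc = roc_auc_score(test_y, free_energies)
print('Validation AUC: %.3f' % auc)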

4) Real time application

To use it as a real-time fraud detector, we need to choose a threshold based on the validation set. This can be done by trading off the precision and recall curves below (e.g. a Free Energy threshold of around 100 might give a relatively good balance):
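The trade-off curves can be produced with scikit-learn as well, again treating the Free Energy as the score:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(test_y, free_energies)
# precision and recall have one more element than thresholds, so drop the last point
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('Free Energy threshold')
plt.legend()
plt.show()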

[Figure: precision and recall on the validation set as a function of the Free Energy threshold]

5) A further interpretation of the validation AUC score of 0.96

There’s another way to intuitively look at the AUC score:

The validation data’s fraud percentage is around 0.16%.

For example, if we take the top 500 transactions ranked by the model’s Free Energy predictions, the proportion of fraud among them is around 11.82%.

So, within the top 500, precision increases from the 0.16% base rate to 11.82%.
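This "precision in the top k" view is easy to reproduce (a small sketch, reusing free_energies and test_y from above):

import numpy as np

k = 500
top_k = np.argsort(free_energies)[::-1][:k]   # indices of the k highest Free Energies
precision_at_k = test_y[top_k].mean()
print('Fraud rate in top %d: %.2f%% (base rate: %.2f%%)'
      % (k, 100 * precision_at_k, 100 * test_y.mean()))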

Again, please see GitHub for the full code and notebook.

 


Exercises: 

Try using the Reconstruction Error in place of the Free Energy as the fraud score, and compare the result with using the Free Energy. This is similar to what I did previously in the autoencoder tutorial.
