In my previous post, I have demo-ed how to use Autoencoder for credit card fraud detection and achieved an AUC score of 0.94. This time, I will be exploring another model – Restricted Boltzmann Machine – as well as its detailed implementation and results in tensorflow.

RBM was one of the earliest models introduced in the world of deep learning. There have been many successful use cases of RBM in areas such as dimensionality reduction, classification, collaborative filtering, feature learning as well as anomaly detection.

In this tutorial, we will also use the **unsupervised way** – without feeding labels to model – to train on our data, and will achieve a slightly improved result over auto-encoder!

The whole post is divided into three parts below:

**Introduction to RBM****Implementation in TensorFlow****Results and interpretation**

All codes can be found in github.

### Overview of the data set

I will be using the exact same credit card data here. So please go to my previous post if you would like to see more about it. Also, you can download the data from Kaggle here if you want.

## RBM – a quick introduction

There are many good online resources that offer either brief or **in-depth** explanation of RBM:

- https://www.youtube.com/watch?v=FJ0z3Ubagt4 & https://www.youtube.com/watch?v=p4Vh_zMw-HQ – Best youtube explanations to RBM in my opinion.
- https://deeplearning4j.org/restrictedboltzmannmachine – Gives a very intuitive and easy-to-understand ways of RBM.
- http://deeplearning.net/tutorial/rbm.html – Theories + Theano implementation

Basically, a RBM is network consists of two layers – the Visible Layer and Hidden layer. There are symmetric connections between any pair of nodes from Visible and Hidden layers, and no connections within each layer.

For majority cases, both hidden and visible layers are binary-valued. There are also extensions with visible layer being **Gaussian**, and hidden layer being **Bernoulli**. The latter will be our case for fraud detection. (Since our input data will be normalized into mean of 0 and std of 1)

When signals propagate from visible to hidden, the input layer (i.e. the data sample) will be multiplied by the matrix ** W**, added with bias vector

**of hidden layer, and finally go through the sigmoid function to be squashed to be within 0 and 1, which are also the probabilities for each hidden neuron to be on.**

*b***However, it is very important to keep the hidden states binary (0 or 1 for each), rather than using the probabilities itself*. And only during the last update of the gibbs sampling should we use probabilities for hidden layer,**which we will talk about later on.

During backward pass, or reconstruction, the hidden layer activation will become the input, which is multiplied by the **same matrix W**, added with visible

**biases**, and then will either go through the

**sigmoid**function (for Bernoulli visible), or being sampled from a multivariate

**Gaussian**distribution (for Gaussian visible), as below:

Intuitively, we can understand that the model is adjusting its weights, during training, such that it could best approximate the **training data distribution *** p *with its

**reconstruction distribution**, as below:

*q*We first define our so-called **Energy Function E(x, h), **as well as the

**joint probability**given any pair of visible and hidden layers, as below:

*p(x, h)*,In the above equations,** x** is the visible layer,

**is the hidden layer activations,**

*h***and**

*W, b***are the matrix, hidden bias and visible bias, respectively.**

*c*So basically, for any pair of ** x** and

**, we are able to calculate**

*h***E(**. Its value is a scalar. And the higher the value of Energy, the lower the

*x*,*h*)*.*

**p(x, h)**### 1) How to detect fraud with RBM?

From the energy function, we are able to derive the equation below for so-called **Free Energy of x**:

This Free Energy, * F(x)*, which is also a scalar, is exactly what we will need for

**test data**, from which we will use the distribution to detect anomalies.

**The higher the free energy, the higher the chance of x being a fraud.**

### 2) How to update model parameters?

To update our parameters * W, b* and

*, we use below equations combined with SGD (the alpha is learning rate):*

**c**The left part, , is easy to calculate, as ** x^{(t)}** is just the training sample, and

*is simply below:*

**h(x**^{(t)})So the outcome is simply a vector multiplication which gives **same shape** as * W*.

Okay, that’s easy! But how to calculate ? The answer is to use **gibbs sampling**:

We start with a training sample * x(t)*, and sample each

**by first calculating:**

*h*_{j}Then we draw a value from a **uniform distribution [0, 1]**, and if the drawn value is smaller than the one calculated above, we assign 1 to* h_{j}*, otherwise we assign 0. And then we do this for each

*h*_{j }Next, we sample visible layer ** x_{k, }**by using previously sampled

**hidden layer as input**, with a similar equation below:

Here, we directly use the results from **sigmoid** , **without** **sampling** it into states of 0 and 1, to get visible layer.

However, we will do this step slightly different if the input data, or visible layer **x, **is a **Gaussian distribution**. If Gaussian, we will sample the * x* vector using

**Gaussian**with

**mean**

*as well as identity covariance matrix. This part is fully implemented in the code where you can check for verification (under equation*

**μ = c + W**^{T}h**).**

*sample_visible_from_hidden*We will do this k steps, which is referred to as **Contrastive Divergence**, or **CD _{k. }**

After last step k, we will use the sampled visible layer as , together with the **last hidden probabilities **as **. **Please note that what we are using here is the **probabilities**,** not the sampled 0 and 1 states, for . **

In summary, the whole process looks like this:

- Start with training sample x –
- Sample h from input x –
- Sample x from h
- Sample h from x
- …
- Sample x from h –
- Sample h from x –

In practice, using k = 1 can give a good result.

Until here, we have covered the whole process of model updating.

### 3) How to tune hyper-parameters?

We will stick to data-driven approach.

Split the data into **training** and **validation** sets, and train the model on training set, while evaluate the performance on validation.

Start the hidden layer with a smaller dimension than input layer(e.g. 5, 10), set the learning rate to be a small value (such as 0.001), monitor the validation data set **reconstruction error** (not the actual error against labels)

The reconstruction error is basically the **mean squared** of the difference between predicted and the actual data **x, averaged over the entire mini-batch. **

If the reconstruction error stops decreasing, that would be a sign for early-stopping.

However, a comprehensive guide to tuning RBM is fully covered in Geoffrey Hinton’s notes, which you are encouraged to take a look at.

**Coding the RBM**

The code was modified from here, which was an excellent implementation in TensorFlow. I only added in a few changes to its implementation:

- Implemented
**Momentum**for faster convergence. - Added in
**L2**regularisation. - Added in methods for retrieving
**Free Energy**as well as**Reconstruction Error**in validation data. - Simplified the code a bit by
**removing**parts of**tf****summaries**(originally not compatible with tf version 1.1 above) - Added in a bit utilities such as plotting training loss

Basically, the code is a **sklearn – style RBM class **that you can directly use to train and predict.

## Training and Results

### 1) Training and validation

We split our data **by transaction time** into training and validation by **50 – 50**, and train our model on the training set.

TEST_RATIO = 0.50 df.sort_values('Time', inplace = True) TRA_INDEX = int((1-TEST_RATIO) * df.shape[0]) train_x = df.iloc[:TRA_INDEX, 1:-2].values train_y = df.iloc[:TRA_INDEX, -1].values test_x = df.iloc[TRA_INDEX:, 1:-2].values test_y = df.iloc[TRA_INDEX:, -1].values

After we train the model, we will calculate the **Free Energy** of val set, and visualize the distributions for both fraud as well as non-fraud samples.

**2) Data pre-processing **

Since the data are already PCA transformed, we will only need to standardize them with **z-score** to get mean of 0 and standard deviation of 1, as below:

cols_mean = [] cols_std = [] for c in range(train_x.shape[1]): cols_mean.append(train_x[:,c].mean()) cols_std.append(train_x[:,c].std()) train_x[:, c] = (train_x[:, c] - cols_mean[-1]) / cols_std[-1] test_x[:, c] = (test_x[:, c] - cols_mean[-1]) / cols_std[-1]

Please note that you need to calculate statistics using** training set only**, as I did above, instead of on the full data set (training and val combined).

After that, we will fit the data using model with **Gaussian visible layer**. (This is the **Gaussian – Bernoulli RBM**, since the hidden layer is still binary-valued)

### 3) Visualization of results

It is clearly seen that fraud data has Free Energy **much more uniformly distributed** than non-fraud data.

If you calculate the AUC score (Area Under the ROC Curve) on val set, you will get a score around **0.96**!

### 4) Real time application

To enable it as **real time fraud detector**, we will need to find a threshold based on validation data set. This can be done by **trading off the precision & recall** curves as below (e.g. a value of 100 might give a relative good balance):

### 5) A further interpretation at the val AUC score of 0.96

There’s another way to intuitively look at the AUC score:

The val data’s fraud percentage is around 0.16%;

For example, if we choose top 500 transactions ranked by model’s free energy predictions, the number of fraud is around 11.82% …

So, precision increases from 0.16% to 11.82% in the top 500 …

Again, please find in github for details on code implementation and notebook.

## *References

- https://www.youtube.com/watch?v=FJ0z3Ubagt4
- https://www.youtube.com/watch?v=p4Vh_zMw-HQ
- https://deeplearning4j.org/restrictedboltzmannmachine
- http://deeplearning.net/tutorial/rbm.html
- https://gist.github.com/blackecho/db85fab069bd2d6fb3e7
- https://www.researchgate.net/publication/262277311_Network_anomaly_detection_with_the_restricted_Boltzmann_machine
- https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
- http://proceedings.mlr.press/v9/marlin10a/marlin10a.pdf
- https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

*Exercises: *

*Try to use Reconstruction Error in place of Free Energy as fraud score, and compare the result to using Free Energy. This is similar to what I did previously for autoencoder tutorial. *

The visualization shows that normal energy and fraud energry are different. However, it does not make sense in case unsupervised learning because you make the energy distribution graph by using label data column test_y ==1. So, If we don’t use test_y, how can we decide a cut off point of fraud and normal energy?

LikeLike

You are supposed to collect validation data for threshold selection. In reality, you can also work with your fraud ops team to create feedback loop for the task.

LikeLike

Thanks for the article and sample code.

Can I re-use your code?

I understand that your code was derived from Gabriele Angeletti’s

or

https://github.com/blackecho/Deep-Learning-TensorFlow

Can you confirm that it is also released under the MIT license?

LikeLike

sure. Feel free to reuse it

LikeLike

Why do you compute positive = tf.matmul(tf.transpose(visible), hidden_states) if visible_unit is binary. I think the right formula is positive=tf.matmul(tf.transpose(visible), hidden_probs).

LikeLike

Hey, do you know how you would code in Persistent Contrast Divergence instead of Contrast Divergence?

LikeLike

Just wondering – is the implementation of CD actually PCD ?

LikeLike

How do you calculate ROC or AUC when the test_cost only provides the free energy and not a 0 or 1 classification??

this is the exact code line

“fpr, tpr, _ = roc_curve(test_y, test_cost)”

where

test_cost = model.getFreeEnergy(test_x).reshape(-1)

can you explain to me in this case how do you calculate the FPR and such ??

LikeLike

To calculate AUC you only need scores / predicted probabilities (test_cost) so that you can rank the data. And you will need ground truth binary labels (0 or 1) which is provided in test_y. You can’t convert test_cost into 0 or 1 to calculate AUC, you need the raw scores. Please refer to online / Google for AUC calculation details.

LikeLike