### GitHub scripts

The IPython notebook has been uploaded to GitHub. Feel free to jump there directly if you want to skip the explanations.

## Introduction

In this post, we will explore a data set of credit card transactions and try to build an **unsupervised** machine learning model that can tell whether a particular transaction is fraudulent or genuine.

By unsupervised, I mean the model will be trained on **both** positive and negative data (i.e. frauds and non-frauds) **without** being given the labels. Since there are far more normal transactions than fraudulent ones, we expect the model to learn the patterns of the normal ones during training, and so to be able to score any transaction by how much of an outlier it is. This kind of unsupervised training is useful in practice, especially when we do not have enough labelled data.

I will start with some data exploration, then walk you through the basics of the auto-encoder, and finally cover the implementation in TensorFlow.

The data set comes from Kaggle: https://www.kaggle.com/dalpozz/creditcardfraud.

## The data

### 1. A quick look

There are quite a few online scripts on Kaggle that have done a great job exploring and visualising the data set, including detailed histogram plots for each feature, so I won’t go into too much detail here. Please feel free to explore those if you want to know more about the data.

In summary, the data set has **284,807** transactions over a 48-hour period in September 2013. Of these, **0.17%** are fraudulent and the rest are genuine, which makes the data set highly **unbalanced**. All of the data are **labelled**.

We are challenged to build a model that predicts whether a transaction is fraudulent, given the features. There are 29 features in total, 28 of which have been **PCA transformed** for confidentiality reasons, which means they are not in their original form and we cannot do much feature engineering on them. The remaining feature is the transaction amount.

Here’s a quick look at the top rows:

To keep our task simple, we are **not using** the last feature column, ‘**Amount**’, as a feature. (It is not shown in the preview above.)

### 2. Train/test split

I split the data based on the Time column, using earlier data for training and later data for testing.

Since the data are very unbalanced, I use the first 75% as training and validation data and the last 25% as test data. This ensures the test set does not contain too few positive cases (which could happen with a 90/10 split, for example).

### 3. Data standardization and activation functions

I have considered two types of standardization here: **z-score** and **min-max scaling**.

The first normalizes each column to have a mean of zero and a standard deviation of one, which is a good choice if we use an output function such as tanh that outputs values on both sides of zero. It also lets extreme values remain extreme after normalization (e.g. more than 2 standard deviations from the mean), which may be useful for detecting outliers here.

The second, min-max scaling, ensures all values lie within 0 to 1, so none are negative. This is the default approach if we use sigmoid as the output activation.

Just to recap the differences between sigmoid and tanh below (sigmoid will squash the values into range between (0, 1); whereas tanh, or hyperbolic tangent, squash them into (-1, 1)):
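To make the comparison concrete, here is a minimal NumPy sketch of the two scalings and the two activations. The column values are made up for illustration, and the helper names `z_score`, `min_max` and `sigmoid` are mine, not from the notebook.

```python
import numpy as np

def z_score(train_col, col):
    """Standardize col with the mean and std of the training column."""
    return (col - train_col.mean()) / train_col.std()

def min_max(train_col, col):
    """Rescale col to [0, 1] using the training column's min and max."""
    lo, hi = train_col.min(), train_col.max()
    return (col - lo) / (hi - lo)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up feature column with one extreme value (illustration only)
col = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

z = z_score(col, col)
m = min_max(col, col)

print("z-score :", np.round(z, 3))   # centred on 0, outlier stays far from 0
print("min-max :", np.round(m, 3))   # everything squeezed into [0, 1]
print("tanh range    :", np.tanh(z).min(), np.tanh(z).max())    # inside (-1, 1)
print("sigmoid range :", sigmoid(z).min(), sigmoid(z).max())    # inside (0, 1)
```

With min-max scaling, a single extreme value compresses all the other values towards 0, whereas the z-score keeps them spread around zero.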

I used the validation set to choose the standardization approach as well as the activation function. In my experiments, tanh performed better than sigmoid when used together with z-score normalization. Therefore, I chose **z-score** normalization with **tanh** activations.

To summarize the steps in code, we have the data preparation as below:

    # Assumes df is the Kaggle CSV loaded as a pandas DataFrame
    # (columns: Time, V1..V28, Amount, Class).
    TEST_RATIO = 0.25
    df.sort_values('Time', inplace=True)
    TRA_INDEX = int((1 - TEST_RATIO) * df.shape[0])

    # Columns 1:-2 are V1..V28: this drops Time (first) and Amount, Class (last two)
    train_x = df.iloc[:TRA_INDEX, 1:-2].values
    train_y = df.iloc[:TRA_INDEX, -1].values

    test_x = df.iloc[TRA_INDEX:, 1:-2].values
    test_y = df.iloc[TRA_INDEX:, -1].values

    # z-score normalization, using training-set statistics for both sets
    cols_mean = []
    cols_std = []
    for c in range(train_x.shape[1]):
        cols_mean.append(train_x[:, c].mean())
        cols_std.append(train_x[:, c].std())
        train_x[:, c] = (train_x[:, c] - cols_mean[-1]) / cols_std[-1]
        test_x[:, c] = (test_x[:, c] - cols_mean[-1]) / cols_std[-1]

Here, **train_x** and **test_x** are the data we will feed into the model later.

## Auto-encoder

### What is an auto-encoder?

An auto-encoder is a type of neural network that approximates the identity function f(x) = x. Given an input x, the network learns to output an f(x) that is as close as possible to x. The error between the output and x is commonly measured with the mean squared error (MSE), mean((f(x) - x)^2), which is the loss function we minimise in the network. (The root mean squared error, RMSE, is just its square root and ranks transactions in the same order.)

An auto-encoder looks like the one below. It follows a typical feed-forward neural network architecture, except that the output layer has exactly the same number of neurons as the input layer, and it uses the input data itself as its target. It therefore works as unsupervised learning: it learns without predicting an actual label.

The lower part of the network shown below is usually called the ‘encoder’, whose job is to ‘embed’ the input data into a lower-dimensional array. The upper part, the ‘decoder’, tries to decode the embedding back into the original input.

We can have either one hidden layer, or in the case below, have multiple layers depending on the complexity of our features.

### Auto-encoder for anomaly detection

We rely on the auto-encoder to ‘learn’ and ‘memorize’ the common patterns shared by the majority of the training data. During reconstruction, the error (the MSE between input and output) will be high for data that do not conform to those patterns. These are the ‘anomalies’ we are detecting, and hopefully they coincide with the ‘fraudulent’ transactions we are after.

**During prediction**, we have two options:

1. Select a threshold for the reconstruction error (MSE) based on validation data, and flag every transaction whose error is above the threshold as fraudulent.
2. Alternatively, if we believe 0.1% of all transactions are fraudulent, rank the transactions by their reconstruction error and select the top 0.1% as frauds.
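Both flagging strategies can be sketched in a few lines of NumPy. The scores below are random stand-ins for per-transaction reconstruction errors, the function names are hypothetical, and choosing the threshold as a quantile is only one possible way to pick it from validation data.

```python
import numpy as np

def flag_by_threshold(scores, threshold):
    """Option 1: flag every row whose reconstruction error exceeds the threshold."""
    return scores > threshold

def flag_top_fraction(scores, fraction):
    """Option 2: flag the highest-scoring `fraction` of rows (at least one row)."""
    k = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(scores)[::-1][:k]
    flags = np.zeros(len(scores), dtype=bool)
    flags[top_idx] = True
    return flags

# Random stand-in for reconstruction errors (illustration only)
rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=1000)

# Assumption: the threshold is taken as the 99.9th percentile of the scores
threshold = np.quantile(scores, 0.999)
by_threshold = flag_by_threshold(scores, threshold)
by_rank = flag_top_fraction(scores, 0.001)

print("flagged by threshold:", by_threshold.sum())
print("flagged by rank     :", by_rank.sum())
```

The threshold version flags a variable number of transactions per batch, while the ranking version always flags a fixed share, which may matter if the review team has a fixed capacity.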

**Evaluation metric**: we will evaluate our model’s performance using the AUC score on the test data set.
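For readers who want to see what the AUC measures, here is a small sketch that computes it directly from its definition: the probability that a randomly chosen fraud gets a higher score than a randomly chosen genuine transaction. The function name `auc_pairwise` and the toy labels and scores are mine; it compares every positive with every negative, so it is meant for illustration rather than for large data sets.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the share of (fraud, genuine) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy example: 2 frauds, 3 genuine transactions
y = [0, 0, 1, 0, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.9]
print(auc_pairwise(y, s))   # 4 of the 6 pairs are ranked correctly
```

Because the AUC depends only on the ordering of the scores, it can be computed directly on the reconstruction errors without choosing a threshold first.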

## Build the model in tensorflow

### 1. Build the graph

*Scripts modified from the TensorFlow tutorial on GitHub.*

    import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

    # Parameters
    learning_rate = 0.01
    training_epochs = 6
    batch_size = 256

    # Network parameters
    n_hidden_1 = 15               # 1st layer num features
    n_hidden_2 = 5                # 2nd layer num features
    n_input = train_x.shape[1]    # 28 here as our feature dimension

    X = tf.placeholder("float", [None, n_input])

    weights = {
        'encoder_h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
        'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
        'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_1])),
        'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
    }
    biases = {
        'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
        'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),
        'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
        'decoder_b2': tf.Variable(tf.random_normal([n_input])),
    }

    # Building the encoder
    def encoder(x):
        # Encoder hidden layer #1 with tanh activation
        layer_1 = tf.nn.tanh(tf.add(tf.matmul(x, weights['encoder_h1']),
                                    biases['encoder_b1']))
        # Encoder hidden layer #2 with tanh activation (the embedding)
        layer_2 = tf.nn.tanh(tf.add(tf.matmul(layer_1, weights['encoder_h2']),
                                    biases['encoder_b2']))
        return layer_2

    # Building the decoder
    def decoder(x):
        # Decoder hidden layer #1 with tanh activation
        layer_1 = tf.nn.tanh(tf.add(tf.matmul(x, weights['decoder_h1']),
                                    biases['decoder_b1']))
        # Decoder output layer with tanh activation
        layer_2 = tf.nn.tanh(tf.add(tf.matmul(layer_1, weights['decoder_h2']),
                                    biases['decoder_b2']))
        return layer_2

    # Construct model
    encoder_op = encoder(X)
    decoder_op = decoder(encoder_op)

    # Prediction
    y_pred = decoder_op
    # Targets (labels) are the input data
    y_true = X

    # Per-row MSE: one reconstruction error per input row
    batch_mse = tf.reduce_mean(tf.pow(y_true - y_pred, 2), 1)
    # Raw error layer
    batch_error_layer = y_true - y_pred

    # Define loss and optimizer, minimize the squared error
    cost = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
    optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)

*X* is the placeholder for our input data. *Weights* and *biases* (the *W*s and *b*s of neural networks) contain all parameters of the network that we will learn to optimise. Since first and second layers contain 15 and 5 neurons respectively, we are building a network of such architecture: 28(input) -> 15 -> 5 -> 15 -> 28(output).
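To see the 28 -> 15 -> 5 -> 15 -> 28 shape flow without running TensorFlow, here is a NumPy-only sketch of the same forward pass. The weights are random and untrained, and `forward` and `layer_sizes` are names I introduce for illustration; it only demonstrates the shapes and the parameter count, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [28, 15, 5, 15, 28]

# One (weights, bias) pair per layer transition, randomly initialised
params = [(rng.normal(size=(n_in, n_out)), rng.normal(size=n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Apply tanh(x @ W + b) layer by layer and keep every activation."""
    activations = [x]
    for W, b in params:
        activations.append(np.tanh(activations[-1] @ W + b))
    return activations

x = rng.normal(size=(4, 28))          # a made-up batch of 4 transactions
acts = forward(x, params)
n_params = sum(W.size + b.size for W, b in params)

print([a.shape for a in acts])        # input, two encoder layers, two decoder layers
print("trainable parameters:", n_params)
```

The third activation (width 5) is the embedding. Note that because the last layer is also tanh, every reconstructed value stays inside (-1, 1), while z-scored inputs can lie outside that range.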

The activation function used for each layer is *tanh*, as explained earlier. The objective function, the *cost* above, measures the mean squared error between the predicted and input arrays over **one batch**, which means it is **a scalar**. We then run the *optimizer* every time we want to do a batch update.

However, we also have *batch_mse*, which returns the MSE for **each row in a batch**; it is **a vector** whose length equals the number of rows in the input data. These are the predicted values, or fraud scores if you like, for the input (be it training, validation or test data), which we can extract after prediction.
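The difference between the scalar *cost* and the per-row *batch_mse* comes down to which axes are averaged. A minimal NumPy sketch, with made-up arrays standing in for the input and its reconstruction:

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.normal(size=(6, 28))                          # 6 made-up input rows
y_pred = y_true + rng.normal(scale=0.1, size=(6, 28))      # noisy "reconstruction"

squared_error = (y_true - y_pred) ** 2

cost = squared_error.mean()             # average over rows and columns: a scalar
batch_mse = squared_error.mean(axis=1)  # average over columns only: one value per row

print(np.shape(cost), batch_mse.shape)
```

Since every row has the same number of columns, the mean of *batch_mse* equals *cost*; the vector simply keeps the per-transaction detail that the scalar averages away.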

### 2. Train the model

    import os
    from datetime import datetime
    import numpy as np
    # Assumption: the post does not show where auc comes from;
    # roc_auc_score matches the auc(labels, scores) call used below.
    from sklearn.metrics import roc_auc_score as auc

    display_step = 1  # assumed value; not defined in the post
    # data_dir (the output directory) is defined in the notebook

    ## TRAIN STARTS BELOW
    save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')
    saver = tf.train.Saver()

    # Initializing the variables
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        now = datetime.now()
        sess.run(init)
        total_batch = int(train_x.shape[0] / batch_size)
        # Training cycle
        for epoch in range(training_epochs):
            # Loop over all batches
            for i in range(total_batch):
                batch_idx = np.random.choice(train_x.shape[0], batch_size)
                batch_xs = train_x[batch_idx]
                # Run optimization op (backprop) and cost op (to get loss value)
                _, c = sess.run([optimizer, cost], feed_dict={X: batch_xs})

            # Display logs per epoch step
            if epoch % display_step == 0:
                train_batch_mse = sess.run(batch_mse, feed_dict={X: train_x})
                print("Epoch:", '%04d' % (epoch + 1),
                      "cost=", "{:.9f}".format(c),
                      "Train auc=", "{:.6f}".format(auc(train_y, train_batch_mse)),
                      "Time elapsed=", "{}".format(datetime.now() - now))

        print("Optimization Finished!")

        save_path = saver.save(sess, save_model)
        print("Model saved in file: %s" % save_path)

The training part above is straightforward. At each step we randomly sample a mini-batch of 256 rows from *train_x*, feed it into the model as X, and run the *optimizer* to update the parameters (RMSProp, as defined in the graph above).
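One detail of the sampling is worth spelling out: `np.random.choice(n, batch_size)` draws with replacement, so one "epoch" in the loop above does not visit every row exactly once. The sketch below, using made-up sizes, compares it with the common alternative of shuffling once per epoch; the alternative is my illustration, not what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, batch_size = 1000, 256          # made-up sizes for illustration
total_batch = n_rows // batch_size

# As in the post: each batch is an independent draw with replacement
seen = set()
for _ in range(total_batch):
    batch_idx = rng.choice(n_rows, batch_size)
    seen.update(batch_idx.tolist())
coverage_with_replacement = len(seen) / n_rows

# Alternative: shuffle once, then slice into non-overlapping batches
perm = rng.permutation(n_rows)
batches = [perm[i * batch_size:(i + 1) * batch_size] for i in range(total_batch)]
coverage_shuffled = len(set(np.concatenate(batches).tolist())) / n_rows

print("share of rows seen, with replacement:", coverage_with_replacement)
print("share of rows seen, shuffled        :", coverage_shuffled)
```

Either scheme works for stochastic training; sampling with replacement is simply a little noisier about which rows are seen in a given epoch.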

However, one thing worth highlighting here – we are using the **same data for training as well as for validation**! This is reflected in the line of:

    if epoch % display_step == 0:
        train_batch_mse = sess.run(batch_mse, feed_dict={X: train_x})
        print("Epoch:", '%04d' % (epoch + 1),
              "cost=", "{:.9f}".format(c),
              "Train auc=", "{:.6f}".format(auc(train_y, train_batch_mse)),
              "Time elapsed=", "{}".format(datetime.now() - now))

This may seem counter-intuitive at first, but since training is unsupervised and the model never sees the labels, scoring the training data does not lead to overfitting on the labels. This validation step is used to monitor for early stopping and to tune hyper-parameters. The AUC score obtained on *train_x* is around **0.95**.

Finally, once we have settled on the model and hyper-parameters, we can evaluate its performance on the separate *test_x* data set, as in the code below (*test_batch_mse* holds the fraud scores for the test data):

    save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')
    saver = tf.train.Saver()

    # Initializing the variables
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        now = datetime.now()
        # Restore the trained weights instead of running init
        saver.restore(sess, save_model)

        test_batch_mse = sess.run(batch_mse, feed_dict={X: test_x})

        print("Test auc score: {}".format(auc(test_y, test_batch_mse)))

And yes, we obtained a test AUC of around **0.944**.

### 3. (What’s more) Using auto-encoder as a data pre-processing step

We have now covered all the steps needed to train an auto-encoder and make predictions on test data. If you would like to go further, the next thing I will do is ‘pre-train’ on our data set from scratch using the auto-encoder, extract the embedding layer, and feed those embeddings to a fully connected (FC) feed-forward neural network that performs binary classification.

The rationale is simple: since our auto-encoder is able to differentiate between frauds and non-frauds, the lower-dimensional features it has derived during training (the embedding layer) should include some useful latent features that help with fraud classification. Or at least, they should speed up the classifier’s learning, compared to letting it adapt to the raw features from scratch.

First, let’s extract the embedding layer for the whole data set. It is the *encoder_op* op, which we can evaluate with sess.run.

    save_model = os.path.join(data_dir, 'temp_saved_model_1layer.ckpt')
    saver = tf.train.Saver()

    # Initializing the variables
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        now = datetime.now()
        saver.restore(sess, save_model)

        test_encoding = sess.run(encoder_op, feed_dict={X: test_x})
        train_encoding = sess.run(encoder_op, feed_dict={X: train_x})

        print("Dim for test_encoding and train_encoding are: \n",
              test_encoding.shape, '\n', train_encoding.shape)

Second, we build the graph for FC feed-forward neural network as follows (again, you could use validation to fine tune hyper-parameters such as hidden layer numbers and sizes):

    n_input = test_encoding.shape[1]

    hidden_size = 4
    output_size = 2

    X = tf.placeholder(tf.float32, [None, n_input], name='input_x')
    y_ = tf.placeholder(tf.int32, shape=[None, output_size], name='target_y')

    weights = {
        'W1': tf.Variable(tf.truncated_normal([n_input, hidden_size])),
        'W2': tf.Variable(tf.truncated_normal([hidden_size, output_size])),
    }
    biases = {
        'b1': tf.Variable(tf.zeros([hidden_size])),
        'b2': tf.Variable(tf.zeros([output_size])),
    }

    hidden_layer = tf.nn.relu(tf.add(tf.matmul(X, weights['W1']), biases['b1']))
    pred_logits = tf.add(tf.matmul(hidden_layer, weights['W2']), biases['b2'])
    pred_probs = tf.nn.softmax(pred_logits)

    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=pred_logits))

    optimizer = tf.train.AdamOptimizer(2e-4).minimize(cross_entropy)

Then, we will prepare our data set, and train the model while monitoring its validation scores. Here, we will further split our *train_encoding* into *train_enc_x* and *val_enc_x*, taking up **80%** and **20%** of the previous *train_encoding*, respectively. As in a typical supervised training approach, we will use *train_enc_x* as our training data and *val_enc_x* as validation.

    n_epochs = 70
    batch_size = 256

    # PREPARE DATA
    VAL_PERC = 0.2
    # One-hot labels: column 0 = genuine, column 1 = fraud
    all_y_bin = np.zeros((df.shape[0], 2))
    all_y_bin[range(df.shape[0]), df['Class'].values] = 1

    train_enc_x = train_encoding[:int(train_encoding.shape[0] * (1 - VAL_PERC))]
    train_enc_y = all_y_bin[:int(train_encoding.shape[0] * (1 - VAL_PERC))]

    val_enc_x = train_encoding[int(train_encoding.shape[0] * (1 - VAL_PERC)):]
    val_enc_y = all_y_bin[int(train_encoding.shape[0] * (1 - VAL_PERC)):train_encoding.shape[0]]

    test_enc_y = all_y_bin[train_encoding.shape[0]:]
    print("Num of data for train, val and test are: \n{}, \n{}, \n{}".format(
        train_enc_x.shape[0], val_enc_x.shape[0], test_encoding.shape[0]))

    # TRAIN STARTS
    save_model = os.path.join(data_dir, 'temp_saved_model_FCLayers.ckpt')
    saver = tf.train.Saver()

    # Initializing the variables
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        now = datetime.now()
        sess.run(init)
        total_batch = int(train_enc_x.shape[0] / batch_size)
        # Training cycle
        for epoch in range(n_epochs):
            # Loop over all batches
            for i in range(total_batch):
                batch_idx = np.random.choice(train_enc_x.shape[0], batch_size)
                batch_xs = train_enc_x[batch_idx]
                batch_ys = train_enc_y[batch_idx]
                # Run optimization op (backprop) and cost op (to get loss value)
                _, c = sess.run([optimizer, cross_entropy],
                                feed_dict={X: batch_xs, y_: batch_ys})

            # Display logs per epoch step
            if epoch % display_step == 0:
                val_probs = sess.run(pred_probs, feed_dict={X: val_enc_x})
                print("Epoch:", '%04d' % (epoch + 1),
                      "cost=", "{:.9f}".format(c),
                      "Val auc=", "{:.6f}".format(auc(val_enc_y[:, 1], val_probs[:, 1])),
                      "Time elapsed=", "{}".format(datetime.now() - now))

        print("Optimization Finished!")

        save_path = saver.save(sess, save_model)
        print("Model saved in file: %s" % save_path)

And evaluate on the same test set as previously:

    save_model = os.path.join(data_dir, 'temp_saved_model_FCLayers.ckpt')
    saver = tf.train.Saver()

    # Initializing the variables
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        now = datetime.now()
        saver.restore(sess, save_model)

        test_probs = sess.run(pred_probs, feed_dict={X: test_encoding})

        print("Test auc score: {}".format(auc(test_enc_y[:, 1], test_probs[:, 1])))

Our test AUC score improved further, to **0.9556**!

Please refer to the Jupyter notebook for the detailed version of the code.

This is awesome, thank you!

Please could you provide an example of making a single prediction?

So for example, if I were to send in a single credit card transaction, how would I obtain the reconstruction error for one sample?

Many thanks!


Hello,

That is actually the same as what I did to predict on test_x above; your test_x is now just a data set with a sample size of 1.

So you need to pre-process your single transaction (using z-score), and then use something like below to predict:

    save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')
    saver = tf.train.Saver()

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        saver.restore(sess, save_model)
        test_batch_mse = sess.run(batch_mse, feed_dict={X: test_x})
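The pre-processing step for a single transaction could look like the sketch below. It reuses the per-column training statistics (`cols_mean` and `cols_std`, as computed in the data preparation code) and reshapes the row to (1, 28), since the placeholder X expects a 2-D array. The helper name `prepare_single` and the random data are hypothetical stand-ins.

```python
import numpy as np

def prepare_single(raw_row, cols_mean, cols_std):
    """z-score one raw transaction with the training statistics; return shape (1, n_features)."""
    row = (np.asarray(raw_row, dtype=float) - np.asarray(cols_mean)) / np.asarray(cols_std)
    return row.reshape(1, -1)

# Made-up stand-in for the raw (unnormalized) training features
rng = np.random.default_rng(0)
train_raw = rng.normal(loc=3.0, scale=2.0, size=(500, 28))
cols_mean = train_raw.mean(axis=0)
cols_std = train_raw.std(axis=0)

single_x = prepare_single(train_raw[0], cols_mean, cols_std)
print(single_x.shape)   # one row, 28 features
```

Passing `single_x` in place of `test_x` in the `feed_dict` above should then give a `batch_mse` of length 1, which is the reconstruction error for that one transaction.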


Thanks for your help, super useful!


Great post. Very helpful. Was wondering if you can help clarify something. Can I use high MSE values from the auto-encoder validation step as indicators for anomalies? If yes, can you help explain why the data with high MSE value do not seem to correspond to the actual fraudulent entries in the test dataset?

David


Thanks for your question!

Yes, you can, and that’s exactly what I did, as follows:

    batch_mse = tf.reduce_mean(tf.pow(y_true - y_pred, 2), 1)

The batch_mse holds the MSE for each test row, with the same length as your input, and can be used as an indicator of anomalies.

As for your second question, the reason is that the percentage of fraud in the test set is extremely small, around 0.1%. An AUC score of 0.95 simply means that if you sort the test data by batch_mse in descending order, you would probably see around 10% of the top 500 cases being fraud, which is a large increase over the sample mean of 0.1%.

For more details, see the analysis part at the bottom of my Credit Card Fraud Detection 2 post:
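The "share of frauds among the top-ranked cases" idea can be checked with a few lines of NumPy. The data below are synthetic (20 frauds among 10,000 rows, with frauds scoring higher on average), chosen only to illustrate the calculation; they are not results from the model, and `precision_at_k` is a name I introduce here.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Share of true frauds among the k rows with the highest scores."""
    order = np.argsort(scores)[::-1][:k]
    return np.asarray(y_true)[order].mean()

# Synthetic labels and scores (illustration only)
rng = np.random.default_rng(0)
n, n_fraud = 10000, 20
y = np.zeros(n, dtype=int)
y[:n_fraud] = 1
scores = rng.normal(size=n)
scores[:n_fraud] += 3.0            # frauds get higher scores on average

base_rate = y.mean()
p_at_100 = precision_at_k(y, scores, 100)

print("base fraud rate   :", base_rate)
print("precision in top k:", p_at_100)
print("lift over base    :", p_at_100 / base_rate)
```

Even when most of the top-ranked rows are genuine, the fraud rate among them can be many times the base rate, which is what a high AUC reflects on such unbalanced data.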

https://weiminwang.blog/2017/08/05/credit-card-fraud-detection-2-using-restricted-boltzmann-machine-in-tensorflow/


Has the model converged?


Hi Weimin,

When I ran your code I got an error at the last step where you evaluate the AUC on the test set for the supervised algorithm. This is the error message I got:

NotFoundError (see above for traceback): Key Variable_30 not found in checkpoint

[[Node: save_41/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device=”/job:localhost/replica:0/task:0/device:CPU:0″](_arg_save_41/Const_0_0, save_41/RestoreV2/tensor_names, save_41/RestoreV2/shape_and_slices)]]

Do you know what this means?


It means that when you tried to load the model after training, the graph had probably been modified so that Variable_30 is a new variable that was not there (or was not named that way) when the checkpoint was saved during your previous training.


Hey mate, I think you split the train data incorrectly. Shouldn’t you split the train data to consist of “NON-Fraud” data only first? I mean from what I’ve understood that is how you train Autoencoders right.


This is not a mistake; it is on purpose. I did so to show that even if we have some fraud data in our training set, we can still train a decent auto-encoder to differentiate fraud from non-fraud in the test set. In reality, we often can’t get pure non-fraud training data in the first place.
