Credit card fraud detection 1 – using auto-encoder in TensorFlow

Github scripts

The IPython notebook has been uploaded to GitHub – feel free to jump there directly if you want to skip the explanations.

Introduction

In this post, we will explore a data set of credit card transactions and try to build an unsupervised machine learning model that can tell whether a particular transaction is fraudulent or genuine.

By unsupervised, I mean the model will be trained on both positive and negative data (i.e. frauds and non-frauds) without being given the labels. Since we have far more normal transactions than fraudulent ones, we expect the model to learn and memorize the patterns of the normal ones during training, and then to score any transaction by how much of an outlier it is. This kind of unsupervised training is quite useful in practice, especially when we do not have enough labelled data.

I will start with some data exploration, then walk you through the basics of the auto-encoder, and finally talk about the implementation in TensorFlow.

The data set comes from Kaggle – https://www.kaggle.com/dalpozz/creditcardfraud.

The data

1. A quick look

There are quite a few scripts on Kaggle that have done a great job exploring and visualising the data set, including detailed histogram plots for each feature, so I won't go into too much detail here. Please feel free to explore those if you want to know more about the data.

In summary, the data set has 284,807 transactions over a 48-hour period in September 2013. Of these, 0.17% are fraudulent and the rest are genuine, which makes it highly unbalanced. All of the data are labelled.

We are challenged to build a model that predicts whether a transaction is fraudulent, given the features. There are 29 features in total: 28 of them have been PCA-transformed for confidentiality reasons, which means they are not in their original form and we cannot do much feature engineering on them; the remaining one is the transaction amount.

Here’s a quick look at the top rows:

[Figure: the first few rows of the data set]

To keep the task simple, we will not use the 'Amount' column as a feature. (It is not shown in the figure above.)

2. Train/test split

I split the data based on the Time column: earlier transactions are used for training and later ones for testing.

Since the data is very unbalanced, I use the first 75% as training and validation data and the last 25% as test data, based on the Time column. This ensures the test set does not end up with too few positive cases (which could happen with, say, a 90/10 split).

3. Data standardization and activation functions

I have considered two types of standardization here – z-score and min-max scaling.

The first will normalize each column to zero mean and unit standard deviation, which is a good choice if we use an output activation like tanh that produces values on both sides of zero. It also leaves extreme values with some of their extremeness after normalization (e.g. still more than 2 standard deviations from the mean), which can be useful for detecting outliers.

The second, min-max scaling, squashes all values into the range [0, 1] – all positive. This is the natural choice if we use sigmoid as the output activation.
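To make the difference concrete, here is a small NumPy sketch of the two scalings on a toy column (the numbers are made up for illustration): z-score keeps the outlier extreme, while min-max squashes everything into [0, 1].

```python
import numpy as np

# A toy feature column with one extreme value (an outlier).
col = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 12.0])

# z-score: zero mean, unit standard deviation; the outlier stays extreme.
z = (col - col.mean()) / col.std()

# min-max: squashes everything into [0, 1]; the outlier loses its extremeness.
mm = (col - col.min()) / (col.max() - col.min())

print(z.round(3))   # the outlier is still > 2 standard deviations from the mean
print(mm.round(3))  # all values now sit inside [0, 1]
```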

To recap the difference between sigmoid and tanh: sigmoid squashes values into the range (0, 1), whereas tanh, the hyperbolic tangent, squashes them into (-1, 1):

[Figure: tanh and sigmoid activation curves]

I used the validation set to choose both the standardization approach and the activation function. Based on my experiments, tanh performed better than sigmoid when combined with z-score normalization, so I chose z-score normalization with tanh activations.

To summarize the steps in code, we have the data preparation as below:

TEST_RATIO = 0.25
df.sort_values('Time', inplace=True)
TRA_INDEX = int((1 - TEST_RATIO) * df.shape[0])
train_x = df.iloc[:TRA_INDEX, 1:-2].values
train_y = df.iloc[:TRA_INDEX, -1].values

test_x = df.iloc[TRA_INDEX:, 1:-2].values
test_y = df.iloc[TRA_INDEX:, -1].values

# z-score normalization
cols_mean = []
cols_std = []
for c in range(train_x.shape[1]):
    cols_mean.append(train_x[:,c].mean())
    cols_std.append(train_x[:,c].std())
    train_x[:, c] = (train_x[:, c] - cols_mean[-1]) / cols_std[-1]
    test_x[:, c] = (test_x[:, c] - cols_mean[-1]) / cols_std[-1]

The resulting train_x and test_x are the arrays we will feed into the model later.

Auto-encoder

What is an auto-encoder?

An auto-encoder is a type of neural network that approximates the function f(x) = x: given an input x, the network learns to output f(x) as close to x as possible. The error between the output and x is commonly measured by the mean squared error (MSE) – mean((f(x) - x)^2) – which is the loss function we minimise in our network.

An auto-encoder looks like the one below. It follows a typical feed-forward architecture, except that the output layer has exactly the same number of neurons as the input layer, and it uses the input data itself as the target. It is therefore a form of unsupervised learning – learning without an actual label to predict.

The lower part of the network shown below is usually called the 'encoder', whose job is to embed the input data into a lower-dimensional array. The upper part, the 'decoder', tries to reconstruct the original input from that embedding.

We can use a single hidden layer or, as in the figure below, multiple layers, depending on the complexity of our features.

[Figure: a stacked auto-encoder]

Auto-encoder for anomaly detection.

We rely on the auto-encoder to 'learn' and 'memorize' the common patterns shared by the majority of the training data. During reconstruction, the MSE will be high for data points that do not conform to those patterns – these are the 'anomalies' we are detecting. Hopefully, these anomalies largely coincide with the fraudulent transactions we are after.

During prediction we have two options: 1) select a threshold for the reconstruction MSE based on validation data, and flag all transactions above the threshold as fraudulent; or 2) if we believe about 0.1% of all transactions are fraudulent, rank the transactions by reconstruction error and label the top 0.1% as frauds.
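Both options can be sketched with NumPy (the array `mse`, the threshold, and the assumed fraud rate below are illustrative, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative reconstruction errors: 995 "normal" rows with small MSE,
# 5 "anomalous" rows with large MSE.
mse = np.concatenate([rng.uniform(0.0, 1.0, 995), rng.uniform(5.0, 9.0, 5)])

# Option 1: flag everything above a threshold chosen on validation data.
threshold = 3.0  # hypothetical value picked by inspecting validation scores
flagged = np.where(mse > threshold)[0]

# Option 2: believe ~0.5% of transactions are fraudulent and take the top scores.
k = int(len(mse) * 0.005)
top_k = np.argsort(mse)[::-1][:k]

print(len(flagged), sorted(top_k))  # both rules pick out the same 5 rows here
```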

Evaluation metric – We will evaluate our model’s performance using AUC score on test data set.
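The `auc` helper called in the training logs later in this post is not shown in the snippets; a plausible definition – assuming it wraps scikit-learn's ROC AUC score – is:

```python
# Assumed helper: wraps scikit-learn's ROC AUC score.
from sklearn.metrics import roc_auc_score

def auc(y_true, y_score):
    # ROC AUC: probability that a randomly chosen positive (fraud)
    # scores higher than a randomly chosen negative (non-fraud).
    return roc_auc_score(y_true, y_score)
```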

Build the model in TensorFlow

1. Build the graph

The scripts below are modified from the TensorFlow tutorial on GitHub.

import os
from datetime import datetime

import numpy as np
import tensorflow as tf

# Parameters
learning_rate = 0.01
training_epochs = 6
batch_size = 256
display_step = 1  # log every epoch

# Network Parameters
n_hidden_1 = 15 # 1st layer num features
n_hidden_2 = 5 # 2nd layer num features
n_input = train_x.shape[1] # 28 here as our feature dimension

X = tf.placeholder("float", [None, n_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_1])),
    'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
}
biases = {
    'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'decoder_b2': tf.Variable(tf.random_normal([n_input])),
}

# Building the encoder
def encoder(x):
    # Encoder hidden layer #1 with tanh activation
    layer_1 = tf.nn.tanh(tf.add(tf.matmul(x, weights['encoder_h1']),
                                biases['encoder_b1']))
    # Encoder hidden layer #2 with tanh activation
    layer_2 = tf.nn.tanh(tf.add(tf.matmul(layer_1, weights['encoder_h2']),
                                biases['encoder_b2']))
    return layer_2

# Building the decoder
def decoder(x):
    # Decoder hidden layer #1 with tanh activation
    layer_1 = tf.nn.tanh(tf.add(tf.matmul(x, weights['decoder_h1']),
                                biases['decoder_b1']))
    # Decoder hidden layer #2 with tanh activation
    layer_2 = tf.nn.tanh(tf.add(tf.matmul(layer_1, weights['decoder_h2']),
                                biases['decoder_b2']))
    return layer_2

# Construct model
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

# Prediction
y_pred = decoder_op
# Targets (Labels) are the input data.
y_true = X

# Per-row MSE: one reconstruction error per input row
batch_mse = tf.reduce_mean(tf.pow(y_true - y_pred, 2), 1)

# Raw per-element reconstruction error (not used further below)
batch_error_layer = y_true - y_pred

# Define loss and optimizer, minimize the squared error
cost = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)

X is the placeholder for the input data. The weights and biases (the Ws and bs of the network) contain all the parameters we will optimise. Since the first and second hidden layers contain 15 and 5 neurons respectively, the architecture is 28 (input) -> 15 -> 5 -> 15 -> 28 (output).

Each layer uses the tanh activation, as explained earlier. The objective function – the cost above – measures the mean squared error between the predicted and input arrays over one batch, which means it is a scalar. We run the optimizer each time we want to do a batch update.

In addition, batch_mse returns the MSE of each input row in a batch – a vector whose length equals the number of rows in the input. These are the predicted values – or fraud scores, if you like – for the input (be it training, validation, or test data), which we can extract after prediction.
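The distinction between the scalar cost and the per-row batch_mse can be sketched in plain NumPy (toy arrays, not the real graph):

```python
import numpy as np

# Toy "reconstruction": 3 rows, 4 features each.
y_true = np.array([[1., 0., 0., 1.],
                   [0., 1., 1., 0.],
                   [1., 1., 1., 1.]])
y_pred = np.array([[1., 0., 0., 1.],    # reconstructed perfectly
                   [0., 1., 1., 0.],    # reconstructed perfectly
                   [0., 0., 0., 0.]])   # reconstructed badly -> likely anomaly

sq_err = (y_true - y_pred) ** 2
row_mse = sq_err.mean(axis=1)  # one score per row, like tf.reduce_mean(..., 1)
cost = sq_err.mean()           # a single scalar, like the training loss

print(row_mse)  # [0. 0. 1.] - the third row stands out
print(cost)     # 0.3333...
```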

2. Train the model

## TRAIN STARTS BELOW
save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')  # data_dir is your checkpoint directory
saver = tf.train.Saver()

# Initializing the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    now = datetime.now()
    sess.run(init)
    total_batch = int(train_x.shape[0]/batch_size)
    # Training cycle
    for epoch in range(training_epochs):
        # Loop over all batches
        for i in range(total_batch):
            batch_idx = np.random.choice(train_x.shape[0], batch_size)
            batch_xs = train_x[batch_idx]
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_xs})

        # Display logs per epoch step
        if epoch % display_step == 0:
            train_batch_mse = sess.run(batch_mse, feed_dict={X: train_x})
            print("Epoch:", '%04d' % (epoch+1),
                  "cost=", "{:.9f}".format(c),
                  "Train auc=", "{:.6f}".format(auc(train_y, train_batch_mse)),
                  "Time elapsed=", "{}".format(datetime.now() - now))

    print("Optimization Finished!")

    save_path = saver.save(sess, save_model)
    print("Model saved in file: %s" % save_path)

The training loop above is straightforward. Each time we randomly sample a mini-batch of 256 rows from train_x, feed it into the model as X, and run the optimizer to update the parameters via mini-batch gradient descent (RMSProp here).

However, one thing is worth highlighting: we use the same data for training and for validation! This happens in these lines:

if epoch % display_step == 0:
    train_batch_mse = sess.run(batch_mse, feed_dict={X: train_x})
    print("Epoch:", '%04d' % (epoch+1),
          "cost=", "{:.9f}".format(c),
          "Train auc=", "{:.6f}".format(auc(train_y, train_batch_mse)),
          "Time elapsed=", "{}".format(datetime.now() - now))

This may seem counter-intuitive at first, but since the training is unsupervised and the model never 'sees' the labels, it does not lead to overfitting in the usual sense. This validation step is used to monitor early stopping and to tune hyper-parameters. The AUC score obtained when evaluating on train_x is around 0.95.

Eventually, after finalizing the model and hyper-parameters, we evaluate its performance on the held-out test_x set, as shown in the code below (test_batch_mse holds the fraud scores for the test data):

save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')
saver = tf.train.Saver()
# Initializing the variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    now = datetime.now()
    saver.restore(sess, save_model)
    test_batch_mse = sess.run(batch_mse, feed_dict={X: test_x})
    print("Test auc score: {}".format(auc(test_y, test_batch_mse)))

And yes, we obtained a test AUC of around 0.944.

3. (What’s more) Using auto-encoder as a data pre-processing step

So far we have covered all the steps needed to train an auto-encoder and make predictions on test data. If you are interested in going further, the next thing I will do is 'pre-train' on our data set using the auto-encoder, extract the embedding layer, and feed those embeddings into a fully-connected feed-forward neural network that performs binary classification.

The rationale is simple: since the auto-encoder can differentiate frauds from non-frauds, the lower-dimensional features it has derived (the embedding layer) during training should contain useful latent features for fraud classification. At the very least, they should speed up the classifier's learning compared with making it adapt to the raw features from scratch.

First, let's extract the embedding layer for the whole data set. It is the encoder_op op, which we can evaluate with sess.run.

save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')  # the checkpoint saved during training above
saver = tf.train.Saver()

# Initializing the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    now = datetime.now()
    saver.restore(sess, save_model)

    test_encoding = sess.run(encoder_op, feed_dict={X: test_x})
    train_encoding = sess.run(encoder_op, feed_dict={X: train_x})

    print("Dim for test_encoding and train_encoding are: \n", test_encoding.shape, '\n', train_encoding.shape)

Second, we build the graph for FC feed-forward neural network as follows (again, you could use validation to fine tune hyper-parameters such as hidden layer numbers and sizes):

n_input = test_encoding.shape[1]

hidden_size = 4
output_size = 2

X = tf.placeholder(tf.float32, [None, n_input], name='input_x')
# Labels must be float for softmax_cross_entropy_with_logits
y_ = tf.placeholder(tf.float32, shape=[None, output_size], name='target_y')

weights = {
    'W1': tf.Variable(tf.truncated_normal([n_input, hidden_size])),
    'W2': tf.Variable(tf.truncated_normal([hidden_size, output_size])),
}
biases = {
    'b1': tf.Variable(tf.zeros([hidden_size])),
    'b2': tf.Variable(tf.zeros([output_size])),
}

hidden_layer = tf.nn.relu(tf.add(tf.matmul(X, weights['W1']), biases['b1']))
pred_logits = tf.add(tf.matmul(hidden_layer, weights['W2']), biases['b2'])
pred_probs = tf.nn.softmax(pred_logits)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=pred_logits))

optimizer = tf.train.AdamOptimizer(2e-4).minimize(cross_entropy)

Then we prepare the data and train the model while monitoring its validation scores. Here we further split train_encoding into train_enc_x and val_enc_x, taking 80% and 20% of train_encoding respectively. Following the typical supervised approach, we use train_enc_x for training and val_enc_x for validation.

n_epochs = 70
batch_size = 256

# PREPARE DATA
VAL_PERC = 0.2
# One-hot encode the Class labels for the softmax classifier
all_y_bin = np.zeros((df.shape[0], 2))
all_y_bin[range(df.shape[0]), df['Class'].values] = 1

train_enc_x = train_encoding[:int(train_encoding.shape[0] * (1-VAL_PERC))]
train_enc_y = all_y_bin[:int(train_encoding.shape[0] * (1-VAL_PERC))]

val_enc_x = train_encoding[int(train_encoding.shape[0] * (1-VAL_PERC)):]
val_enc_y = all_y_bin[int(train_encoding.shape[0] * (1-VAL_PERC)):train_encoding.shape[0]]

test_enc_y = all_y_bin[train_encoding.shape[0]:]
print("Num of data for train, val and test are: \n{}, \n{}, \n{}".format(
    train_enc_x.shape[0], val_enc_x.shape[0], test_encoding.shape[0]))

# TRAIN STARTS
save_model = os.path.join(data_dir, 'temp_saved_model_FCLayers.ckpt')
saver = tf.train.Saver()

# Initializing the variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    now = datetime.now()
    sess.run(init)
    total_batch = int(train_enc_x.shape[0]/batch_size)
    # Training cycle
    for epoch in range(n_epochs):
        # Loop over all batches
        for i in range(total_batch):
            batch_idx = np.random.choice(train_enc_x.shape[0], batch_size)
            batch_xs = train_enc_x[batch_idx]
            batch_ys = train_enc_y[batch_idx]

            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cross_entropy], feed_dict={X: batch_xs, y_: batch_ys})

        # Display logs per epoch step
        if epoch % display_step == 0:
            val_probs = sess.run(pred_probs, feed_dict={X: val_enc_x})
            print("Epoch:", '%04d' % (epoch+1),
                  "cost=", "{:.9f}".format(c),
                  "Val auc=", "{:.6f}".format(auc(val_enc_y[:, 1], val_probs[:, 1])),
                  "Time elapsed=", "{}".format(datetime.now() - now))

    print("Optimization Finished!")

    save_path = saver.save(sess, save_model)
    print("Model saved in file: %s" % save_path)

And evaluate on the same test set as previously:

save_model = os.path.join(data_dir, 'temp_saved_model_FCLayers.ckpt')
saver = tf.train.Saver()
# Initializing the variables

init = tf.global_variables_initializer()

with tf.Session() as sess:
    now = datetime.now()
    saver.restore(sess, save_model)
    test_probs = sess.run(pred_probs, feed_dict={X: test_encoding})
    print("Test auc score: {}".format(auc(test_enc_y[:, 1], test_probs[:, 1])))

And our AUC score further improved, to 0.9556!

And yeah, please refer to the Jupyter notebook for the detailed version of the code.

6 thoughts on "Credit card fraud detection 1 – using auto-encoder in TensorFlow"

  1. bonkersabouttech August 2, 2017 — 3:58 pm

    This is awesome, thank you!

    Please could you provide an example of making a single prediction?

    So for example, if I were to send in a single credit card transaction, how would I obtain the reconstruction error for one sample?

    Many thanks!


    1. Hello,

That is actually the same as what I did to predict test_x above; your test_x is now just an evaluation set with sample size = 1.

      So you need to pre-process your single transaction (using z-score), and then use something like below to predict:

save_model = os.path.join(data_dir, 'temp_saved_model.ckpt')
saver = tf.train.Saver()

init = tf.global_variables_initializer()

with tf.Session() as sess:
    saver.restore(sess, save_model)
    test_batch_mse = sess.run(batch_mse, feed_dict={X: test_x})


      1. Thanks for your help, super useful!


  2. Great post. Very helpful. Was wondering if you can help clarify something. Can I use high MSE values from the auto-encoder validation step as indicators for anomalies? If yes, can you help explain why the data with high MSE value do not seem to correspond to the actual fraudulent entries in the test dataset?

    David


    1. Thanks for your question!

Yes, you can – that's exactly what I did, as follows:

batch_mse = tf.reduce_mean(tf.pow(y_true - y_pred, 2), 1)

The batch_mse will be the MSE for each test example, the same length as your input, and can be used as an indicator for anomalies.

As for your second question: the percentage of fraud in the test set is extremely small – around 0.1% – so an AUC score of 0.95 simply means that if you sort the test data by batch_mse in descending order, among the top 500 cases you would probably see around 10% fraud, which is a big increase over the sample mean of 0.1%.
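The point can be illustrated with synthetic scores (all numbers below are made up for illustration, not the auto-encoder's real output):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 70000                                  # roughly the size of the test split
labels = np.zeros(n, dtype=int)
fraud_idx = rng.choice(n, size=70, replace=False)  # ~0.1% fraud
labels[fraud_idx] = 1

# Imperfect scores: frauds tend to score higher but overlap with normals.
scores = rng.normal(0.0, 1.0, n)
scores[fraud_idx] += 2.5

top_500 = np.argsort(scores)[::-1][:500]
precision = labels[top_500].mean()
print(precision)  # a few percent - far above the 0.1% base rate
```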

      You can check my Credit Card Fraud Detection 2 for more details on the analysis part at the bottom –
      https://weiminwang.blog/2017/08/05/credit-card-fraud-detection-2-using-restricted-boltzmann-machine-in-tensorflow/

