The Amazing Effectiveness of Sequence to Sequence Model for Time Series

In this tutorial, I am excited to showcase examples of building Time Series forecasting models with seq2seq in TensorFlow. The purpose of this post is to give an intuitive as well as technical understanding of the implementation, and to demonstrate two useful features under the hood:

  1. Multivariate input and output signals
  2. Variable input and output sequence lengths

After that, I will demonstrate how to apply what we have covered to something very interesting – predicting extreme events / outliers.

Finally, we will end this tutorial by applying what we have learnt to a real-world case – forecasting Beijing PM2.5 pollution.

At any time, please feel free to jump to the Python notebook on my GitHub if you want to skip the reading.

This post is organised as follows:

  1. Univariate case
  2. Multivariate case
  3. Predict extreme events / outliers
  4. Case study on real world data set – forecasting Beijing PM2.5 pollution

Univariate case

1. Data generation approach

Let’s say we want to learn the pattern of a sinusoidal wave like below:

[Figure: the clean sinusoidal signal]

However, real-world data might be far noisier than this, as shown below:

[Figure: the same sinusoidal signal with noise added]

So we will sample training data in batches, with each sample being an input-output sequence pair. The input and output lengths can differ as well. After training, we will feed the model the test input sequence, and let the model produce the predicted output sequence.

The code for generating the training data is as follows:

import numpy as np
import matplotlib.pyplot as plt

input_seq_len = 15
output_seq_len = 20

x = np.linspace(0, 30, 105)
train_data_x = x[:(105-output_seq_len)]

def true_signal(x):
    y = 2 * np.sin(x)
    return y

def noise_func(x, noise_factor = 0.5):
    return np.random.randn(len(x)) * noise_factor

def generate_y_values(x):
    return true_signal(x) + noise_func(x)

def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = 15, output_seq_len = 20):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_y_values(x) for x in input_seq_x]
    output_seq_y = [generate_y_values(x) for x in output_seq_x]

    ## return shape: (batch_size, time_steps, feature_dims)
    return np.array(input_seq_y), np.array(output_seq_y)

Please note that we chose the input and output sequence lengths to be 15 and 20, respectively. The last 20 elements of the array ‘x’ are reserved for generating the test labels later on, after we make the predictions, so we are not using them for training.

To see what the training data looks like, we can sample and visualise one pair of them:

input_seq, output_seq = generate_train_samples(batch_size=100)

l1, = plt.plot(range(15), input_seq[1], 'go', label = 'input sequence for one sample')
l2, = plt.plot(range(15, 35), output_seq[1], 'ro', label = 'output sequence for one sample')
plt.legend(handles = [l1, l2], loc = 'lower left')
plt.show()

[Figure: one sampled input/output sequence pair]

2. The seq2seq model and how we train it

[Figure: the seq2seq encoder-decoder architecture]

What we will do next is similar to what’s depicted above. The seq2seq model contains two RNNs, e.g. LSTMs: the ‘encoder’ and the ‘decoder’. Each box here is one RNN cell. The encoder first runs through the whole input sequence (the ‘A’, ‘B’, ‘C’ here), encodes it into a fixed-length vector – the last hidden state of the encoder LSTM – and passes that vector to the decoder, which decodes it into the output sequence (the ‘W’, ‘X’, ‘Y’, ‘Z’ here).

The training approach we adopt here is called ‘guided’ training: during the decoding steps, we first feed a ‘GO’ token as the initial input to the decoder (this can be a zeros vector), and subsequently we provide the correct input to the decoder at every time step, even if the decoder made a mistake before. During test or inference time, however, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.

The code is below:

import copy

import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.python.ops import variable_scope
from tensorflow.python.framework import dtypes

def _rnn_decoder(decoder_inputs,
                initial_state,
                cell,
                loop_function=None,
                scope=None):
  """RNN decoder for the sequence-to-sequence model.
  Args:
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    initial_state: 2D Tensor with shape [batch_size x cell.state_size].
    cell: rnn_cell.RNNCell defining the cell function and size.
    loop_function: If not None, this function will be applied to the i-th output
      in order to generate the i+1-st input, and decoder_inputs will be ignored,
      except for the first element ("GO" symbol). This can be used for decoding,
      but also for training to emulate http://arxiv.org/abs/1506.03099.
      Signature -- loop_function(prev, i) = next
        * prev is a 2D Tensor of shape [batch_size x output_size],
        * i is an integer, the step number (when advanced control is needed),
        * next is a 2D Tensor of shape [batch_size x input_size].
    scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing generated outputs.
      state: The state of each cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
        (Note that in some cases, like basic RNN cell or GRU cell, outputs and
         states can be the same. They are different for LSTM cells though.)
  """
  with variable_scope.variable_scope(scope or "rnn_decoder"):
    state = initial_state
    outputs = []
    prev = None
    for i, inp in enumerate(decoder_inputs):
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      if i > 0:
        variable_scope.get_variable_scope().reuse_variables()
      output, state = cell(inp, state)
      outputs.append(output)
      if loop_function is not None:
        prev = output
  return outputs, state

def _basic_rnn_seq2seq(encoder_inputs,
                      decoder_inputs,
                      cell,
                      feed_previous,
                      dtype=dtypes.float32,
                      scope=None):
  """Basic RNN sequence-to-sequence model.
  This model first runs an RNN to encode encoder_inputs into a state vector,
  then runs decoder, initialized with the last encoder state, on decoder_inputs.
  Encoder and decoder use the same RNN cell type, but don't share parameters.
  Args:
    encoder_inputs: A list of 2D Tensors [batch_size x input_size].
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    feed_previous: Boolean; if True, only the first of decoder_inputs will be
      used (the "GO" symbol), all other inputs will be generated by the previous
      decoder output using _loop_function below. If False, decoder_inputs are used
      as given (the standard decoder case).
    dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
    scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing the generated outputs.
      state: The state of each decoder cell in the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  """
  with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
    enc_cell = copy.deepcopy(cell)
    _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
    if feed_previous:
        return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function)
    else:
        return _rnn_decoder(decoder_inputs, enc_state, cell)

def _loop_function(prev, _):
  '''Naive implementation of loop function for _rnn_decoder. Transform prev from
  dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be
  used as decoder input of next time step '''
  return tf.matmul(prev, weights['out']) + biases['out']

This is basically the same function from the official TensorFlow seq2seq repo. However, I did make a change to the basic_rnn_seq2seq function by adding the ‘feed_previous‘ option, which had not yet been implemented in TensorFlow. This feeds the decoder output at time t as the decoder input at time t+1, as discussed. So during decoding in the inference phase, only the first element of decoder_inputs will be used.

During the training phase, however, decoder_inputs will still be used as the input at each time step t.

The _loop_function simply serves as a utility function to transform the decoder cell output into the dimensions of the input, since the LSTM cell may have different input and output dimensions. We will define weights[‘out’] and biases[‘out’] later on when we build the model.

Now, with the core _basic_rnn_seq2seq function implemented, we can move on to building the whole graph, including steps such as variable & placeholder declarations, formatting inputs & outputs, and computing the loss and defining the training ops.

Here’s the whole graph building:

import os
import copy

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.python.ops import variable_scope
from tensorflow.python.framework import dtypes

## Parameters
learning_rate = 0.01
lambda_l2_reg = 0.003  

## Network Parameters
# length of input signals
input_seq_len = 15
# length of output signals
output_seq_len = 20
# size of LSTM Cell
hidden_dim = 64
# num of input signals
input_dim = 1
# num of output signals
output_dim = 1
# num of stacked lstm layers
num_stacked_layers = 2
# gradient clipping - to avoid gradient exploding
GRADIENT_CLIPPING = 2.5 

def build_graph(feed_previous = False):

    tf.reset_default_graph()

    global_step = tf.Variable(
                  initial_value=0,
                  name="global_step",
                  trainable=False,
                  collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES])

    weights = {
        'out': tf.get_variable('Weights_out', \
                               shape = [hidden_dim, output_dim], \
                               dtype = tf.float32, \
                               initializer = tf.truncated_normal_initializer()),
    }
    biases = {
        'out': tf.get_variable('Biases_out', \
                               shape = [output_dim], \
                               dtype = tf.float32, \
                               initializer = tf.constant_initializer(0.)),
    }

    with tf.variable_scope('Seq2seq'):
        # Encoder: inputs
        enc_inp = [
            tf.placeholder(tf.float32, shape=(None, input_dim), name="inp_{}".format(t))
               for t in range(input_seq_len)
        ]

        # Decoder: target outputs
        target_seq = [
            tf.placeholder(tf.float32, shape=(None, output_dim), name="y_{}".format(t))
              for t in range(output_seq_len)
        ]

        # Give a "GO" token to the decoder.
        # If dec_inp are fed into decoder as inputs, this is 'guided' training; otherwise only the
        # first element will be fed as decoder input which is then 'un-guided'
        dec_inp = [ tf.zeros_like(target_seq[0], dtype=tf.float32, name="GO") ] + target_seq[:-1]

        with tf.variable_scope('LSTMCell'):
            cells = []
            for i in range(num_stacked_layers):
                with tf.variable_scope('RNN_{}'.format(i)):
                    cells.append(tf.contrib.rnn.LSTMCell(hidden_dim))
            cell = tf.contrib.rnn.MultiRNNCell(cells)

        def _rnn_decoder(decoder_inputs,
                        initial_state,
                        cell,
                        loop_function=None,
                        scope=None):
          """RNN decoder for the sequence-to-sequence model.
          Args:
            decoder_inputs: A list of 2D Tensors [batch_size x input_size].
            initial_state: 2D Tensor with shape [batch_size x cell.state_size].
            cell: rnn_cell.RNNCell defining the cell function and size.
            loop_function: If not None, this function will be applied to the i-th output
              in order to generate the i+1-st input, and decoder_inputs will be ignored,
              except for the first element ("GO" symbol). This can be used for decoding,
              but also for training to emulate http://arxiv.org/abs/1506.03099.
              Signature -- loop_function(prev, i) = next
                * prev is a 2D Tensor of shape [batch_size x output_size],
                * i is an integer, the step number (when advanced control is needed),
                * next is a 2D Tensor of shape [batch_size x input_size].
            scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
          Returns:
            A tuple of the form (outputs, state), where:
              outputs: A list of the same length as decoder_inputs of 2D Tensors with
                shape [batch_size x output_size] containing generated outputs.
              state: The state of each cell at the final time-step.
                It is a 2D Tensor of shape [batch_size x cell.state_size].
                (Note that in some cases, like basic RNN cell or GRU cell, outputs and
                 states can be the same. They are different for LSTM cells though.)
          """
          with variable_scope.variable_scope(scope or "rnn_decoder"):
            state = initial_state
            outputs = []
            prev = None
            for i, inp in enumerate(decoder_inputs):
              if loop_function is not None and prev is not None:
                with variable_scope.variable_scope("loop_function", reuse=True):
                  inp = loop_function(prev, i)
              if i > 0:
                variable_scope.get_variable_scope().reuse_variables()
              output, state = cell(inp, state)
              outputs.append(output)
              if loop_function is not None:
                prev = output
          return outputs, state

        def _basic_rnn_seq2seq(encoder_inputs,
                              decoder_inputs,
                              cell,
                              feed_previous,
                              dtype=dtypes.float32,
                              scope=None):
          """Basic RNN sequence-to-sequence model.
          This model first runs an RNN to encode encoder_inputs into a state vector,
          then runs decoder, initialized with the last encoder state, on decoder_inputs.
          Encoder and decoder use the same RNN cell type, but don't share parameters.
          Args:
            encoder_inputs: A list of 2D Tensors [batch_size x input_size].
            decoder_inputs: A list of 2D Tensors [batch_size x input_size].
            feed_previous: Boolean; if True, only the first of decoder_inputs will be
              used (the "GO" symbol), all other inputs will be generated by the previous
              decoder output using _loop_function below. If False, decoder_inputs are used
              as given (the standard decoder case).
            dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
            scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
          Returns:
            A tuple of the form (outputs, state), where:
              outputs: A list of the same length as decoder_inputs of 2D Tensors with
                shape [batch_size x output_size] containing the generated outputs.
              state: The state of each decoder cell in the final time-step.
                It is a 2D Tensor of shape [batch_size x cell.state_size].
          """
          with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
            enc_cell = copy.deepcopy(cell)
            _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
            if feed_previous:
                return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function)
            else:
                return _rnn_decoder(decoder_inputs, enc_state, cell)

        def _loop_function(prev, _):
          '''Naive implementation of loop function for _rnn_decoder. Transform prev from
          dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be
          used as decoder input of next time step '''
          return tf.matmul(prev, weights['out']) + biases['out']

        dec_outputs, dec_memory = _basic_rnn_seq2seq(
            enc_inp,
            dec_inp,
            cell,
            feed_previous = feed_previous
        )

        reshaped_outputs = [tf.matmul(i, weights['out']) + biases['out'] for i in dec_outputs]

    # Training loss and optimizer
    with tf.variable_scope('Loss'):
        # L2 loss
        output_loss = 0
        for _y, _Y in zip(reshaped_outputs, target_seq):
            output_loss += tf.reduce_mean(tf.pow(_y - _Y, 2))

        # L2 regularization for weights and biases
        reg_loss = 0
        for tf_var in tf.trainable_variables():
            if 'Biases_' in tf_var.name or 'Weights_' in tf_var.name:
                reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var))

        loss = output_loss + lambda_l2_reg * reg_loss

    with tf.variable_scope('Optimizer'):
        optimizer = tf.contrib.layers.optimize_loss(
                loss=loss,
                learning_rate=learning_rate,
                global_step=global_step,
                optimizer='Adam',
                clip_gradients=GRADIENT_CLIPPING)

    # Note: we store the Saver class itself (not an instance); it is instantiated
    # later via rnn_model['saver']() after the graph has been built
    saver = tf.train.Saver

    return dict(
        enc_inp = enc_inp,
        target_seq = target_seq,
        train_op = optimizer,
        loss=loss,
        saver = saver,
        reshaped_outputs = reshaped_outputs,
        )

The ‘feed_previous‘ will be set to True during inference phase, and False during training.

The input and output dims are both 1 for our univariate case, and the input and output sequences can have different lengths.

3. Train the model

We will use the ‘guided’ training approach as discussed previously:

total_iterations = 100
batch_size = 16
KEEP_RATE = 0.5
train_losses = []
val_losses = []

x = np.linspace(0, 30, 105)
train_data_x = x[:85]

rnn_model = build_graph(feed_previous=False)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    for i in range(total_iterations):
        batch_input, batch_output = generate_train_samples(batch_size=batch_size)

        feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t].reshape(-1,input_dim) for t in range(input_seq_len)}
        feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t].reshape(-1,output_dim) for t in range(output_seq_len)})
        _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict)
        print(loss_t)

    temp_saver = rnn_model['saver']()
    save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0'))

print("Checkpoint saved at: ", save_path)

And we can see the training loss decrease as expected:

 41.1176
 15.1229
 38.0447
 17.5147
 10.7143
 15.86
 11.0018
 9.36498
 8.26214
 ...
 6.23474
 5.99408
 6.12859
 5.7535
 5.7275
 6.19146
 Checkpoint saved at: /home/weimin/seq2seq/univariate_ts_model0

4. Inference and visualization

For prediction, we will use the last sequence of 15 values from train_data_x as the input signal, and let the model predict without feeding in the true labels (by setting ‘feed_previous‘ to True):

test_seq_input = true_signal(train_data_x[-15:])

rnn_model = build_graph(feed_previous=True)

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    saver = rnn_model['saver']().restore(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0'))

    feed_dict = {rnn_model['enc_inp'][t]: test_seq_input[t].reshape(1,1) for t in range(input_seq_len)}
    feed_dict.update({rnn_model['target_seq'][t]: np.zeros([1, output_dim]) for t in range(output_seq_len)})
    final_preds = sess.run(rnn_model['reshaped_outputs'], feed_dict)

    final_preds = np.concatenate(final_preds, axis = 1)

We will then visualize the predictions against the true labels below:

## 'y' here is the noisy series over the full range of x (assumed; e.g. y = generate_y_values(x))
y = generate_y_values(x)

l1, = plt.plot(range(85), true_signal(train_data_x[:85]), label = 'Training truth')
l2, = plt.plot(range(85, 105), y[85:], 'yo', label = 'Test truth')
l3, = plt.plot(range(85, 105), final_preds.reshape(-1), 'ro', label = 'Test predictions')
plt.legend(handles = [l1, l2, l3], loc = 'lower left')
plt.show()

[Figure: training truth, test truth and test predictions]

Through a few iterations of learning, the model has already figured out the hidden signal from the noisy training data, which is amazing.

[Figure: noisy training data vs. the hidden true signal]

Multivariate case

This one should be more exciting!

So what if we have more than one signal for input and output? This is especially common in real-world scenarios. Let’s say we would like to predict supply & demand for the next 30 minutes; what affects the results may not only be past supply & demand, but also things like weather, rain, temperature, app usage, traffic signals, whether it is a public holiday, etc.

1. Simulate input and output signals

We start with a sine and a cosine signal, x1 and x2 – these are the outputs we want to predict! From them, we derive three additional signals – y1, y2, y3 – as our input sequences, using some arbitrary formulas as below:

x = np.linspace(0, 40, 130)
x1 = 2*np.sin(x)
x2 = 2*np.cos(x)

## random formula to generate input sequences
y1 = 1.6*x1**4 - 2*x2 - 10
y2 = 1.2*x2**2*x1 + 6*x2 - 6*x1
y3 = 2*x1**3 + 2*x2**3 - x1*x2

So overall, the structure will be 3-in-2-out.

We can also visualize the inputs as follows:

[Figure: the three input signals y1, y2, y3]

Put another way, we want our model to uncover the hidden signals from the observed ones, as plotted below:

[Figure: the hidden output signals x1 and x2 behind the observed inputs]

This is more interesting, as we are now flexible in both the number of signals and the lengths of the sequences.

You are free to explore different numbers of input and output signals if you want to.

2. Sampling training data

Let’s rewrite the data-sampling mechanism:

input_seq_len = 15
output_seq_len = 20
x = np.linspace(0, 40, 130)
train_data_x = x[:(130 - output_seq_len)]

def true_output_signals(x):
    x1 = 2 * np.sin(x)
    x2 = 2 * np.cos(x)
    return x1, x2

def true_input_signals(x):
    x1, x2 = true_output_signals(x)
    y1 = 1.6*x1**4 - 2*x2 - 10
    y2 = 1.2*x2**2 * x1 + 2*x2*3 - x1*6
    y3 = 2*x1**3 + 2*x2**3 - x1*x2
    return y1, y2, y3

def noise_func(x, noise_factor = 2):
    return np.random.randn(len(x)) * noise_factor

def generate_samples_for_output(x):
    x1, x2 = true_output_signals(x)
    return x1+noise_func(x1, 0.5), \
           x2+noise_func(x2, 0.5)

def generate_samples_for_input(x):
    y1, y2, y3 = true_input_signals(x)
    return y1+noise_func(y1, 2), \
           y2+noise_func(y2, 2), \
           y3+noise_func(y3, 2)

def generate_train_samples(x = train_data_x, batch_size = 10):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_samples_for_input(x) for x in input_seq_x]
    output_seq_y = [generate_samples_for_output(x) for x in output_seq_x]

    ## return shape: (batch_size, time_steps, feature_dims)
    return np.array(input_seq_y).transpose(0, 2, 1), np.array(output_seq_y).transpose(0, 2, 1)

As usual, we can visualize one training sample:

[Figure: one multivariate training sample (input and output sequences)]

3. Build the model

Luckily, we don’t need to change much of the graph-building code. All we need to change is input_dim = 3 and output_dim = 2.

I will skip pasting the full code here; see the parameter sketch below.
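
For reference, here is a minimal sketch of the only parameters that change (everything else in build_graph stays exactly as in the univariate version):

## Network Parameters for the multivariate (3-in-2-out) case
# length of input signals
input_seq_len = 15
# length of output signals
output_seq_len = 20
# size of LSTM Cell
hidden_dim = 64
# num of input signals (y1, y2, y3)
input_dim = 3
# num of output signals (x1, x2)
output_dim = 2
# num of stacked lstm layers
num_stacked_layers = 2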

4. Train the model

Training is also not much different. We will again set ‘feed_previous‘ to False for ‘guided’ training – feeding in the correct decoder input at each time step. So here’s the code:

total_iterations = 100
batch_size = 16
KEEP_RATE = 0.5
train_losses = []
val_losses = []

x = np.linspace(0, 40, 130)
train_data_x = x[:110]

rnn_model = build_graph(feed_previous=False)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    print("Training losses: ")
    for i in range(total_iterations):
        batch_input, batch_output = generate_train_samples(batch_size=batch_size)

        feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t] for t in range(input_seq_len)}
        feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t] for t in range(output_seq_len)})
        _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict)
        print(loss_t)

    temp_saver = rnn_model['saver']()
    save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'multivariate_ts_model0'))

print("Checkpoint saved at: ", save_path)

5. Inference and visualisation

Once training is done, we can visualise the prediction performance on the test output sequence.
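
The inference code mirrors the univariate case. Below is a minimal sketch, assuming the checkpoint was saved as 'multivariate_ts_model0' as in the training snippet above (the exact notebook code may differ slightly):

test_seq_input = np.array(true_input_signals(train_data_x[-input_seq_len:])).T  # shape: (15, 3)

rnn_model = build_graph(feed_previous=True)

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    rnn_model['saver']().restore(sess, os.path.join('/home/weimin/seq2seq', 'multivariate_ts_model0'))

    feed_dict = {rnn_model['enc_inp'][t]: test_seq_input[t].reshape(1, input_dim) for t in range(input_seq_len)}
    feed_dict.update({rnn_model['target_seq'][t]: np.zeros([1, output_dim]) for t in range(output_seq_len)})
    final_preds = sess.run(rnn_model['reshaped_outputs'], feed_dict)

    ## each element has shape (1, output_dim); stack into (output_seq_len, output_dim)
    final_preds = np.concatenate(final_preds, axis = 0)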

[Figure: predicted vs. true test output sequences]

To put them together, we can also visualize both the input and output sequences at the same time (I scaled up the output sequences just to make them more obvious):

[Figure: input and output sequences plotted together (outputs scaled up)]

Outliers / Extreme events

1. What could be an outlier / extreme event case?

[Figure: an example of an outlier / extreme event]

This will be very useful for companies dealing with real-time traffic, for example. There are many factors that could trigger anomalous events: a sudden downpour could lead to a surge in demand for taxis on the road, and public holidays could be another factor leading to increased demand for the whole day.

2. Case & data simulation

Let’s assume our traffic patterns are like the ones below, where the y-axis is the traffic volume and the x-axis is the day.

Imagine that during public holidays, an additional 2 units of traffic are added on top of what we would otherwise see for that day, as reflected by the sharp peaks in the graph.

[Figure: traffic over days, with sharp peaks on public holidays]

Of course, the real data will have noise, both for traffic on normal days and on public holidays (which means the true effect of a public holiday – 2 units – is unknown to us and has to be learnt by the model from the data). What we do know, however, is whether each day is a public holiday (1) or not (0). Here’s the code for data generation:

input_seq_len = 15
output_seq_len = 20
extreme_factor = 2 # 2 additional units of traffic increase if extreme / outlier

x = np.linspace(0, 60, 210)
train_data_x = x[:190]

np.random.seed(10)
num_events_train = 40 # total number of extreme events in training data
train_events_bool = np.zeros(190)
train_events_x = np.random.choice(range(190), num_events_train)
train_events_bool[train_events_x] = 1

num_events_test = 1 # total number of extreme events in test data
test_events_bool = np.zeros(20)
test_events_x = np.random.choice(range(20), num_events_test)
test_events_bool[test_events_x] = 1

def true_signal(x):
    y = 2 * np.sin(x)
    return y

def noise_func(x, noise_factor = 1):
    return np.random.randn(len(x)) * noise_factor

def generate_y_values(x, event_bool):
    return true_signal(x) + noise_func(x) + extreme_factor * event_bool

def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len, train_events_bool = train_events_bool):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    input_seq_event_bool = [train_events_bool[i:(i+input_seq_len)] for i in start_x_idx]

    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]
    output_seq_event_bool = [train_events_bool[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(input_seq_x, input_seq_event_bool)]
    output_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(output_seq_x, output_seq_event_bool)]

    return np.array(input_seq_y), np.array(output_seq_y), np.array(input_seq_event_bool), np.array(output_seq_event_bool)

As always, we can sample some training data to see what it looks like:

input_seq, output_seq, input_seq_event_bool, output_seq_event_bool = generate_train_samples(batch_size=10)
l1, = plt.plot(range(15), input_seq[0], 'go', label = 'input sequence for one sample')
l2, = plt.plot(range(15, 35), output_seq[0], 'ro', label = 'output sequence for one sample')
plt.legend(handles = [l1, l2], loc = 'lower left')
plt.show()

[Figure: one training sample containing extreme events]

It is quite noisy, so the extreme points are hard to see. However, you can simply turn off the noise factor (set it to zero) to see the extreme events clearly, as in the quick sketch below. I will skip showing the resulting plot here, just to save space.
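
As a quick sanity check (a small sketch, not in the original notebook), you can plot the noiseless training series directly; the +2 jumps on public holidays then become obvious:

## noiseless training series: true signal plus the extreme-event effect
clean_y = true_signal(train_data_x) + extreme_factor * train_events_bool
plt.plot(train_data_x, clean_y)
plt.show()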

3. If we train the same seq2seq as before …

Now, if we use the simulated data to train the same seq2seq model as before, we won’t get a very good result. The single outlier in the test data will not be captured by the model. (This is easy to understand: the extreme events are totally random, and the model has no idea which days they will fall on in either the training or the test phase.)

So if you test the model, you should see something similar to this:

[Figure: test predictions without the public-holiday feature]

  1. The model fits the test data poorly.
  2. The model has no awareness of, or differentiation for, the outlier.

4. Re-design the seq2seq

How do we redesign the model to deal with extreme events?

Since we already know whether each day is a public holiday, we can pass that ‘knowledge’ to the decoder at each time step t, to let the decoder learn the difference between normal days and public holidays.

In summary, here are the steps:

  1. Get the bool vector for the output sequence (‘1’ for a public holiday, ‘0’ for a normal day)
  2. During decoding, at each time step t, concatenate the bool value to the original decoder input to form the new input (see the sketch after this list)
  3. Feed the new input to the decoder to generate the output at t
  4. Continue until the decoding phase is done
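
One way to implement step 2 is to concatenate the event bool to the decoder input at every step inside the decoder. Below is a sketch, not the exact notebook code: dec_event_bool is a hypothetical list of output_seq_len placeholders, each of shape [batch_size x 1], holding the public-holiday flags for the output time steps.

def _rnn_decoder_with_events(decoder_inputs,
                             dec_event_bool,
                             initial_state,
                             cell,
                             loop_function=None,
                             scope=None):
  with variable_scope.variable_scope(scope or "rnn_decoder"):
    state = initial_state
    outputs = []
    prev = None
    for i, inp in enumerate(decoder_inputs):
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      # step 2: append the public-holiday flag for step i to the decoder input
      inp = tf.concat([inp, dec_event_bool[i]], axis=1)
      if i > 0:
        variable_scope.get_variable_scope().reuse_variables()
      output, state = cell(inp, state)
      outputs.append(output)
      if loop_function is not None:
        prev = output
  return outputs, state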

We won’t change the rest, such as the ‘guided’ training, and we also won’t touch the encoding phase. With these steps implemented, we should see improved results on the test data:

[Figure: test predictions with the public-holiday feature fed to the decoder]

Overall it fits well, and most importantly, the outlier point was captured perfectly!

Please refer to my GitHub page for the complete code and notebook.

Case study on Beijing PM2.5 pollution forecasting

1. The case introduction.

Finally, we will apply what we have learnt to a real-world data set – the hourly data set containing the PM2.5 readings of the US Embassy in Beijing.

You can download the data from here. The complete list of features included in the raw data set is:

  1. No: row number
  2. year: year of data in this row
  3. month: month of data in this row
  4. day: day of data in this row
  5. hour: hour of data in this row
  6. pm2.5: PM2.5 concentration
  7. DEWP: Dew Point
  8. TEMP: Temperature
  9. PRES: Pressure
  10. cbwd: Combined wind direction
  11. Iws: Cumulated wind speed
  12. Is: Cumulated hours of snow
  13. Ir: Cumulated hours of rain

Our task here, however, is only to predict the pollution measure – pm2.5 – for the next few hours, based on the historical trends of the other features. In short, it’s going to be a multivariate, n-input-1-output case.

2. Visualisation and pre-processing of data

[Figure: df.head() of the Beijing PM2.5 data set]

[Figure: line plots of the numeric features]

A quick df.head() plus a few simple plots give us an intuitive understanding of the data set.

All the features plotted above are numeric / continuous. However, there is one more, cbwd, which is categorical because it refers to the wind direction; we will one-hot encode it. I will not be using year, month, day or hour as features, but you are encouraged to add them as well, to see if they improve your performance.

Furthermore, there are no NA values in any feature except pm2.5, where 2067 out of 43824 values are NA. We will simply fill the NAs with 0.

Regarding the train/test split – I will use the last month of data for test, and all the data before that as my training sample. Feel free to split according to your preference.

Finally, I will do a simple preprocessing step to normalise all the data using z-scores.

The overall script is below:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_csv('./PRSA_data_2010.1.1-2014.12.31.csv')
print(df.head())

cols_to_plot = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
i = 1
# plot each column
plt.figure(figsize = (10,12))
for col in cols_to_plot:
    plt.subplot(len(cols_to_plot), 1, i)
    plt.plot(df[col])
    plt.title(col, y=0.5, loc='left')
    i += 1
plt.show()

## Fill NA with 0
#print(df.isnull().sum())
df.fillna(0, inplace = True)

## One-hot encode 'cbwd'
temp = pd.get_dummies(df['cbwd'], prefix='cbwd')
df = pd.concat([df, temp], axis = 1)
del df['cbwd'], temp

## Split into train and test - I used the last 1 month data as test, but it's up to you to decide the ratio
df_train = df.iloc[:(-31*24), :].copy()
df_test = df.iloc[-31*24:, :].copy()

## take out the useful columns for modeling - you may also keep 'hour', 'day' or 'month' and to see if that will improve your accuracy
X_train = df_train.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy()
X_test = df_test.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy()
y_train = df_train['pm2.5'].values.copy().reshape(-1, 1)
y_test = df_test['pm2.5'].values.copy().reshape(-1, 1)

## z-score transform x - not including the one-hot columns!
for i in range(X_train.shape[1]-4):
    temp_mean = X_train[:, i].mean()
    temp_std = X_train[:, i].std()
    X_train[:, i] = (X_train[:, i] - temp_mean) / temp_std
    X_test[:, i] = (X_test[:, i] - temp_mean) / temp_std

## z-score transform y
y_mean = y_train.mean()
y_std = y_train.std()
y_train = (y_train - y_mean) / y_std
y_test = (y_test - y_mean) / y_std

3. Transforming training and test data into 3-D formats.

Now that we have X_train, y_train, X_test and y_test as our train and test data sets, we need to transform them into the 3-D format for time series – (batch_size, time_steps, feature_dim).

We will prepare two util functions to do so:

input_seq_len = 30
output_seq_len = 5

def generate_train_samples(x = X_train, y = y_train, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size, replace = False)

    input_batch_idxs = [list(range(i, i+input_seq_len)) for i in start_x_idx]
    input_seq = np.take(x, input_batch_idxs, axis = 0)

    output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in start_x_idx]
    output_seq = np.take(y, output_batch_idxs, axis = 0)

    return input_seq, output_seq # in shape: (batch_size, time_steps, feature_dim)

def generate_test_samples(x = X_test, y = y_test, input_seq_len = input_seq_len, output_seq_len = output_seq_len):

    total_samples = x.shape[0]

    input_batch_idxs = [list(range(i, i+input_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))]
    input_seq = np.take(x, input_batch_idxs, axis = 0)

    output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))]
    output_seq = np.take(y, output_batch_idxs, axis = 0)

    return input_seq, output_seq

So generate_train_samples randomly samples batches from the training data, and generate_test_samples simply converts all the test data into the 3-D format in sequence. We can print out the shapes of the processed data:

x, y = generate_train_samples()
print(x.shape, y.shape)

test_x, test_y = generate_test_samples()
print(test_x.shape, test_y.shape)
(10, 30, 11) (10, 5, 1)
(709, 30, 11) (709, 5, 1)

4. Train the model and visualize the results on test.

The model is no different from the multivariate one we used before, so I will not duplicate that part here.

The training part is also similar; we again set feed_previous to False for guided training. The only changes are the sequence lengths and dimensions, sketched below.
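
A minimal sketch of those parameters, matching the shapes printed above (everything else in build_graph stays the same):

## Network Parameters for the PM2.5 case
# length of input signals
input_seq_len = 30
# length of output signals
output_seq_len = 5
# size of LSTM Cell
hidden_dim = 64
# num of input signals: 11 features after one-hot encoding 'cbwd'
input_dim = X_train.shape[1]
# num of output signals: pm2.5 only
output_dim = y_train.shape[1]
# num of stacked lstm layers
num_stacked_layers = 2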

After training is done, we will visualize the results over the last month to evaluate our prediction performance, as below:

[Figure: predicted (orange) vs. actual (blue) pm2.5 over the last month]

Well, slightly off, but not too bad :) The orange line is the prediction and the blue is the actual value for each hour. I believe that with some tweaks to the hyperparameters one might get better results!

Any thoughts?

In this tutorial, we have demonstrated how to use the sequence-to-sequence architecture to learn and predict Time Series signals. We saw that the seq2seq model really has the power to sift through noisy information, uncover the true patterns, and ultimately discover the hidden signals that actually govern the data-generation mechanism.

We have seen that the model performs equally well in the univariate and multivariate cases, with the additional capability of encoding and decoding variable-length signals. We have also witnessed the flexibility of re-designing the seq2seq model to handle the extreme-events scenario, which is powerful for predicting the ‘unpredictability’ of the real world.

What other problems might be more challenging and useful to tackle in your business setting? Please share your ideas or thoughts, and we will see how to design a seq2seq model to fit your needs.

References

TensorFlow official repo – https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py

Another excellent blog about seq2seq for Time Series – https://github.com/guillaume-chevalier/seq2seq-signal-prediction

If you would like to try Keras – https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

7 thoughts on “The Amazing Effectiveness of Sequence to Sequence Model for Time Series”

  1. Great post. I am a beginner in Data Science/Machine Learning but I can’t wait to deepen my knowledge in multi variate time-series forecasts. It appears to be a very promising field. I’m going to print your post and read it carefully multiple times until I’ll be ready to deploy my own neural network with tensorflow.


    1. Glad to hear that! I am happy that this post might help you in some way and wish you great progress in ML!


  2. Hi Weimin,
    This is an excellent post, thank you!
    Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?
    This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.
    Regards,
    Michael J


    1. Hi Michael,

      Thanks for your comment. I would strongly encourage you to try this and see whether this can improve the forecast accuracy compared to model with no extra input, and share your results:). Yes indeed there are other possible inputs to provide in addition to boolean values like you said. And it will also depend on the data set.

      Thanks,
      Weimin


      1. I think I’ll give it a shot!
        I found a paper that uses the same technique, passing in the day, week and hour to the decoder:
        https://arxiv.org/abs/1610.09460
        It looks like they achieved good results.


      2. Thanks for the sharing. I checked the paper and it was quite an interesting case.

        I think I may also try this dataset to forecast weather polution – https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

        If I can get some interesting results, I will add it in to the post too. Stay tuned!

