The Amazing Effectiveness of Sequence to Sequence Model for Time Series

In this tutorial, I am excited to showcase examples of building Time Series forecasting model with seq2seq in TensorFlow. The purpose of this post is to give an intuitive as well as technical understanding of the implementations, and to demonstrate the two useful features under the hood:

  1. Multivariate input and output signals
  2. Variable input and output sequence lengths

After that, I will demonstrate how to apply what we have covered on something very interesting – to predict extreme events / outliers. 

Finally, we will end this tutorial by applying what we learnt into a real world case – forecasting Beijing pm2.5 pollution.

At any time, please feel free to jump to python notebook at my github if you want to skip reading.

Organisational structures:

  1. Univariate case
  2. Multivariate case
  3. Predict extreme events / outliers
  4. Case study on real world data set – forecasting Beijing PM2.5 pollution

A Univariate case

1. Data generation approach

Let’s say we want to learn the pattern of a sinusoidal wave like below:

sin_univariate

However, the real world data might be way more noisy than this, as shown below:

sin_univariate_noise

So we will sample data in batch from training, with each data as an input-output-sequence pair. The input and output lengths can be different as well. After training, we will feed the model with input sequence of test, and let the model give the predicted final output sequence.

The code for generating the training data are as follows:

input_seq_len = 15
output_seq_len = 20

x = np.linspace(0, 30, 105)
train_data_x = x[:(105-output_seq_len)]

def true_signal(x):
    y = 2 * np.sin(x)
    return y

def noise_func(x, noise_factor = 0.5):
    return np.random.randn(len(x)) * noise_factor

def generate_y_values(x):
    return true_signal(x) + noise_func(x)

def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = 15, output_seq_len = 20):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_y_values(x) for x in input_seq_x]
    output_seq_y = [generate_y_values(x) for x in output_seq_x]

    ## return shape: (batch_size, time_steps, feature_dims)
    return np.array(input_seq_y), np.array(output_seq_y)

Please note that we chose input and output sequence lengths to be 15 and 20, respectively. The last 20 elements from the array of ‘x’ are reserved to generate the test labels later on after we made the predictions, so we are not using them for training.

To see what the training data looks like, we can sample and visualise one pair of them:

input_seq, output_seq = generate_train_samples(batch_size=100)

l1, = plt.plot(range(15), input_seq[1], 'go', label = 'input sequence for one sample')
l2, = plt.plot(range(15, 35), output_seq[1], 'ro', label = 'output sequence for one sample')
plt.legend(handles = [l1, l2], loc = 'lower left')
plt.show()

one_train_sample_uni

2. The seq2seq model and how we train it

seq2seq_model

What we will do next is similar to what’s depicted above. The seq2seq model contains two RNNs, e.g. LSTMs, which are the ‘encoder’ and ‘decoder’. Each box here is one RNN cell. The encoder will first run through the whole input sequence (the ‘A’, ‘B’, ‘C’ here), encode them into a fixed length vector – the last hidden state of encoder LSTM – and pass that vector to decoder to decode into the output sequence. (the ‘W’, ‘X’, ‘Y’, ‘Z)

The training approach we adopt here is called ‘guided’ training: basically, during decoding steps, we will first fit a ‘GO’ token as initial input to decoder (this can be a zeros vector), and subsequently, we will provide the correct input to the decoder at every time-step, even if the decoder made a mistake before. However, during test or inference time, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.

The code is below:

def _rnn_decoder(decoder_inputs,
                initial_state,
                cell,
                loop_function=None,
                scope=None):
  """RNN decoder for the sequence-to-sequence model.
  Args:
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    initial_state: 2D Tensor with shape [batch_size x cell.state_size].
    cell: rnn_cell.RNNCell defining the cell function and size.
    loop_function: If not None, this function will be applied to the i-th output
      in order to generate the i+1-st input, and decoder_inputs will be ignored,
      except for the first element ("GO" symbol). This can be used for decoding,
      but also for training to emulate http://arxiv.org/abs/1506.03099.
      Signature -- loop_function(prev, i) = next
        * prev is a 2D Tensor of shape [batch_size x output_size],
        * i is an integer, the step number (when advanced control is needed),
        * next is a 2D Tensor of shape [batch_size x input_size].
    scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing generated outputs.
      state: The state of each cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
        (Note that in some cases, like basic RNN cell or GRU cell, outputs and
         states can be the same. They are different for LSTM cells though.)
  """
  with variable_scope.variable_scope(scope or "rnn_decoder"):
    state = initial_state
    outputs = []
    prev = None
    for i, inp in enumerate(decoder_inputs):
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      if i > 0:
        variable_scope.get_variable_scope().reuse_variables()
      output, state = cell(inp, state)
      outputs.append(output)
      if loop_function is not None:
        prev = output
  return outputs, state

def _basic_rnn_seq2seq(encoder_inputs,
                      decoder_inputs,
                      cell,
                      feed_previous,
                      dtype=dtypes.float32,
                      scope=None):
  """Basic RNN sequence-to-sequence model.
  This model first runs an RNN to encode encoder_inputs into a state vector,
  then runs decoder, initialized with the last encoder state, on decoder_inputs.
  Encoder and decoder use the same RNN cell type, but don't share parameters.
  Args:
    encoder_inputs: A list of 2D Tensors [batch_size x input_size].
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    feed_previous: Boolean; if True, only the first of decoder_inputs will be
      used (the "GO" symbol), all other inputs will be generated by the previous
      decoder output using _loop_function below. If False, decoder_inputs are used
      as given (the standard decoder case).
    dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
    scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing the generated outputs.
      state: The state of each decoder cell in the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  """
  with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
    enc_cell = copy.deepcopy(cell)
    _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
    if feed_previous:
        return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function)
    else:
        return _rnn_decoder(decoder_inputs, enc_state, cell)

def _loop_function(prev, _):
  '''Naive implementation of loop function for _rnn_decoder. Transform prev from
  dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be
  used as decoder input of next time step '''
  return tf.matmul(prev, weights['out']) + biases['out']

This is basically the same function from the official seq2seq repo. However, I did make a change to the basic_rnn_seq2seq function by adding in the ‘feed_previous‘ feature, which has not yet been implemented in TensorFlow. This will be feeding decoder output at time t as input for time t+1, as discussed. So during the decoding of inference phase, only the first element of decoder_inputs will be used.

However, during training phase, decoder_inputs will still be used as input at each time step t.

The _loop_function simply serves as a util function to transform the decoder cell output into the dimensions of input, as the LSTM Cell may have different dimensions for both input and output. And we will define the weights[‘out’] and biases[‘out’] later on when we build the model.

Now with the core _basic_rnn_seq2seq function implemented, we can then move on to building the whole graph, including steps such as variables & placeholders declarations, formatting inputs & outputs, as well as computing loss and defining training ops.

Here’s the whole graph building:

from tensorflow.contrib import rnn
from tensorflow.python.ops import variable_scope
from tensorflow.python.framework import dtypes

## Parameters
learning_rate = 0.01
lambda_l2_reg = 0.003  

## Network Parameters
# length of input signals
input_seq_len = 15
# length of output signals
output_seq_len = 20
# size of LSTM Cell
hidden_dim = 64
# num of input signals
input_dim = 1
# num of output signals
output_dim = 1
# num of stacked lstm layers
num_stacked_layers = 2
# gradient clipping - to avoid gradient exploding
GRADIENT_CLIPPING = 2.5 

def build_graph(feed_previous = False):

    tf.reset_default_graph()

    global_step = tf.Variable(
                  initial_value=0,
                  name="global_step",
                  trainable=False,
                  collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES])

    weights = {
        'out': tf.get_variable('Weights_out', \
                               shape = [hidden_dim, output_dim], \
                               dtype = tf.float32, \
                               initializer = tf.truncated_normal_initializer()),
    }
    biases = {
        'out': tf.get_variable('Biases_out', \
                               shape = [output_dim], \
                               dtype = tf.float32, \
                               initializer = tf.constant_initializer(0.)),
    }

    with tf.variable_scope('Seq2seq'):
        # Encoder: inputs
        enc_inp = [
            tf.placeholder(tf.float32, shape=(None, input_dim), name="inp_{}".format(t))
               for t in range(input_seq_len)
        ]

        # Decoder: target outputs
        target_seq = [
            tf.placeholder(tf.float32, shape=(None, output_dim), name="y".format(t))
              for t in range(output_seq_len)
        ]

        # Give a "GO" token to the decoder.
        # If dec_inp are fed into decoder as inputs, this is 'guided' training; otherwise only the
        # first element will be fed as decoder input which is then 'un-guided'
        dec_inp = [ tf.zeros_like(target_seq[0], dtype=tf.float32, name="GO") ] + target_seq[:-1]

        with tf.variable_scope('LSTMCell'):
            cells = []
            for i in range(num_stacked_layers):
                with tf.variable_scope('RNN_{}'.format(i)):
                    cells.append(tf.contrib.rnn.LSTMCell(hidden_dim))
            cell = tf.contrib.rnn.MultiRNNCell(cells)

        def _rnn_decoder(decoder_inputs,
                        initial_state,
                        cell,
                        loop_function=None,
                        scope=None):
          """RNN decoder for the sequence-to-sequence model.
          Args:
            decoder_inputs: A list of 2D Tensors [batch_size x input_size].
            initial_state: 2D Tensor with shape [batch_size x cell.state_size].
            cell: rnn_cell.RNNCell defining the cell function and size.
            loop_function: If not None, this function will be applied to the i-th output
              in order to generate the i+1-st input, and decoder_inputs will be ignored,
              except for the first element ("GO" symbol). This can be used for decoding,
              but also for training to emulate http://arxiv.org/abs/1506.03099.
              Signature -- loop_function(prev, i) = next
                * prev is a 2D Tensor of shape [batch_size x output_size],
                * i is an integer, the step number (when advanced control is needed),
                * next is a 2D Tensor of shape [batch_size x input_size].
            scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
          Returns:
            A tuple of the form (outputs, state), where:
              outputs: A list of the same length as decoder_inputs of 2D Tensors with
                shape [batch_size x output_size] containing generated outputs.
              state: The state of each cell at the final time-step.
                It is a 2D Tensor of shape [batch_size x cell.state_size].
                (Note that in some cases, like basic RNN cell or GRU cell, outputs and
                 states can be the same. They are different for LSTM cells though.)
          """
          with variable_scope.variable_scope(scope or "rnn_decoder"):
            state = initial_state
            outputs = []
            prev = None
            for i, inp in enumerate(decoder_inputs):
              if loop_function is not None and prev is not None:
                with variable_scope.variable_scope("loop_function", reuse=True):
                  inp = loop_function(prev, i)
              if i > 0:
                variable_scope.get_variable_scope().reuse_variables()
              output, state = cell(inp, state)
              outputs.append(output)
              if loop_function is not None:
                prev = output
          return outputs, state

        def _basic_rnn_seq2seq(encoder_inputs,
                              decoder_inputs,
                              cell,
                              feed_previous,
                              dtype=dtypes.float32,
                              scope=None):
          """Basic RNN sequence-to-sequence model.
          This model first runs an RNN to encode encoder_inputs into a state vector,
          then runs decoder, initialized with the last encoder state, on decoder_inputs.
          Encoder and decoder use the same RNN cell type, but don't share parameters.
          Args:
            encoder_inputs: A list of 2D Tensors [batch_size x input_size].
            decoder_inputs: A list of 2D Tensors [batch_size x input_size].
            feed_previous: Boolean; if True, only the first of decoder_inputs will be
              used (the "GO" symbol), all other inputs will be generated by the previous
              decoder output using _loop_function below. If False, decoder_inputs are used
              as given (the standard decoder case).
            dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
            scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
          Returns:
            A tuple of the form (outputs, state), where:
              outputs: A list of the same length as decoder_inputs of 2D Tensors with
                shape [batch_size x output_size] containing the generated outputs.
              state: The state of each decoder cell in the final time-step.
                It is a 2D Tensor of shape [batch_size x cell.state_size].
          """
          with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
            enc_cell = copy.deepcopy(cell)
            _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
            if feed_previous:
                return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function)
            else:
                return _rnn_decoder(decoder_inputs, enc_state, cell)

        def _loop_function(prev, _):
          '''Naive implementation of loop function for _rnn_decoder. Transform prev from
          dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be
          used as decoder input of next time step '''
          return tf.matmul(prev, weights['out']) + biases['out']

        dec_outputs, dec_memory = _basic_rnn_seq2seq(
            enc_inp,
            dec_inp,
            cell,
            feed_previous = feed_previous
        )

        reshaped_outputs = [tf.matmul(i, weights['out']) + biases['out'] for i in dec_outputs]

    # Training loss and optimizer
    with tf.variable_scope('Loss'):
        # L2 loss
        output_loss = 0
        for _y, _Y in zip(reshaped_outputs, target_seq):
            output_loss += tf.reduce_mean(tf.pow(_y - _Y, 2))

        # L2 regularization for weights and biases
        reg_loss = 0
        for tf_var in tf.trainable_variables():
            if 'Biases_' in tf_var.name or 'Weights_' in tf_var.name:
                reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var))

        loss = output_loss + lambda_l2_reg * reg_loss

    with tf.variable_scope('Optimizer'):
        optimizer = tf.contrib.layers.optimize_loss(
                loss=loss,
                learning_rate=learning_rate,
                global_step=global_step,
                optimizer='Adam',
                clip_gradients=GRADIENT_CLIPPING)

    saver = tf.train.Saver

    return dict(
        enc_inp = enc_inp,
        target_seq = target_seq,
        train_op = optimizer,
        loss=loss,
        saver = saver,
        reshaped_outputs = reshaped_outputs,
        )

The ‘feed_previous‘ will be set to True during inference phase, and False during training.

The input and output dims are both 1s for our univariate case, and input and output sequences can have different lengths in this case.

3. Train the model

We will use the ‘guided’ training approach as discussed previously:

total_iteractions = 100
batch_size = 16
KEEP_RATE = 0.5
train_losses = []
val_losses = []

x = np.linspace(0, 30, 105)
train_data_x = x[:85]

rnn_model = build_graph(feed_previous=False)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    for i in range(total_iteractions):
        batch_input, batch_output = generate_train_samples(batch_size=batch_size)

        feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t].reshape(-1,input_dim) for t in range(input_seq_len)}
        feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t].reshape(-1,output_dim) for t in range(output_seq_len)})
        _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict)
        print(loss_t)

    temp_saver = rnn_model['saver']()
    save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0'))

print("Checkpoint saved at: ", save_path)

And we are seeing our training loss decreases as expected:

 41.1176
 15.1229
 38.0447
 17.5147
 10.7143
 15.86
 11.0018
 9.36498
 8.26214
 ...
 6.23474
 5.99408
 6.12859
 5.7535
 5.7275
 6.19146
 Checkpoint saved at: /home/weimin/seq2seq/univariate_ts_model0

4. Inference and visualization

For prediction, we will use the last sequence of 15 values from train_data_x as input signal, and let the model predict without feeding in the true labels (by setting ‘feed_previous‘ to True)

test_seq_input = true_signal(train_data_x[-15:])

rnn_model = build_graph(feed_previous=True)

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    saver = rnn_model['saver']().restore(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0'))

    feed_dict = {rnn_model['enc_inp'][t]: test_seq_input[t].reshape(1,1) for t in range(input_seq_len)}
    feed_dict.update({rnn_model['target_seq'][t]: np.zeros([1, output_dim]) for t in range(output_seq_len)})
    final_preds = sess.run(rnn_model['reshaped_outputs'], feed_dict)

    final_preds = np.concatenate(final_preds, axis = 1)

We will then visualize the predictions against the true labels below:

l1, = plt.plot(range(85), true_signal(train_data_x[:85]), label = 'Training truth')
l2, = plt.plot(range(85, 105), y[85:], 'yo', label = 'Test truth')
l3, = plt.plot(range(85, 105), final_preds.reshape(-1), 'ro', label = 'Test predictions')
plt.legend(handles = [l1, l2, l3], loc = 'lower left')
plt.show()

test_case_uni

Through a few iterations of learning, the model has already figured out the hidden signal from the noisy training data, which is amazing.

univariate_noisy_training_hidden_truth.png

Multivariate case

This one should be more exciting!

So what if we have more than one signals for input and output? And this is especially true for real world scenario. Let’s say we would like to predict the supply & demand for the next 30 minutes; what affect the results may not only be the past supply & demand, but also for things like weather, rain, temperature, app usages, traffic signals, whether it is a public holiday, etc.

1. Simulate input and output signals

We start with two cosine and sine signals x1 and x2 – these are our outputs to predict! From them, we derive three additional signals – y1, y2, y2 – as our input sequences, using some random formula as below:

x = np.linspace(0, 40, 130)
x1 = 2*np.sin(x)
x2 = 2*np.cos(x)

## random formula to generate input sequences
y1 = 1.6*x1**4 - 2*x2 - 10
y2 = 1.2*x2**2*x1 + 6*x2 - 6*x1
y3 = 2*x1**3 + 2*x2**3 - x1*x2

So overall, the structure will be like 3-in-2-out.

We can also visualize inputs as follow:

three_signals_multi_rgm

Put it another way, we want our model to uncover the hidden signals from the observed ones, as plotted below:

hidden_facts

This is more interesting as we have reached stage of being flexible both on the number of signals as well as length of sequences.

You are free to explore different number of input and output sequences, if you want to.

2. Sampling training data

Let’s rewrite data sampling mechanism:

input_seq_len = 15
output_seq_len = 20
x = np.linspace(0, 40, 130)
train_data_x = x[:(130 - output_seq_len)]

def true_output_signals(x):
    x1 = 2 * np.sin(x)
    x2 = 2 * np.cos(x)
    return x1, x2

def true_input_signals(x):
    x1, x2 = true_output_signals(x)
    y1 = 1.6*x1**4 - 2*x2 - 10
    y2 = 1.2*x2**2 * x1 + 2*x2*3 - x1*6
    y3 = 2*x1**3 + 2*x2**3 - x1*x2
    return y1, y2, y3

def noise_func(x, noise_factor = 2):
    return np.random.randn(len(x)) * noise_factor

def generate_samples_for_output(x):
    x1, x2 = true_output_signals(x)
    return x1+noise_func(x1, 0.5), \
           x2+noise_func(x2, 0.5)

def generate_samples_for_input(x):
    y1, y2, y3 = true_input_signals(x)
    return y1+noise_func(y1, 2), \
           y2+noise_func(y2, 2), \
           y3+noise_func(y3, 2)

def generate_train_samples(x = train_data_x, batch_size = 10):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_samples_for_input(x) for x in input_seq_x]
    output_seq_y = [generate_samples_for_output(x) for x in output_seq_x]

    ## return shape: (batch_size, time_steps, feature_dims)
    return np.array(input_seq_y).transpose(0, 2, 1), np.array(output_seq_y).transpose(0, 2, 1)

As usual, we can visualize one training sample:

train sample multi

3. Build the model

Luckily, we don’t need to change too much of the code for building the graph. What we need to change are only the input_dim = 3 and output_dim = 2.

I will skip pasting the same codes here.

4. Train the model

Training is also no much different. We will again set the ‘feed_previous‘ as False for ‘guided’ training – feeding in the correct decoder input at each time step. So here’s the code:

total_iteractions = 100
batch_size = 16
KEEP_RATE = 0.5
train_losses = []
val_losses = []

x = np.linspace(0, 40, 130)
train_data_x = x[:110]

rnn_model = build_graph(feed_previous=False)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
with tf.Session() as sess:

    sess.run(init)

    print("Training losses: ")
    for i in range(total_iteractions):
        batch_input, batch_output = generate_train_samples(batch_size=batch_size)

        feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t] for t in range(input_seq_len)}
        feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t] for t in range(output_seq_len)})
        _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict)
        print(loss_t)

    temp_saver = rnn_model['saver']()
    save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'multivariate_ts_model0'))

print("Checkpoint saved at: ", save_path)

5. Inference and visualisation

Once training is done, we can visualise the prediction performance on the test output sequence.

test_results_multi_outputOnly

To put them together, we can also visualize both the input and output sequences at the same time (I scaled up the output sequences just to make them more obvious):

test_case_results_multi

Outliers / Extreme events

1. What could be an outlier / extreme event case?

outlier

This will be very useful for companies when are dealing with real time traffic, for example. There are many factors that could be the triggers for anomaly events, for example, the sudden rain of the day could lead to surge of demand of taxis on the road. Public holidays could be another factor which might lead to increase in demands for the whole day.

2. Case & data simulation

Let’s assume our traffic patterns are like below, where y-axis is the number of traffic, and x-axis is the days.

Image that during public holidays, we will notice an additional 2 units of traffic will be added on top of what we have for that day, as reflected by those sharp peaks in the graph.

traffic_days_extreme

Of course, the real data will have noise for both traffic on normal days as well as on public holidays (which means the true effect of PH – 2 units – is unknown to us, which the model needs to learn through the data). What we do know, however, is whether each day is a public holiday (1) or not (0). Here’s the code for data generation:

input_seq_len = 15
output_seq_len = 20
extreme_factor = 2 # 2 additional units of traffic increase if extreme / outlier

x = np.linspace(0, 60, 210)
train_data_x = x[:190]

np.random.seed(10)
num_events_train = 40 # total number of extreme events in training data
train_events_bool = np.zeros(190)
train_events_x = np.random.choice(range(190), num_events_train)
train_events_bool[train_events_x] = 1

num_events_test = 1 # total number of extreme events in test data
test_events_bool = np.zeros(20)
test_events_x = np.random.choice(range(20), num_events_test)
test_events_bool[test_events_bool] = 1

def true_signal(x):
    y = 2 * np.sin(x)
    return y

def noise_func(x, noise_factor = 1):
    return np.random.randn(len(x)) * noise_factor

def generate_y_values(x, event_bool):
    return true_signal(x) + noise_func(x) + extreme_factor * event_bool

def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len, train_events_bool = train_events_bool):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size)

    input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx]
    input_seq_event_bool = [train_events_bool[i:(i+input_seq_len)] for i in start_x_idx]

    output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]
    output_seq_event_bool = [train_events_bool[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx]

    input_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(input_seq_x, input_seq_event_bool)]
    output_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(output_seq_x, output_seq_event_bool)]

    #batch_x = np.array([[true_signal()]])
    return np.array(input_seq_y), np.array(output_seq_y), np.array(input_seq_event_bool), np.array(output_seq_event_bool)

As always, we can sample a training data to see what it looks like:

input_seq, output_seq, input_seq_event_bool, output_seq_event_bool = generate_train_samples(batch_size=10)
l1, = plt.plot(range(15), input_seq[0], 'go', label = 'input sequence for one sample')
l2, = plt.plot(range(15, 35), output_seq[0], 'ro', label = 'output sequence for one sample')
plt.legend(handles = [l1, l2], loc = 'lower left')
plt.show()

one_sample_train_extreme_events

Quite noisy to see the extreme points though. However, you can simply turn off the noise factor (set it to zero) to see clearly the extreme events. But I will skip this verification step here just to save the space.

3. If we train the same seq2seq as before …

Now if we use the simulated data to train our seq2seq model previously, we won’t get a very good result. The one outlier in the test data will not be captured by the model. (It’s easy to understand this as the extreme events are totally random, and model has no idea which days they will fall on for both training and test phases.)

So if you test the model, you should see something similar to this:

extreme_events_without_bool_200itera.png

  1. The model has poor fit for test data. 
  2. The model has no awareness / differentiation on the outlier.  

4. Re-design the seq2seq

How do we redesign the model to deal with extreme events?

Since we already have the knowledge of whether each day is a public holiday, we can pass that ‘knowledge’ to the decoder at each time step i, to let the decoder learn the differentiation between normal days and public holidays. 

In summary, here are the steps:

  1. Get the bool vector for output sequence (‘1’ for public holiday, ‘0’ for normal days)
  2. During decoding, at each time step t, concatenate the bool value to the original decoder input as new input
  3. Feed the new input to decoder to generate output at t
  4. Continue until the decoding phase is done

We won’t change the rest parts such as ‘guided’ training, and we also won’t touch the encoding phase. With these steps implemented, we should be able to see our improved results on test data:

extreme_events_with_bool_2

Overall it fits well, and most importantly, the outlier point was perfectly captured!

Please refer to my github page for complete codes and notebook.

Case study on Beijing PM2.5 pollution forecasting

1. The case introduction.

Finally, we will apply what we learnt on a real-world data set – the hourly data set that contains the PM2.5 data of US Embassy in Beijing.

If can download the data from here. The complete list of features included in the raw data set is:

  1. No: row number
  2. year: year of data in this row
  3. month: month of data in this row
  4. day: day of data in this row
  5. hour: hour of data in this row
  6. pm2.5: PM2.5 concentration
  7. DEWP: Dew Point
  8. TEMP: Temperature
  9. PRES: Pressure
  10. cbwd: Combined wind direction
  11. Iws: Cumulated wind speed
  12. Is: Cumulated hours of snow
  13. Ir: Cumulated hours of rain

Our task here, however, is only to predict the pollution factor – pm2.5 – for the next few hours, based on historical trends of other features. In short, it’s going to be a multi-variate case with n-input-1-ouput.

2. Visualisation and pre-processing of data

Beijing_data_head.png

beijing_vis

A quick df.head() function plus simple plots give us an intuitive understanding of the data set.

All features plotted above are numeric / continuous. However, there’s one more called cbwd which is categorical, because it is referring to the wind directions. We will do one-hot encoding for that. I will not be using data such as year, month, day or hour as my features. However, you are encouraged to add in those as features as well, to see if those might help your performance.

Further more, there are no NAs values for all features except for pm2.5 – there are 2067 out of 43824 are NAs in pm2.5. And we will simply fill NAs with 0.

Regarding train and test splits – I will use last one month data for test, and all data before that as my training sample. Feel free to split according to your preferences.

Finally, I will do a simply processing step to normalize all data using z-score.

The overall scripts are below:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_csv('./PRSA_data_2010.1.1-2014.12.31.csv')
print(df.head())

cols_to_plot = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
i = 1
# plot each column
plt.figure(figsize = (10,12))
for col in cols_to_plot:
    plt.subplot(len(cols_to_plot), 1, i)
    plt.plot(df[col])
    plt.title(col, y=0.5, loc='left')
    i += 1
plt.show()

## Fill NA with 0
#print(df.isnull().sum())
df.fillna(0, inplace = True)

## One-hot encode 'cbwd'
temp = pd.get_dummies(df['cbwd'], prefix='cbwd')
df = pd.concat([df, temp], axis = 1)
del df['cbwd'], temp

## Split into train and test - I used the last 1 month data as test, but it's up to you to decide the ratio
df_train = df.iloc[:(-31*24), :].copy()
df_test = df.iloc[-31*24:, :].copy()

## take out the useful columns for modeling - you may also keep 'hour', 'day' or 'month' and to see if that will improve your accuracy
X_train = df_train.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy()
X_test = df_test.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy()
y_train = df_train['pm2.5'].values.copy().reshape(-1, 1)
y_test = df_test['pm2.5'].values.copy().reshape(-1, 1)

## z-score transform x - not including those one-how columns!
for i in range(X_train.shape[1]-4):
    temp_mean = X_train[:, i].mean()
    temp_std = X_train[:, i].std()
    X_train[:, i] = (X_train[:, i] - temp_mean) / temp_std
    X_test[:, i] = (X_test[:, i] - temp_mean) / temp_std

## z-score transform y
y_mean = y_train.mean()
y_std = y_train.std()
y_train = (y_train - y_mean) / y_std
y_test = (y_test - y_mean) / y_std

3. Transforming training and test data into 3-D formats.

Now we have X_train, y_train, X_test, y_test as our train and test data sets, we will then need to transform them into 3-D formats for time series – (batch_size, time_steps, feature_dim)

We will prepare two util functions to do so:

input_seq_len = 30
output_seq_len = 5

def generate_train_samples(x = X_train, y = y_train, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len):

    total_start_points = len(x) - input_seq_len - output_seq_len
    start_x_idx = np.random.choice(range(total_start_points), batch_size, replace = False)

    input_batch_idxs = [list(range(i, i+input_seq_len)) for i in start_x_idx]
    input_seq = np.take(x, input_batch_idxs, axis = 0)

    output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in start_x_idx]
    output_seq = np.take(y, output_batch_idxs, axis = 0)

    return input_seq, output_seq # in shape: (batch_size, time_steps, feature_dim)

def generate_test_samples(x = X_test, y = y_test, input_seq_len = input_seq_len, output_seq_len = output_seq_len):

    total_samples = x.shape[0]

    input_batch_idxs = [list(range(i, i+input_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))]
    input_seq = np.take(x, input_batch_idxs, axis = 0)

    output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))]
    output_seq = np.take(y, output_batch_idxs, axis = 0)

    return input_seq, output_seq

So generate_train_samples will randomly sample batches from training, and generate_test_samples will simply convert all test data into 3-D formats in sequence. We can print out the shapes of the processed data:

x, y = generate_train_samples()
print(x.shape, y.shape)

test_x, test_y = generate_test_samples()
print(test_x.shape, test_y.shape)
(10, 30, 11) (10, 5, 1)
(709, 30, 11) (709, 5, 1)

4. Train the model and visualize the results on test.

The model to use will be no different than the multi-variate one we used before. So I will not duplicate that part here.

Training part is also similar, we can set feed_previous to False for guided training.

After training is done, we will visualize the results over the last month to evaluate our prediction performance, as below:

Beijing_final_result.png

Well, slightly off but not too bad :) The orange is predicted and blue is actual value for each hour. I believe with some tweaks on the hyperparameters one might get a better results!

Any thoughts?

In this tutorial, we have demonstrated how to use sequence to sequence architecture to learn and predict Time Series signals. We learnt that seq2seq model really has the power to sift through the noisy information, uncover the true patterns, and eventually comes down to discovering the hidden signals that are actually governing the data generation mechanism.

We have seen that the model actually performs equally well on both univariate as well as multivariate cases, with the additional capability of encoding and decoding variable lengths of signals. We have also witnessed the flexibility of re-designing the seq2seq model to take care of the extreme events scenario, which is powerful predict the ‘unpredictability’ of the real world.

What other problems may be more challenging and useful to tackle in your business settings? Please kindly share your ideas or thoughts, and we will see how to design the seq2seq to fit your needs.

References

TensorFlow official repo – https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py

Another excellent blog about seq2seq for Time Series –    https://github.com/guillaume-chevalier/seq2seq-signal-prediction

If you would like to try Keras –                    https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

Categories Uncategorized

42 thoughts on “The Amazing Effectiveness of Sequence to Sequence Model for Time Series

  1. Great post. I am a beginner in Data Science/Machine Learning but I can’t wait to deepen my knowledge in multi variate time-series forecasts. It appears to be a very promising field. I’m going to print your post and read it carefully multiple times until I’ll be ready to deploy my own neural network with tensorflow.

    Like

    1. Glad to hear that! I am happy that this post might help you in some way and wish you great progress in ML!

      Like

  2. Hi Weimin,
    This is an excellent post, thank you!
    Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?
    This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.
    Regards,
    Michael J

    Like

  3. Hi Weimin,
    This is an excellent post, thank you!
    Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?
    This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.
    Regards,
    Michael J

    Like

    1. Hi Michael,

      Thanks for your comment. I would strongly encourage you to try this and see whether this can improve the forecast accuracy compared to model with no extra input, and share your results:). Yes indeed there are other possible inputs to provide in addition to boolean values like you said. And it will also depend on the data set.

      Thanks,
      Weimin

      Like

      1. I think I’ll give it a shot!
        I found a paper that uses the same technique, passing in the day, week and hour to the decoder:
        https://arxiv.org/abs/1610.09460
        It looks like they achieved good results.

        Like

      2. Thanks for the sharing. I checked the paper and it was quite an interesting case.

        I think I may also try this dataset to forecast weather polution – https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

        If I can get some interesting results, I will add it in to the post too. Stay tuned!

        Like

  4. Hi Weimin,

    Congrats for this clear and very useful post !
    Do you think the attention wrapper could be added ? I mean, does it make sense ?

    Many thanks

    Like

    1. Thanks for your comments!

      I thought about attention before but I feel it may not help much. Because unlike translation problem where you have alignment issues, the time series output here does not have similar patterns where the earlier part of output sequence is more relevant to earlier part of input, and so on.

      Like

  5. Weimin,

    This is the best time series tutorial I’ve seen so far, you just didn’t show a sine wave with noise in a 1-to-1 prediction. You went a lot further!
    Thanks for providing the code, keep up with the good job!

    Like

  6. Weimin,

    Thanks for your excellent article. I tried to use your method for my time series prediction and I am running into overfitting problems (works great on training data but poorly on test data). I played around with number of neurons and lamda values. I read that randomly dropping out neurons might help me. Any thoughts on that ?

    Like

    1. It might be due to your training samples are too small or your features are too simple.

      And yes, adding dropout may help as well.

      Like

  7. Thanks. Is it too much to ask for an example how dropouts can be implemented in your codes ?

    Like

  8. Hi Weimin, Thanks for sharing the code and tutorial.
    If the problem is pure sequence to sequence mapping, can this model be used to solve the problem. For instance, one training pair sample is like x = [0.001, -0.11, 200.001], and y =[0.00, 0.123, 0.00]. Physically, y can be considered as a functional mapping of x, but y does not have a time dependence relationship as x. This is different with the data set that you are using.
    The test case will be also a string like test = [-200, 0.0023, -120], and we would like to get mapping string of this “test” string. In other words, the use case is more like pure sequence-to-sequence mapping without time factor. Can this code be used to solve this kind of use case?

    Like

    1. I think so, as long as both input and output sequences are ordered list of elements. The reason it solves time series problem well is also due to the ordering natural of data – the model don’t really know whether it is every second or every minute, as long as it is ordered.

      Like

  9. Weimin, Thanks for your post and sharing your code.

    I am just a beginner who has just started.
    I tested the code by changing only the input data in your code.
    However, the resulting graph was very different.
    I am still studying, but I do not know why.
    So if you can contact me at the below email address, can you tell me the problem of the code I sent you?
    My E-mail Address is onlytheworld1@gmail.com.
    I actually use the same code you shared, I don’t know why the predictions are too different.
    Please help me.
    Thank you.

    Like

    1. Hi, if you use different data sets, please make sure you have normalized / scaled them correctly before you feed them to the model.

      Like

  10. Hi Weiming, in your model, there are several parameters that define the model architecture and training process
    learning_rate = 0.01
    lambda_l2_reg = 0.003
    # size of LSTM Cell
    hidden_dim = 64
    # num of stacked lstm layers
    num_stacked_layers = 2
    # gradient clipping – to avoid gradient exploding
    GRADIENT_CLIPPING = 2.5

    Are there any generic rules and experiences that you can share for modifying these parameters.
    For instance, there has a use case, where both input length and output length are 120, there are about 2000 training pairs, what may be a reasonable

    Like

    1. I am not aware of generic rule to select the hyper-parameters. I would recommend using validation / CV to help guide you on params selection. The number of stacked lstm layers and size of LSTM cell depend on the complexity of your data set. And you can start the remaining params by using my values as default.

      Like

  11. Mr. Wang,

    What a great guide. I have — rather unsuccessfully — attempted to train a model to predict speed-profiles (i.e. time-series depicting the speed of a vehicle) using ordinary RNN networks. I have been looking all over the internet for alternatives and I think sequence-to-sequence models might be the next thing to try.

    I wonder if you could give me some input on a implementation specific issue that I have. I am trying to implement a model which takes as input a single sample and outputs “all remaining samples”. The idea is that a vehicle has an initial speed v0, which is fed to the model. The model returns [v1, v2, …, vN]. At the next step in time, the model gets the true value of v1 (because we have it stored during training, and sample it in real time during inference) and returns [v2, v3, …, vN, 0].

    My plan is to use a model like the one in this sketch -> https://i.imgur.com/otVdYgR.jpg <- but I lack in understading of how to implement such a model. I have the fullest understanding for you not having time to dwell into this, but if you find this interesting I would greatly appreciate any comments of yours regarding either feasibility of the model/task or actual ideas regarding how one could achieve such a model.

    Like

  12. Interesting idea. I understood that you have the current / previous value of speed as input to your model at each time stamp t to predict for the next m time stamps, but since this is sequence to sequence, why don’t you consider using previous n speed values (v0, v1, …, vn-1) as input for the model instead, to predict the speed values for the next m time stamps (vn, vn+1, …, vn+m-1)?
    Also, you might also consider having more features in addition to speed value to feed into your model. For example, engine power, acceleration, petrol level, etc.. Am just guessing.

    Like

  13. Dear Mr. Wang,
    Thank you very much for your post. I have first a general question: am I wrong or are the only initialized weights and biases those that pertain to the decoder layer? Don’t we have to initialize randomly the input weights too? What does the encoder do exactly in this type of architecture? Is it a sort of “autoencoder feature extractor” that selects only the relevant inputs from the data and feeds it to the decoder?
    Thank you.

    Like

    1. The LSTM cells in both encoder and decoder are initialised with the same type and shape, but their weights are not shared during training. In addition, we also initialise transforming weight and biase, as in my `weights` and `biases`. The input gate weights are initialized in LSTM cells as well.

      Like

      1. Thank you very much for your reply. So the different weights of all LSTM cells are initialized the same way as Weight-out and biases_out, although we do not do it explicitly cell by cell.
        What about the encoder role please? Is it the same as a feature extractor than extracts the relevant information and stores them in a vector state before feeding them to the decoder?
        Regards

        Like

      2. The LSTM cells’ weights matrix are constructed in the graph upon calling of `_basic_rnn_seq2seq` function – the rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) does the magic. It will do so for both encoder and decoder. However, all variables are initialized only when you executed tf.global_variables_initializer() in your session.

        Like

  14. Also, I am facing a cery stange issue with my dataset. the cost during training phase decreases and eventually drops to zero however the weights and biases I restore from the Saver are all equal to zero (the weights have been initialized randomly and the data has been scaled between -1 and 1). Has someone already faced this issue?
    Thank you very much for you help.

    Like

  15. Dear all,
    I have an issue and connot find the solution, the test loss does not decrease at all and all my weights at the end of the training phase are zéros. I am using the logique above with my own dataset after normalization. The trainig is done on the first two years and the test phase on the last year. I use a batch size of 6 months for training and testing and an epoch number of 1000, 5000 ….
    Cannot past by loss graph but hope it’s clear…
    Help please !!

    Like

  16. Hello, Thank you this nice tutorial. Could you please explain the difference (and the Relationship) between input_size and batch_size parameters?
    regards

    Like

    1. input_size: dimension of your input data. i.e. number of input time series signals.
      batch_size: number of training samples in each training batch.
      each training sample will have fixed number of time steps, and at each time step, it has a fixed dimension which is input_size.

      Like

      1. Thank you Weimin for this great post.
        I have to one question:
        Similar to the above question, does this batch_size means the number of input samples into the encoder(with 30 timesteps)? To my knowledge, this batch_size in the article is not the same as the batch size that we usually use while training a common feedforward neural network. Am i right?
        In the pollution case, you use 16 as the batch size, that means that the number of training data is only 16 with timesteps of encoder as 30 and timesteps of decoder as 5. It seems the number of training data choosen from {len(X_train) – input_seq_len – output_seq_len = 43824-30-5} is too less, comparing to the complete length of the whole dateset, for forecasting the next 5 steps. How come it can show such good results? I’m very confused.
        Can you help to explain this? Thank you very much.

        Like

  17. How well do these models tend to work on prediction of multiple different-length time series? For example, if your training data is a set of 1000 time series, each of varying lengths (from two time points to a million time points) and each with n-number of features that are vectors of the same length time series.

    Like

  18. Hi,
    Great Blog. I learned a lot.
    I added 2 dropout layers and the mse lowered for me.
    If its possible, can you add a tutorial for seq2seq in keras (with attention ) ?

    Like

    1. Glad that you like it.

      Will try to do keras in the future, but for now I believe there are definitely good ones out there you can find. The reason I used tf instead of keras for this one is because I can get down to the lower level details of the model, which will be a good learning case.

      Like

  19. Hi Weiming,

    Thanks for sharing so much insight in this tutorial. It has helped me get a concrete understanding on RNN forecast for time series data. Just have one question about the last example with real world dataset. The default activation function for lstm is tanh, which is supposed to be in the range of (-1, 1). But the forecast does reach much higher value than 1 before scaling with mean and std. This observation has been validated in my own dataset. What am I missing here?

    Thanks,
    Hang

    Like

    1. You have one more dense layer connection to transform LSTM cell state to output, which will scale the ranges to match target.

      Like

  20. How to get train and validation losses to record at the training time ( for tensorboard visualization )
    because we have different graphs in training and evaluation time, do we have to use
    Three models in three graphs, with three Sessions sharing the same Variables like mentioned in
    here: https://www.tensorflow.org/tutorials/seq2seq#building_training_eval_and_inference_graphs
    or is there any simpler method?

    Like

  21. Great sharing and thanks a lot! And I try the code matching with my dataset and run well. I will try to use attention for seq2seq later if any improvement. One small comment also, is it better to use replace=False for np.random.choice, which to ensure all unique records for training.

    Like

  22. Hi Weimin,
    This post helps me a lot when I am searching LSTM. As a student I strongly want to share your post to my classmates.I was wondering if it is possbile for you to authorize me to translate it into Chinese.I will mark out your name and source address.
    Thanks for your work.:)
    Li

    Like

    1. I am happy to hear it helps. Please feel free to do so :)

      Like

  23. Hi Wang, Thanks for this great article. I was trying to use it for a seq2seq prediction. However, I was unable to fix the reproducibility. Every time, I run the model with the same parameters I get a different result. I can fix the training data using numpy seeds. But I was failed to use any tf random seeds and reproduce the same result for same params. Can you please help with that?

    Thanks again for a good article. Have a great time!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close