In this tutorial, I am excited to showcase examples of **building Time Series forecasting model with seq2seq** in TensorFlow. The purpose of this post is to give an intuitive as well as technical understanding of the implementations, and to demonstrate the two useful features under the hood:

**Multivariate input and output signals****Variable input and output sequence lengths**

After that, I will demonstrate how to apply what we have covered on something very interesting – **to** **predict extreme events / outliers. **

Finally, we will end this tutorial by applying what we learnt into a real world case – **forecasting Beijing pm2.5 pollution.**

At any time, please feel free to jump to python notebook at my github if you want to skip reading.

Organisational structures:

- Univariate case
- Multivariate case
- Predict extreme events / outliers
- Case study on real world data set – forecasting Beijing PM2.5 pollution

## A Univariate case

### 1. Data generation approach

Let’s say we want to learn the pattern of a sinusoidal wave like below:

However, the real world data might be way more noisy than this, as shown below:

So we will sample data in batch from training, with each data as an **input-output-sequence pair**. The input and output **lengths** can be **different** as well. After training, we will feed the model with input sequence of test, and let the model give the predicted final **output sequence**.

The code for generating the training data are as follows:

input_seq_len = 15 output_seq_len = 20 x = np.linspace(0, 30, 105) train_data_x = x[:(105-output_seq_len)] def true_signal(x): y = 2 * np.sin(x) return y def noise_func(x, noise_factor = 0.5): return np.random.randn(len(x)) * noise_factor def generate_y_values(x): return true_signal(x) + noise_func(x) def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = 15, output_seq_len = 20): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_y_values(x) for x in input_seq_x] output_seq_y = [generate_y_values(x) for x in output_seq_x] ## return shape: (batch_size, time_steps, feature_dims) return np.array(input_seq_y), np.array(output_seq_y)

Please note that we chose input and output sequence lengths to be **15** and **20**, respectively. The last 20 elements from the array of ‘x’ are reserved to **generate the test labels** later on after we made the predictions, so we are not using them for training.

To see what the training data looks like, we can sample and visualise one pair of them:

input_seq, output_seq = generate_train_samples(batch_size=100) l1, = plt.plot(range(15), input_seq[1], 'go', label = 'input sequence for one sample') l2, = plt.plot(range(15, 35), output_seq[1], 'ro', label = 'output sequence for one sample') plt.legend(handles = [l1, l2], loc = 'lower left') plt.show()

### 2. The seq2seq model and how we train it

What we will do next is similar to what’s depicted above. The seq2seq model contains two RNNs, e.g. LSTMs, which are the ‘encoder’ and ‘decoder’. Each box here is one RNN cell. The encoder will first run through the whole input sequence (the ‘A’, ‘B’, ‘C’ here), encode them into a fixed length vector – the last hidden state of encoder LSTM – and pass that vector to decoder to decode into the output sequence. (the ‘W’, ‘X’, ‘Y’, ‘Z)

The training approach we adopt here is called **‘guided’ **training: basically, during decoding steps, we will first fit a **‘GO’** token as **initial input to decoder (this can be a zeros vector)**, and subsequently, we will provide the correct input to the decoder at every time-step, even if the decoder made a mistake before. However, **during test or inference** time, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.

The code is below:

def _rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None): """RNN decoder for the sequence-to-sequence model. Args: decoder_inputs: A list of 2D Tensors [batch_size x input_size]. initial_state: 2D Tensor with shape [batch_size x cell.state_size]. cell: rnn_cell.RNNCell defining the cell function and size. loop_function: If not None, this function will be applied to the i-th output in order to generate the i+1-st input, and decoder_inputs will be ignored, except for the first element ("GO" symbol). This can be used for decoding, but also for training to emulate http://arxiv.org/abs/1506.03099. Signature -- loop_function(prev, i) = next * prev is a 2D Tensor of shape [batch_size x output_size], * i is an integer, the step number (when advanced control is needed), * next is a 2D Tensor of shape [batch_size x input_size]. scope: VariableScope for the created subgraph; defaults to "rnn_decoder". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing generated outputs. state: The state of each cell at the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. (Note that in some cases, like basic RNN cell or GRU cell, outputs and states can be the same. They are different for LSTM cells though.) """ with variable_scope.variable_scope(scope or "rnn_decoder"): state = initial_state outputs = [] prev = None for i, inp in enumerate(decoder_inputs): if loop_function is not None and prev is not None: with variable_scope.variable_scope("loop_function", reuse=True): inp = loop_function(prev, i) if i > 0: variable_scope.get_variable_scope().reuse_variables() output, state = cell(inp, state) outputs.append(output) if loop_function is not None: prev = output return outputs, state def _basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, feed_previous, dtype=dtypes.float32, scope=None): """Basic RNN sequence-to-sequence model. This model first runs an RNN to encode encoder_inputs into a state vector, then runs decoder, initialized with the last encoder state, on decoder_inputs. Encoder and decoder use the same RNN cell type, but don't share parameters. Args: encoder_inputs: A list of 2D Tensors [batch_size x input_size]. decoder_inputs: A list of 2D Tensors [batch_size x input_size]. feed_previous: Boolean; if True, only the first of decoder_inputs will be used (the "GO" symbol), all other inputs will be generated by the previous decoder output using _loop_function below. If False, decoder_inputs are used as given (the standard decoder case). dtype: The dtype of the initial state of the RNN cell (default: tf.float32). scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing the generated outputs. state: The state of each decoder cell in the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. """ with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"): enc_cell = copy.deepcopy(cell) _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) if feed_previous: return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function) else: return _rnn_decoder(decoder_inputs, enc_state, cell) def _loop_function(prev, _): '''Naive implementation of loop function for _rnn_decoder. Transform prev from dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be used as decoder input of next time step ''' return tf.matmul(prev, weights['out']) + biases['out']

This is basically the same function from the official seq2seq repo. However, I did make a change to the **basic_rnn_seq2seq** function by adding in the ‘**feed_previous**‘ feature, which has not yet been implemented in TensorFlow. This will be feeding decoder output at time t as input for time t+1, as discussed. So during the decoding of **inference** phase, only the first element of decoder_inputs will be used.

However, during training phase, **decoder_inputs** will still be used as input at each time step t.

The **_loop_function** simply serves as a util function to transform the decoder cell output into the dimensions of input, as the LSTM Cell may have different dimensions for both input and output. And we will define the **weights[‘out’]** and **biases[‘out’]** later on when we build the model.

Now with the core **_basic_rnn_seq2seq** function implemented, we can then move on to building the whole graph, including steps such as variables & placeholders declarations, formatting inputs & outputs, as well as computing loss and defining training ops.

Here’s the whole graph building:

from tensorflow.contrib import rnn from tensorflow.python.ops import variable_scope from tensorflow.python.framework import dtypes ## Parameters learning_rate = 0.01 lambda_l2_reg = 0.003 ## Network Parameters # length of input signals input_seq_len = 15 # length of output signals output_seq_len = 20 # size of LSTM Cell hidden_dim = 64 # num of input signals input_dim = 1 # num of output signals output_dim = 1 # num of stacked lstm layers num_stacked_layers = 2 # gradient clipping - to avoid gradient exploding GRADIENT_CLIPPING = 2.5 def build_graph(feed_previous = False): tf.reset_default_graph() global_step = tf.Variable( initial_value=0, name="global_step", trainable=False, collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) weights = { 'out': tf.get_variable('Weights_out', \ shape = [hidden_dim, output_dim], \ dtype = tf.float32, \ initializer = tf.truncated_normal_initializer()), } biases = { 'out': tf.get_variable('Biases_out', \ shape = [output_dim], \ dtype = tf.float32, \ initializer = tf.constant_initializer(0.)), } with tf.variable_scope('Seq2seq'): # Encoder: inputs enc_inp = [ tf.placeholder(tf.float32, shape=(None, input_dim), name="inp_{}".format(t)) for t in range(input_seq_len) ] # Decoder: target outputs target_seq = [ tf.placeholder(tf.float32, shape=(None, output_dim), name="y".format(t)) for t in range(output_seq_len) ] # Give a "GO" token to the decoder. # If dec_inp are fed into decoder as inputs, this is 'guided' training; otherwise only the # first element will be fed as decoder input which is then 'un-guided' dec_inp = [ tf.zeros_like(target_seq[0], dtype=tf.float32, name="GO") ] + target_seq[:-1] with tf.variable_scope('LSTMCell'): cells = [] for i in range(num_stacked_layers): with tf.variable_scope('RNN_{}'.format(i)): cells.append(tf.contrib.rnn.LSTMCell(hidden_dim)) cell = tf.contrib.rnn.MultiRNNCell(cells) def _rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None): """RNN decoder for the sequence-to-sequence model. Args: decoder_inputs: A list of 2D Tensors [batch_size x input_size]. initial_state: 2D Tensor with shape [batch_size x cell.state_size]. cell: rnn_cell.RNNCell defining the cell function and size. loop_function: If not None, this function will be applied to the i-th output in order to generate the i+1-st input, and decoder_inputs will be ignored, except for the first element ("GO" symbol). This can be used for decoding, but also for training to emulate http://arxiv.org/abs/1506.03099. Signature -- loop_function(prev, i) = next * prev is a 2D Tensor of shape [batch_size x output_size], * i is an integer, the step number (when advanced control is needed), * next is a 2D Tensor of shape [batch_size x input_size]. scope: VariableScope for the created subgraph; defaults to "rnn_decoder". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing generated outputs. state: The state of each cell at the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. (Note that in some cases, like basic RNN cell or GRU cell, outputs and states can be the same. They are different for LSTM cells though.) """ with variable_scope.variable_scope(scope or "rnn_decoder"): state = initial_state outputs = [] prev = None for i, inp in enumerate(decoder_inputs): if loop_function is not None and prev is not None: with variable_scope.variable_scope("loop_function", reuse=True): inp = loop_function(prev, i) if i > 0: variable_scope.get_variable_scope().reuse_variables() output, state = cell(inp, state) outputs.append(output) if loop_function is not None: prev = output return outputs, state def _basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, feed_previous, dtype=dtypes.float32, scope=None): """Basic RNN sequence-to-sequence model. This model first runs an RNN to encode encoder_inputs into a state vector, then runs decoder, initialized with the last encoder state, on decoder_inputs. Encoder and decoder use the same RNN cell type, but don't share parameters. Args: encoder_inputs: A list of 2D Tensors [batch_size x input_size]. decoder_inputs: A list of 2D Tensors [batch_size x input_size]. feed_previous: Boolean; if True, only the first of decoder_inputs will be used (the "GO" symbol), all other inputs will be generated by the previous decoder output using _loop_function below. If False, decoder_inputs are used as given (the standard decoder case). dtype: The dtype of the initial state of the RNN cell (default: tf.float32). scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing the generated outputs. state: The state of each decoder cell in the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. """ with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"): enc_cell = copy.deepcopy(cell) _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) if feed_previous: return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function) else: return _rnn_decoder(decoder_inputs, enc_state, cell) def _loop_function(prev, _): '''Naive implementation of loop function for _rnn_decoder. Transform prev from dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be used as decoder input of next time step ''' return tf.matmul(prev, weights['out']) + biases['out'] dec_outputs, dec_memory = _basic_rnn_seq2seq( enc_inp, dec_inp, cell, feed_previous = feed_previous ) reshaped_outputs = [tf.matmul(i, weights['out']) + biases['out'] for i in dec_outputs] # Training loss and optimizer with tf.variable_scope('Loss'): # L2 loss output_loss = 0 for _y, _Y in zip(reshaped_outputs, target_seq): output_loss += tf.reduce_mean(tf.pow(_y - _Y, 2)) # L2 regularization for weights and biases reg_loss = 0 for tf_var in tf.trainable_variables(): if 'Biases_' in tf_var.name or 'Weights_' in tf_var.name: reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var)) loss = output_loss + lambda_l2_reg * reg_loss with tf.variable_scope('Optimizer'): optimizer = tf.contrib.layers.optimize_loss( loss=loss, learning_rate=learning_rate, global_step=global_step, optimizer='Adam', clip_gradients=GRADIENT_CLIPPING) saver = tf.train.Saver return dict( enc_inp = enc_inp, target_seq = target_seq, train_op = optimizer, loss=loss, saver = saver, reshaped_outputs = reshaped_outputs, )

The ‘**feed_previous**‘ will be set to **True** during inference phase, and **False** during training.

The input and output dims are both 1s for our univariate case, and input and output sequences can have different lengths in this case.

### 3. Train the model

We will use the ‘guided’ training approach as discussed previously:

total_iteractions = 100 batch_size = 16 KEEP_RATE = 0.5 train_losses = [] val_losses = [] x = np.linspace(0, 30, 105) train_data_x = x[:85] rnn_model = build_graph(feed_previous=False) saver = tf.train.Saver() init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) for i in range(total_iteractions): batch_input, batch_output = generate_train_samples(batch_size=batch_size) feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t].reshape(-1,input_dim) for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t].reshape(-1,output_dim) for t in range(output_seq_len)}) _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict) print(loss_t) temp_saver = rnn_model['saver']() save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0')) print("Checkpoint saved at: ", save_path)

And we are seeing our training loss decreases as expected:

41.1176 15.1229 38.0447 17.5147 10.7143 15.86 11.0018 9.36498 8.26214 ... 6.23474 5.99408 6.12859 5.7535 5.7275 6.19146 Checkpoint saved at: /home/weimin/seq2seq/univariate_ts_model0

### 4. Inference and visualization

For prediction, we will use the last sequence of 15 values from **train_data_x** as input signal, and let the model predict without feeding in the true labels (by setting ‘**feed_previous**‘ to True)

test_seq_input = true_signal(train_data_x[-15:]) rnn_model = build_graph(feed_previous=True) init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) saver = rnn_model['saver']().restore(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0')) feed_dict = {rnn_model['enc_inp'][t]: test_seq_input[t].reshape(1,1) for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: np.zeros([1, output_dim]) for t in range(output_seq_len)}) final_preds = sess.run(rnn_model['reshaped_outputs'], feed_dict) final_preds = np.concatenate(final_preds, axis = 1)

We will then visualize the predictions against the true labels below:

l1, = plt.plot(range(85), true_signal(train_data_x[:85]), label = 'Training truth') l2, = plt.plot(range(85, 105), y[85:], 'yo', label = 'Test truth') l3, = plt.plot(range(85, 105), final_preds.reshape(-1), 'ro', label = 'Test predictions') plt.legend(handles = [l1, l2, l3], loc = 'lower left') plt.show()

Through a few iterations of learning, the model has already figured out the hidden signal from the noisy training data, which is amazing.

## Multivariate case

This one should be more exciting!

So what if we have more than one signals for input and output? And this is especially true for real world scenario. Let’s say we would like to predict the supply & demand for the next 30 minutes; what affect the results may not only be the past supply & demand, but also for things like weather, rain, temperature, app usages, traffic signals, whether it is a public holiday, etc.

### 1. Simulate input and output signals

We start with two cosine and sine signals **x1** and **x2** – these are our outputs to predict! From them, we derive three additional signals – **y1, y2, y2** – as our input sequences, using some random formula as below:

x = np.linspace(0, 40, 130) x1 = 2*np.sin(x) x2 = 2*np.cos(x) ## random formula to generate input sequences y1 = 1.6*x1**4 - 2*x2 - 10 y2 = 1.2*x2**2*x1 + 6*x2 - 6*x1 y3 = 2*x1**3 + 2*x2**3 - x1*x2

So overall, the structure will be like **3-in-2-out**.

We can also visualize inputs as follow:

Put it another way, we want our model to **uncover** the hidden signals from the observed ones, as plotted below:

This is more interesting as we have reached stage of being flexible both on the **number of signals** as well as **length of sequences**.

You are free to explore different number of input and output sequences, if you want to.

### 2. Sampling training data

Let’s rewrite data sampling mechanism:

input_seq_len = 15 output_seq_len = 20 x = np.linspace(0, 40, 130) train_data_x = x[:(130 - output_seq_len)] def true_output_signals(x): x1 = 2 * np.sin(x) x2 = 2 * np.cos(x) return x1, x2 def true_input_signals(x): x1, x2 = true_output_signals(x) y1 = 1.6*x1**4 - 2*x2 - 10 y2 = 1.2*x2**2 * x1 + 2*x2*3 - x1*6 y3 = 2*x1**3 + 2*x2**3 - x1*x2 return y1, y2, y3 def noise_func(x, noise_factor = 2): return np.random.randn(len(x)) * noise_factor def generate_samples_for_output(x): x1, x2 = true_output_signals(x) return x1+noise_func(x1, 0.5), \ x2+noise_func(x2, 0.5) def generate_samples_for_input(x): y1, y2, y3 = true_input_signals(x) return y1+noise_func(y1, 2), \ y2+noise_func(y2, 2), \ y3+noise_func(y3, 2) def generate_train_samples(x = train_data_x, batch_size = 10): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_samples_for_input(x) for x in input_seq_x] output_seq_y = [generate_samples_for_output(x) for x in output_seq_x] ## return shape: (batch_size, time_steps, feature_dims) return np.array(input_seq_y).transpose(0, 2, 1), np.array(output_seq_y).transpose(0, 2, 1)

As usual, we can visualize** one training sample**:

### 3. Build the model

Luckily, we don’t need to change too much of the code for building the graph. What we need to change are only the **input_dim = 3** and **output_dim** **= 2**.

I will skip pasting the same codes here.

### 4. Train the model

Training is also no much different. We will again set the ‘**feed_previous**‘ as **False** for ‘guided’ training – feeding in the correct decoder input at each time step. So here’s the code:

total_iteractions = 100 batch_size = 16 KEEP_RATE = 0.5 train_losses = [] val_losses = [] x = np.linspace(0, 40, 130) train_data_x = x[:110] rnn_model = build_graph(feed_previous=False) saver = tf.train.Saver() init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) print("Training losses: ") for i in range(total_iteractions): batch_input, batch_output = generate_train_samples(batch_size=batch_size) feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t] for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t] for t in range(output_seq_len)}) _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict) print(loss_t) temp_saver = rnn_model['saver']() save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'multivariate_ts_model0')) print("Checkpoint saved at: ", save_path)

### 5. Inference and visualisation

Once training is done, we can visualise the prediction performance on the test output sequence.

To put them together, we can also visualize both the input and output sequences at the same time (I scaled up the output sequences just to make them more obvious):

## Outliers / Extreme events

### 1. What could be an outlier / extreme event case?

This will be very useful for companies when are dealing with real time traffic, for example. There are many factors that could be the triggers for anomaly events, for example, **the sudden rain of the day** could lead to surge of demand of taxis on the road. **Public holidays** could be another factor which might lead to increase in demands for the whole day.

### 2. Case & data simulation

Let’s assume our traffic patterns are like below, where y-axis is the number of traffic, and x-axis is the days.

Image that during public holidays, we will notice an additional 2 units of traffic will be added on top of what we have for that day, as reflected by those sharp peaks in the graph.

Of course, the real data will have noise for both traffic on normal days as well as on public holidays (which means the true effect of PH – 2 units – is unknown to us, which the model needs to learn through the data). What we do know, however, is whether each day is a public holiday (1) or not (0). Here’s the code for data generation:

input_seq_len = 15 output_seq_len = 20 extreme_factor = 2 # 2 additional units of traffic increase if extreme / outlier x = np.linspace(0, 60, 210) train_data_x = x[:190] np.random.seed(10) num_events_train = 40 # total number of extreme events in training data train_events_bool = np.zeros(190) train_events_x = np.random.choice(range(190), num_events_train) train_events_bool[train_events_x] = 1 num_events_test = 1 # total number of extreme events in test data test_events_bool = np.zeros(20) test_events_x = np.random.choice(range(20), num_events_test) test_events_bool[test_events_bool] = 1 def true_signal(x): y = 2 * np.sin(x) return y def noise_func(x, noise_factor = 1): return np.random.randn(len(x)) * noise_factor def generate_y_values(x, event_bool): return true_signal(x) + noise_func(x) + extreme_factor * event_bool def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len, train_events_bool = train_events_bool): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] input_seq_event_bool = [train_events_bool[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] output_seq_event_bool = [train_events_bool[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(input_seq_x, input_seq_event_bool)] output_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(output_seq_x, output_seq_event_bool)] #batch_x = np.array([[true_signal()]]) return np.array(input_seq_y), np.array(output_seq_y), np.array(input_seq_event_bool), np.array(output_seq_event_bool)

As always, we can sample a training data to see what it looks like:

input_seq, output_seq, input_seq_event_bool, output_seq_event_bool = generate_train_samples(batch_size=10) l1, = plt.plot(range(15), input_seq[0], 'go', label = 'input sequence for one sample') l2, = plt.plot(range(15, 35), output_seq[0], 'ro', label = 'output sequence for one sample') plt.legend(handles = [l1, l2], loc = 'lower left') plt.show()

Quite noisy to see the extreme points though. However, you can simply turn off the noise factor (set it to zero) to see clearly the extreme events. But I will skip this verification step here just to save the space.

### 3. If we train the same seq2seq as before …

Now if we use the simulated data to train our seq2seq model previously, we won’t get a very good result. The one outlier in the test data will not be captured by the model. (It’s easy to understand this as the extreme events are totally random, and model has no idea which days they will fall on for both training and test phases.)

So if you test the model, you should see something similar to this:

**The model has poor fit for test data.****The model has no awareness / differentiation on the outlier.**

### 4. Re-design the seq2seq

How do we redesign the model to deal with extreme events?

**Since we already have the knowledge of whether each day is a public holiday, we can pass that ‘knowledge’ to the decoder at each time step i, to let the decoder learn the differentiation between normal days and public holidays. **

In summary, here are the steps:

- Get the bool vector for output sequence (‘1’ for public holiday, ‘0’ for normal days)
- During decoding, at each time step t, concatenate the bool value to the original decoder input as new input
- Feed the new input to decoder to generate output at t
- Continue until the decoding phase is done

We won’t change the rest parts such as ‘guided’ training, and we also won’t touch the encoding phase. With these steps implemented, we should be able to see our improved results on test data:

Overall it fits well, and most importantly, the outlier point was perfectly captured!

Please refer to my github page for complete codes and notebook.

## Case study on Beijing PM2.5 pollution forecasting

### 1. The case introduction.

Finally, we will apply what we learnt on a real-world data set – the hourly data set that contains the PM2.5 data of US Embassy in Beijing.

If can download the data from here. The complete list of features included in the raw data set is:

**No**: row number**year**: year of data in this row**month**: month of data in this row**day**: day of data in this row**hour**: hour of data in this row**pm2.5**: PM2.5 concentration**DEWP**: Dew Point**TEMP**: Temperature**PRES**: Pressure**cbwd**: Combined wind direction**Iws**: Cumulated wind speed**Is**: Cumulated hours of snow**Ir**: Cumulated hours of rain

Our task here, however, is only to predict the pollution factor – **pm2.5** – for the next few hours, based on historical trends of other features. In short, it’s going to be a multi-variate case with **n-input-1-ouput**.

### 2. Visualisation and pre-processing of data

A quick df.head() function plus simple plots give us an intuitive understanding of the data set.

All features plotted above are numeric / continuous. However, there’s one more called **cbwd **which is categorical, because it is referring to the wind directions. We will do **one-hot encoding** for that. I will not be using data such as year, month, day or hour as my features. However, you are encouraged to add in those as features as well, to see if those might help your performance.

Further more, there are no NAs values for all features **except for** **pm2.5 – **there are **2067** out of **43824** are **NAs** in pm2.5. And we will simply fill NAs with 0.

Regarding **train and test splits** – I will use **last one month data for test**, and all data before that as my training sample. Feel free to split according to your preferences.

Finally, I will do a simply processing step to normalize all data using **z-score**.

The overall scripts are below:

import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns df = pd.read_csv('./PRSA_data_2010.1.1-2014.12.31.csv') print(df.head()) cols_to_plot = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"] i = 1 # plot each column plt.figure(figsize = (10,12)) for col in cols_to_plot: plt.subplot(len(cols_to_plot), 1, i) plt.plot(df[col]) plt.title(col, y=0.5, loc='left') i += 1 plt.show() ## Fill NA with 0 #print(df.isnull().sum()) df.fillna(0, inplace = True) ## One-hot encode 'cbwd' temp = pd.get_dummies(df['cbwd'], prefix='cbwd') df = pd.concat([df, temp], axis = 1) del df['cbwd'], temp ## Split into train and test - I used the last 1 month data as test, but it's up to you to decide the ratio df_train = df.iloc[:(-31*24), :].copy() df_test = df.iloc[-31*24:, :].copy() ## take out the useful columns for modeling - you may also keep 'hour', 'day' or 'month' and to see if that will improve your accuracy X_train = df_train.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy() X_test = df_test.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy() y_train = df_train['pm2.5'].values.copy().reshape(-1, 1) y_test = df_test['pm2.5'].values.copy().reshape(-1, 1) ## z-score transform x - not including those one-how columns! for i in range(X_train.shape[1]-4): temp_mean = X_train[:, i].mean() temp_std = X_train[:, i].std() X_train[:, i] = (X_train[:, i] - temp_mean) / temp_std X_test[:, i] = (X_test[:, i] - temp_mean) / temp_std ## z-score transform y y_mean = y_train.mean() y_std = y_train.std() y_train = (y_train - y_mean) / y_std y_test = (y_test - y_mean) / y_std

### 3. Transforming training and test data into 3-D formats.

Now we have X_train, y_train, X_test, y_test as our train and test data sets, we will then need to transform them into 3-D formats for time series –** (batch_size, time_steps, feature_dim)**

We will prepare two util functions to do so:

input_seq_len = 30 output_seq_len = 5 def generate_train_samples(x = X_train, y = y_train, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size, replace = False) input_batch_idxs = [list(range(i, i+input_seq_len)) for i in start_x_idx] input_seq = np.take(x, input_batch_idxs, axis = 0) output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in start_x_idx] output_seq = np.take(y, output_batch_idxs, axis = 0) return input_seq, output_seq # in shape: (batch_size, time_steps, feature_dim) def generate_test_samples(x = X_test, y = y_test, input_seq_len = input_seq_len, output_seq_len = output_seq_len): total_samples = x.shape[0] input_batch_idxs = [list(range(i, i+input_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))] input_seq = np.take(x, input_batch_idxs, axis = 0) output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))] output_seq = np.take(y, output_batch_idxs, axis = 0) return input_seq, output_seq

So generate_train_samples will randomly sample batches from training, and generate_test_samples will simply convert all test data into 3-D formats in sequence. We can print out the shapes of the processed data:

x, y = generate_train_samples() print(x.shape, y.shape) test_x, test_y = generate_test_samples() print(test_x.shape, test_y.shape)

(10, 30, 11) (10, 5, 1)

(709, 30, 11) (709, 5, 1)

### 4. Train the model and visualize the results on test.

The model to use will be no different than the multi-variate one we used before. So I will not duplicate that part here.

Training part is also similar, we can set **feed_previous** to **False** for guided training.

After training is done, we will visualize the results over the last month to evaluate our prediction performance, as below:

Well, slightly off but not too bad :) The orange is predicted and blue is actual value for each hour. I believe with some tweaks on the hyperparameters one might get a better results!

## Any thoughts?

In this tutorial, we have demonstrated how to use sequence to sequence architecture to learn and predict Time Series signals. We learnt that seq2seq model really has the power to *sift through the noisy information*, *uncover the true patterns*, and eventually comes down to *discovering the hidden signals* that are actually governing the data generation mechanism.

We have seen that the model actually performs equally well on both univariate as well as multivariate cases, with the additional capability of encoding and decoding variable lengths of signals. We have also witnessed the flexibility of re-designing the seq2seq model to take care of the extreme events scenario, which is powerful predict the ‘unpredictability’ of the real world.

What other problems may be more challenging and useful to tackle in your business settings? Please kindly share your ideas or thoughts, and we will see how to design the seq2seq to fit your needs.

### References

**TensorFlow official repo –** https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py

**Another excellent blog about seq2seq for Time Series – **https://github.com/guillaume-chevalier/seq2seq-signal-prediction

**If you would like to try Keras – ** https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

Great post. I am a beginner in Data Science/Machine Learning but I can’t wait to deepen my knowledge in multi variate time-series forecasts. It appears to be a very promising field. I’m going to print your post and read it carefully multiple times until I’ll be ready to deploy my own neural network with tensorflow.

LikeLike

Glad to hear that! I am happy that this post might help you in some way and wish you great progress in ML!

LikeLike

Hi Weimin,

This is an excellent post, thank you!

Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?

This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.

Regards,

Michael J

LikeLike

Hi Weimin,

This is an excellent post, thank you!

Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?

This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.

Regards,

Michael J

LikeLike

Hi Michael,

Thanks for your comment. I would strongly encourage you to try this and see whether this can improve the forecast accuracy compared to model with no extra input, and share your results:). Yes indeed there are other possible inputs to provide in addition to boolean values like you said. And it will also depend on the data set.

Thanks,

Weimin

LikeLike

I think I’ll give it a shot!

I found a paper that uses the same technique, passing in the day, week and hour to the decoder:

https://arxiv.org/abs/1610.09460

It looks like they achieved good results.

LikeLike

Thanks for the sharing. I checked the paper and it was quite an interesting case.

I think I may also try this dataset to forecast weather polution – https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

If I can get some interesting results, I will add it in to the post too. Stay tuned!

LikeLike

Hi Weimin,

Congrats for this clear and very useful post !

Do you think the attention wrapper could be added ? I mean, does it make sense ?

Many thanks

LikeLike

Thanks for your comments!

I thought about attention before but I feel it may not help much. Because unlike translation problem where you have alignment issues, the time series output here does not have similar patterns where the earlier part of output sequence is more relevant to earlier part of input, and so on.

LikeLike

Weimin,

This is the best time series tutorial I’ve seen so far, you just didn’t show a sine wave with noise in a 1-to-1 prediction. You went a lot further!

Thanks for providing the code, keep up with the good job!

LikeLike

Glad you like it!

LikeLike

Weimin,

Thanks for your excellent article. I tried to use your method for my time series prediction and I am running into overfitting problems (works great on training data but poorly on test data). I played around with number of neurons and lamda values. I read that randomly dropping out neurons might help me. Any thoughts on that ?

LikeLike

It might be due to your training samples are too small or your features are too simple.

And yes, adding dropout may help as well.

LikeLike

Thanks. Is it too much to ask for an example how dropouts can be implemented in your codes ?

LikeLike

Hi Weimin, Thanks for sharing the code and tutorial.

If the problem is pure sequence to sequence mapping, can this model be used to solve the problem. For instance, one training pair sample is like x = [0.001, -0.11, 200.001], and y =[0.00, 0.123, 0.00]. Physically, y can be considered as a functional mapping of x, but y does not have a time dependence relationship as x. This is different with the data set that you are using.

The test case will be also a string like test = [-200, 0.0023, -120], and we would like to get mapping string of this “test” string. In other words, the use case is more like pure sequence-to-sequence mapping without time factor. Can this code be used to solve this kind of use case?

LikeLike

I think so, as long as both input and output sequences are ordered list of elements. The reason it solves time series problem well is also due to the ordering natural of data – the model don’t really know whether it is every second or every minute, as long as it is ordered.

LikeLike

Weimin, Thanks for your post and sharing your code.

I am just a beginner who has just started.

I tested the code by changing only the input data in your code.

However, the resulting graph was very different.

I am still studying, but I do not know why.

So if you can contact me at the below email address, can you tell me the problem of the code I sent you?

My E-mail Address is onlytheworld1@gmail.com.

I actually use the same code you shared, I don’t know why the predictions are too different.

Please help me.

Thank you.

LikeLike

Hi, if you use different data sets, please make sure you have normalized / scaled them correctly before you feed them to the model.

LikeLike

Hi Weiming, in your model, there are several parameters that define the model architecture and training process

learning_rate = 0.01

lambda_l2_reg = 0.003

# size of LSTM Cell

hidden_dim = 64

# num of stacked lstm layers

num_stacked_layers = 2

# gradient clipping – to avoid gradient exploding

GRADIENT_CLIPPING = 2.5

Are there any generic rules and experiences that you can share for modifying these parameters.

For instance, there has a use case, where both input length and output length are 120, there are about 2000 training pairs, what may be a reasonable

LikeLike

I am not aware of generic rule to select the hyper-parameters. I would recommend using validation / CV to help guide you on params selection. The number of stacked lstm layers and size of LSTM cell depend on the complexity of your data set. And you can start the remaining params by using my values as default.

LikeLike

Mr. Wang,

What a great guide. I have — rather unsuccessfully — attempted to train a model to predict speed-profiles (i.e. time-series depicting the speed of a vehicle) using ordinary RNN networks. I have been looking all over the internet for alternatives and I think sequence-to-sequence models might be the next thing to try.

I wonder if you could give me some input on a implementation specific issue that I have. I am trying to implement a model which takes as input a single sample and outputs “all remaining samples”. The idea is that a vehicle has an initial speed v0, which is fed to the model. The model returns [v1, v2, …, vN]. At the next step in time, the model gets the true value of v1 (because we have it stored during training, and sample it in real time during inference) and returns [v2, v3, …, vN, 0].

My plan is to use a model like the one in this sketch -> https://i.imgur.com/otVdYgR.jpg <- but I lack in understading of how to implement such a model. I have the fullest understanding for you not having time to dwell into this, but if you find this interesting I would greatly appreciate any comments of yours regarding either feasibility of the model/task or actual ideas regarding how one could achieve such a model.

LikeLike

Interesting idea. I understood that you have the current / previous value of speed as input to your model at each time stamp t to predict for the next m time stamps, but since this is sequence to sequence, why don’t you consider using previous n speed values (v0, v1, …, vn-1) as input for the model instead, to predict the speed values for the next m time stamps (vn, vn+1, …, vn+m-1)?

Also, you might also consider having more features in addition to speed value to feed into your model. For example, engine power, acceleration, petrol level, etc.. Am just guessing.

LikeLike

Dear Mr. Wang,

Thank you very much for your post. I have first a general question: am I wrong or are the only initialized weights and biases those that pertain to the decoder layer? Don’t we have to initialize randomly the input weights too? What does the encoder do exactly in this type of architecture? Is it a sort of “autoencoder feature extractor” that selects only the relevant inputs from the data and feeds it to the decoder?

Thank you.

LikeLike

The LSTM cells in both encoder and decoder are initialised with the same type and shape, but their weights are not shared during training. In addition, we also initialise transforming weight and biase, as in my `weights` and `biases`. The input gate weights are initialized in LSTM cells as well.

LikeLike

Thank you very much for your reply. So the different weights of all LSTM cells are initialized the same way as Weight-out and biases_out, although we do not do it explicitly cell by cell.

What about the encoder role please? Is it the same as a feature extractor than extracts the relevant information and stores them in a vector state before feeding them to the decoder?

Regards

LikeLike

The LSTM cells’ weights matrix are constructed in the graph upon calling of `_basic_rnn_seq2seq` function – the rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) does the magic. It will do so for both encoder and decoder. However, all variables are initialized only when you executed tf.global_variables_initializer() in your session.

LikeLike

Also, I am facing a cery stange issue with my dataset. the cost during training phase decreases and eventually drops to zero however the weights and biases I restore from the Saver are all equal to zero (the weights have been initialized randomly and the data has been scaled between -1 and 1). Has someone already faced this issue?

Thank you very much for you help.

LikeLike

Dear all,

I have an issue and connot find the solution, the test loss does not decrease at all and all my weights at the end of the training phase are zéros. I am using the logique above with my own dataset after normalization. The trainig is done on the first two years and the test phase on the last year. I use a batch size of 6 months for training and testing and an epoch number of 1000, 5000 ….

Cannot past by loss graph but hope it’s clear…

Help please !!

LikeLike

Hello, Thank you this nice tutorial. Could you please explain the difference (and the Relationship) between input_size and batch_size parameters?

regards

LikeLike

input_size: dimension of your input data. i.e. number of input time series signals.

batch_size: number of training samples in each training batch.

each training sample will have fixed number of time steps, and at each time step, it has a fixed dimension which is input_size.

LikeLike

Thank you Weimin for this great post.

I have to one question:

Similar to the above question, does this batch_size means the number of input samples into the encoder(with 30 timesteps)? To my knowledge, this batch_size in the article is not the same as the batch size that we usually use while training a common feedforward neural network. Am i right?

In the pollution case, you use 16 as the batch size, that means that the number of training data is only 16 with timesteps of encoder as 30 and timesteps of decoder as 5. It seems the number of training data choosen from {len(X_train) – input_seq_len – output_seq_len = 43824-30-5} is too less, comparing to the complete length of the whole dateset, for forecasting the next 5 steps. How come it can show such good results? I’m very confused.

Can you help to explain this? Thank you very much.

LikeLike

How well do these models tend to work on prediction of multiple different-length time series? For example, if your training data is a set of 1000 time series, each of varying lengths (from two time points to a million time points) and each with n-number of features that are vectors of the same length time series.

LikeLike

Hi,

Great Blog. I learned a lot.

I added 2 dropout layers and the mse lowered for me.

If its possible, can you add a tutorial for seq2seq in keras (with attention ) ?

LikeLike

Glad that you like it.

Will try to do keras in the future, but for now I believe there are definitely good ones out there you can find. The reason I used tf instead of keras for this one is because I can get down to the lower level details of the model, which will be a good learning case.

LikeLike