In this tutorial, I am excited to showcase examples of **building Time Series forecasting model with seq2seq** in TensorFlow. The purpose of this post is to give an intuitive as well as technical understanding of the implementations, and to demonstrate the two useful features under the hood:

**Multivariate input and output signals****Variable input and output sequence lengths**

After that, I will demonstrate how to apply what we have covered on something very interesting – **to** **predict extreme events / outliers. **

Finally, we will end this tutorial by applying what we learnt into a real world case – **forecasting Beijing pm2.5 pollution.**

At any time, please feel free to jump to python notebook at my github if you want to skip reading.

Organisational structures:

- Univariate case
- Multivariate case
- Predict extreme events / outliers
- Case study on real world data set – forecasting Beijing PM2.5 pollution

## A Univariate case

### 1. Data generation approach

Let’s say we want to learn the pattern of a sinusoidal wave like below:

However, the real world data might be way more noisy than this, as shown below:

So we will sample data in batch from training, with each data as an **input-output-sequence pair**. The input and output **lengths** can be **different** as well. After training, we will feed the model with input sequence of test, and let the model give the predicted final **output sequence**.

The code for generating the training data are as follows:

input_seq_len = 15 output_seq_len = 20 x = np.linspace(0, 30, 105) train_data_x = x[:(105-output_seq_len)] def true_signal(x): y = 2 * np.sin(x) return y def noise_func(x, noise_factor = 0.5): return np.random.randn(len(x)) * noise_factor def generate_y_values(x): return true_signal(x) + noise_func(x) def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = 15, output_seq_len = 20): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_y_values(x) for x in input_seq_x] output_seq_y = [generate_y_values(x) for x in output_seq_x] ## return shape: (batch_size, time_steps, feature_dims) return np.array(input_seq_y), np.array(output_seq_y)

Please note that we chose input and output sequence lengths to be **15** and **20**, respectively. The last 20 elements from the array of ‘x’ are reserved to **generate the test labels** later on after we made the predictions, so we are not using them for training.

To see what the training data looks like, we can sample and visualise one pair of them:

input_seq, output_seq = generate_train_samples(batch_size=100) l1, = plt.plot(range(15), input_seq[1], 'go', label = 'input sequence for one sample') l2, = plt.plot(range(15, 35), output_seq[1], 'ro', label = 'output sequence for one sample') plt.legend(handles = [l1, l2], loc = 'lower left') plt.show()

### 2. The seq2seq model and how we train it

What we will do next is similar to what’s depicted above. The seq2seq model contains two RNNs, e.g. LSTMs, which are the ‘encoder’ and ‘decoder’. Each box here is one RNN cell. The encoder will first run through the whole input sequence (the ‘A’, ‘B’, ‘C’ here), encode them into a fixed length vector – the last hidden state of encoder LSTM – and pass that vector to decoder to decode into the output sequence. (the ‘W’, ‘X’, ‘Y’, ‘Z)

The training approach we adopt here is called **‘guided’ **training: basically, during decoding steps, we will first fit a **‘GO’** token as **initial input to decoder (this can be a zeros vector)**, and subsequently, we will provide the correct input to the decoder at every time-step, even if the decoder made a mistake before. However, **during test or inference** time, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.

The code is below:

def _rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None): """RNN decoder for the sequence-to-sequence model. Args: decoder_inputs: A list of 2D Tensors [batch_size x input_size]. initial_state: 2D Tensor with shape [batch_size x cell.state_size]. cell: rnn_cell.RNNCell defining the cell function and size. loop_function: If not None, this function will be applied to the i-th output in order to generate the i+1-st input, and decoder_inputs will be ignored, except for the first element ("GO" symbol). This can be used for decoding, but also for training to emulate http://arxiv.org/abs/1506.03099. Signature -- loop_function(prev, i) = next * prev is a 2D Tensor of shape [batch_size x output_size], * i is an integer, the step number (when advanced control is needed), * next is a 2D Tensor of shape [batch_size x input_size]. scope: VariableScope for the created subgraph; defaults to "rnn_decoder". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing generated outputs. state: The state of each cell at the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. (Note that in some cases, like basic RNN cell or GRU cell, outputs and states can be the same. They are different for LSTM cells though.) """ with variable_scope.variable_scope(scope or "rnn_decoder"): state = initial_state outputs = [] prev = None for i, inp in enumerate(decoder_inputs): if loop_function is not None and prev is not None: with variable_scope.variable_scope("loop_function", reuse=True): inp = loop_function(prev, i) if i > 0: variable_scope.get_variable_scope().reuse_variables() output, state = cell(inp, state) outputs.append(output) if loop_function is not None: prev = output return outputs, state def _basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, feed_previous, dtype=dtypes.float32, scope=None): """Basic RNN sequence-to-sequence model. This model first runs an RNN to encode encoder_inputs into a state vector, then runs decoder, initialized with the last encoder state, on decoder_inputs. Encoder and decoder use the same RNN cell type, but don't share parameters. Args: encoder_inputs: A list of 2D Tensors [batch_size x input_size]. decoder_inputs: A list of 2D Tensors [batch_size x input_size]. feed_previous: Boolean; if True, only the first of decoder_inputs will be used (the "GO" symbol), all other inputs will be generated by the previous decoder output using _loop_function below. If False, decoder_inputs are used as given (the standard decoder case). dtype: The dtype of the initial state of the RNN cell (default: tf.float32). scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing the generated outputs. state: The state of each decoder cell in the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. """ with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"): enc_cell = copy.deepcopy(cell) _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) if feed_previous: return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function) else: return _rnn_decoder(decoder_inputs, enc_state, cell) def _loop_function(prev, _): '''Naive implementation of loop function for _rnn_decoder. Transform prev from dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be used as decoder input of next time step ''' return tf.matmul(prev, weights['out']) + biases['out']

This is basically the same function from the official seq2seq repo. However, I did make a change to the **basic_rnn_seq2seq** function by adding in the ‘**feed_previous**‘ feature, which has not yet been implemented in TensorFlow. This will be feeding decoder output at time t as input for time t+1, as discussed. So during the decoding of **inference** phase, only the first element of decoder_inputs will be used.

However, during training phase, **decoder_inputs** will still be used as input at each time step t.

The **_loop_function** simply serves as a util function to transform the decoder cell output into the dimensions of input, as the LSTM Cell may have different dimensions for both input and output. And we will define the **weights[‘out’]** and **biases[‘out’]** later on when we build the model.

Now with the core **_basic_rnn_seq2seq** function implemented, we can then move on to building the whole graph, including steps such as variables & placeholders declarations, formatting inputs & outputs, as well as computing loss and defining training ops.

Here’s the whole graph building:

from tensorflow.contrib import rnn from tensorflow.python.ops import variable_scope from tensorflow.python.framework import dtypes ## Parameters learning_rate = 0.01 lambda_l2_reg = 0.003 ## Network Parameters # length of input signals input_seq_len = 15 # length of output signals output_seq_len = 20 # size of LSTM Cell hidden_dim = 64 # num of input signals input_dim = 1 # num of output signals output_dim = 1 # num of stacked lstm layers num_stacked_layers = 2 # gradient clipping - to avoid gradient exploding GRADIENT_CLIPPING = 2.5 def build_graph(feed_previous = False): tf.reset_default_graph() global_step = tf.Variable( initial_value=0, name="global_step", trainable=False, collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) weights = { 'out': tf.get_variable('Weights_out', \ shape = [hidden_dim, output_dim], \ dtype = tf.float32, \ initializer = tf.truncated_normal_initializer()), } biases = { 'out': tf.get_variable('Biases_out', \ shape = [output_dim], \ dtype = tf.float32, \ initializer = tf.constant_initializer(0.)), } with tf.variable_scope('Seq2seq'): # Encoder: inputs enc_inp = [ tf.placeholder(tf.float32, shape=(None, input_dim), name="inp_{}".format(t)) for t in range(input_seq_len) ] # Decoder: target outputs target_seq = [ tf.placeholder(tf.float32, shape=(None, output_dim), name="y".format(t)) for t in range(output_seq_len) ] # Give a "GO" token to the decoder. # If dec_inp are fed into decoder as inputs, this is 'guided' training; otherwise only the # first element will be fed as decoder input which is then 'un-guided' dec_inp = [ tf.zeros_like(target_seq[0], dtype=tf.float32, name="GO") ] + target_seq[:-1] with tf.variable_scope('LSTMCell'): cells = [] for i in range(num_stacked_layers): with tf.variable_scope('RNN_{}'.format(i)): cells.append(tf.contrib.rnn.LSTMCell(hidden_dim)) cell = tf.contrib.rnn.MultiRNNCell(cells) def _rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None): """RNN decoder for the sequence-to-sequence model. Args: decoder_inputs: A list of 2D Tensors [batch_size x input_size]. initial_state: 2D Tensor with shape [batch_size x cell.state_size]. cell: rnn_cell.RNNCell defining the cell function and size. loop_function: If not None, this function will be applied to the i-th output in order to generate the i+1-st input, and decoder_inputs will be ignored, except for the first element ("GO" symbol). This can be used for decoding, but also for training to emulate http://arxiv.org/abs/1506.03099. Signature -- loop_function(prev, i) = next * prev is a 2D Tensor of shape [batch_size x output_size], * i is an integer, the step number (when advanced control is needed), * next is a 2D Tensor of shape [batch_size x input_size]. scope: VariableScope for the created subgraph; defaults to "rnn_decoder". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing generated outputs. state: The state of each cell at the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. (Note that in some cases, like basic RNN cell or GRU cell, outputs and states can be the same. They are different for LSTM cells though.) """ with variable_scope.variable_scope(scope or "rnn_decoder"): state = initial_state outputs = [] prev = None for i, inp in enumerate(decoder_inputs): if loop_function is not None and prev is not None: with variable_scope.variable_scope("loop_function", reuse=True): inp = loop_function(prev, i) if i > 0: variable_scope.get_variable_scope().reuse_variables() output, state = cell(inp, state) outputs.append(output) if loop_function is not None: prev = output return outputs, state def _basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, feed_previous, dtype=dtypes.float32, scope=None): """Basic RNN sequence-to-sequence model. This model first runs an RNN to encode encoder_inputs into a state vector, then runs decoder, initialized with the last encoder state, on decoder_inputs. Encoder and decoder use the same RNN cell type, but don't share parameters. Args: encoder_inputs: A list of 2D Tensors [batch_size x input_size]. decoder_inputs: A list of 2D Tensors [batch_size x input_size]. feed_previous: Boolean; if True, only the first of decoder_inputs will be used (the "GO" symbol), all other inputs will be generated by the previous decoder output using _loop_function below. If False, decoder_inputs are used as given (the standard decoder case). dtype: The dtype of the initial state of the RNN cell (default: tf.float32). scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq". Returns: A tuple of the form (outputs, state), where: outputs: A list of the same length as decoder_inputs of 2D Tensors with shape [batch_size x output_size] containing the generated outputs. state: The state of each decoder cell in the final time-step. It is a 2D Tensor of shape [batch_size x cell.state_size]. """ with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"): enc_cell = copy.deepcopy(cell) _, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype) if feed_previous: return _rnn_decoder(decoder_inputs, enc_state, cell, _loop_function) else: return _rnn_decoder(decoder_inputs, enc_state, cell) def _loop_function(prev, _): '''Naive implementation of loop function for _rnn_decoder. Transform prev from dimension [batch_size x hidden_dim] to [batch_size x output_dim], which will be used as decoder input of next time step ''' return tf.matmul(prev, weights['out']) + biases['out'] dec_outputs, dec_memory = _basic_rnn_seq2seq( enc_inp, dec_inp, cell, feed_previous = feed_previous ) reshaped_outputs = [tf.matmul(i, weights['out']) + biases['out'] for i in dec_outputs] # Training loss and optimizer with tf.variable_scope('Loss'): # L2 loss output_loss = 0 for _y, _Y in zip(reshaped_outputs, target_seq): output_loss += tf.reduce_mean(tf.pow(_y - _Y, 2)) # L2 regularization for weights and biases reg_loss = 0 for tf_var in tf.trainable_variables(): if 'Biases_' in tf_var.name or 'Weights_' in tf_var.name: reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var)) loss = output_loss + lambda_l2_reg * reg_loss with tf.variable_scope('Optimizer'): optimizer = tf.contrib.layers.optimize_loss( loss=loss, learning_rate=learning_rate, global_step=global_step, optimizer='Adam', clip_gradients=GRADIENT_CLIPPING) saver = tf.train.Saver return dict( enc_inp = enc_inp, target_seq = target_seq, train_op = optimizer, loss=loss, saver = saver, reshaped_outputs = reshaped_outputs, )

The ‘**feed_previous**‘ will be set to **True** during inference phase, and **False** during training.

The input and output dims are both 1s for our univariate case, and input and output sequences can have different lengths in this case.

### 3. Train the model

We will use the ‘guided’ training approach as discussed previously:

total_iteractions = 100 batch_size = 16 KEEP_RATE = 0.5 train_losses = [] val_losses = [] x = np.linspace(0, 30, 105) train_data_x = x[:85] rnn_model = build_graph(feed_previous=False) saver = tf.train.Saver() init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) for i in range(total_iteractions): batch_input, batch_output = generate_train_samples(batch_size=batch_size) feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t].reshape(-1,input_dim) for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t].reshape(-1,output_dim) for t in range(output_seq_len)}) _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict) print(loss_t) temp_saver = rnn_model['saver']() save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0')) print("Checkpoint saved at: ", save_path)

And we are seeing our training loss decreases as expected:

41.1176 15.1229 38.0447 17.5147 10.7143 15.86 11.0018 9.36498 8.26214 ... 6.23474 5.99408 6.12859 5.7535 5.7275 6.19146 Checkpoint saved at: /home/weimin/seq2seq/univariate_ts_model0

### 4. Inference and visualization

For prediction, we will use the last sequence of 15 values from **train_data_x** as input signal, and let the model predict without feeding in the true labels (by setting ‘**feed_previous**‘ to True)

test_seq_input = true_signal(train_data_x[-15:]) rnn_model = build_graph(feed_previous=True) init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) saver = rnn_model['saver']().restore(sess, os.path.join('/home/weimin/seq2seq', 'univariate_ts_model0')) feed_dict = {rnn_model['enc_inp'][t]: test_seq_input[t].reshape(1,1) for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: np.zeros([1, output_dim]) for t in range(output_seq_len)}) final_preds = sess.run(rnn_model['reshaped_outputs'], feed_dict) final_preds = np.concatenate(final_preds, axis = 1)

We will then visualize the predictions against the true labels below:

l1, = plt.plot(range(85), true_signal(train_data_x[:85]), label = 'Training truth') l2, = plt.plot(range(85, 105), y[85:], 'yo', label = 'Test truth') l3, = plt.plot(range(85, 105), final_preds.reshape(-1), 'ro', label = 'Test predictions') plt.legend(handles = [l1, l2, l3], loc = 'lower left') plt.show()

Through a few iterations of learning, the model has already figured out the hidden signal from the noisy training data, which is amazing.

## Multivariate case

This one should be more exciting!

So what if we have more than one signals for input and output? And this is especially true for real world scenario. Let’s say we would like to predict the supply & demand for the next 30 minutes; what affect the results may not only be the past supply & demand, but also for things like weather, rain, temperature, app usages, traffic signals, whether it is a public holiday, etc.

### 1. Simulate input and output signals

We start with two cosine and sine signals **x1** and **x2** – these are our outputs to predict! From them, we derive three additional signals – **y1, y2, y2** – as our input sequences, using some random formula as below:

x = np.linspace(0, 40, 130) x1 = 2*np.sin(x) x2 = 2*np.cos(x) ## random formula to generate input sequences y1 = 1.6*x1**4 - 2*x2 - 10 y2 = 1.2*x2**2*x1 + 6*x2 - 6*x1 y3 = 2*x1**3 + 2*x2**3 - x1*x2

So overall, the structure will be like **3-in-2-out**.

We can also visualize inputs as follow:

Put it another way, we want our model to **uncover** the hidden signals from the observed ones, as plotted below:

This is more interesting as we have reached stage of being flexible both on the **number of signals** as well as **length of sequences**.

You are free to explore different number of input and output sequences, if you want to.

### 2. Sampling training data

Let’s rewrite data sampling mechanism:

input_seq_len = 15 output_seq_len = 20 x = np.linspace(0, 40, 130) train_data_x = x[:(130 - output_seq_len)] def true_output_signals(x): x1 = 2 * np.sin(x) x2 = 2 * np.cos(x) return x1, x2 def true_input_signals(x): x1, x2 = true_output_signals(x) y1 = 1.6*x1**4 - 2*x2 - 10 y2 = 1.2*x2**2 * x1 + 2*x2*3 - x1*6 y3 = 2*x1**3 + 2*x2**3 - x1*x2 return y1, y2, y3 def noise_func(x, noise_factor = 2): return np.random.randn(len(x)) * noise_factor def generate_samples_for_output(x): x1, x2 = true_output_signals(x) return x1+noise_func(x1, 0.5), \ x2+noise_func(x2, 0.5) def generate_samples_for_input(x): y1, y2, y3 = true_input_signals(x) return y1+noise_func(y1, 2), \ y2+noise_func(y2, 2), \ y3+noise_func(y3, 2) def generate_train_samples(x = train_data_x, batch_size = 10): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_samples_for_input(x) for x in input_seq_x] output_seq_y = [generate_samples_for_output(x) for x in output_seq_x] ## return shape: (batch_size, time_steps, feature_dims) return np.array(input_seq_y).transpose(0, 2, 1), np.array(output_seq_y).transpose(0, 2, 1)

As usual, we can visualize** one training sample**:

### 3. Build the model

Luckily, we don’t need to change too much of the code for building the graph. What we need to change are only the **input_dim = 3** and **output_dim** **= 2**.

I will skip pasting the same codes here.

### 4. Train the model

Training is also no much different. We will again set the ‘**feed_previous**‘ as **False** for ‘guided’ training – feeding in the correct decoder input at each time step. So here’s the code:

total_iteractions = 100 batch_size = 16 KEEP_RATE = 0.5 train_losses = [] val_losses = [] x = np.linspace(0, 40, 130) train_data_x = x[:110] rnn_model = build_graph(feed_previous=False) saver = tf.train.Saver() init = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init) print("Training losses: ") for i in range(total_iteractions): batch_input, batch_output = generate_train_samples(batch_size=batch_size) feed_dict = {rnn_model['enc_inp'][t]: batch_input[:,t] for t in range(input_seq_len)} feed_dict.update({rnn_model['target_seq'][t]: batch_output[:,t] for t in range(output_seq_len)}) _, loss_t = sess.run([rnn_model['train_op'], rnn_model['loss']], feed_dict) print(loss_t) temp_saver = rnn_model['saver']() save_path = temp_saver.save(sess, os.path.join('/home/weimin/seq2seq', 'multivariate_ts_model0')) print("Checkpoint saved at: ", save_path)

### 5. Inference and visualisation

Once training is done, we can visualise the prediction performance on the test output sequence.

To put them together, we can also visualize both the input and output sequences at the same time (I scaled up the output sequences just to make them more obvious):

## Outliers / Extreme events

### 1. What could be an outlier / extreme event case?

This will be very useful for companies when are dealing with real time traffic, for example. There are many factors that could be the triggers for anomaly events, for example, **the sudden rain of the day** could lead to surge of demand of taxis on the road. **Public holidays** could be another factor which might lead to increase in demands for the whole day.

### 2. Case & data simulation

Let’s assume our traffic patterns are like below, where y-axis is the number of traffic, and x-axis is the days.

Image that during public holidays, we will notice an additional 2 units of traffic will be added on top of what we have for that day, as reflected by those sharp peaks in the graph.

Of course, the real data will have noise for both traffic on normal days as well as on public holidays (which means the true effect of PH – 2 units – is unknown to us, which the model needs to learn through the data). What we do know, however, is whether each day is a public holiday (1) or not (0). Here’s the code for data generation:

input_seq_len = 15 output_seq_len = 20 extreme_factor = 2 # 2 additional units of traffic increase if extreme / outlier x = np.linspace(0, 60, 210) train_data_x = x[:190] np.random.seed(10) num_events_train = 40 # total number of extreme events in training data train_events_bool = np.zeros(190) train_events_x = np.random.choice(range(190), num_events_train) train_events_bool[train_events_x] = 1 num_events_test = 1 # total number of extreme events in test data test_events_bool = np.zeros(20) test_events_x = np.random.choice(range(20), num_events_test) test_events_bool[test_events_bool] = 1 def true_signal(x): y = 2 * np.sin(x) return y def noise_func(x, noise_factor = 1): return np.random.randn(len(x)) * noise_factor def generate_y_values(x, event_bool): return true_signal(x) + noise_func(x) + extreme_factor * event_bool def generate_train_samples(x = train_data_x, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len, train_events_bool = train_events_bool): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size) input_seq_x = [x[i:(i+input_seq_len)] for i in start_x_idx] input_seq_event_bool = [train_events_bool[i:(i+input_seq_len)] for i in start_x_idx] output_seq_x = [x[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] output_seq_event_bool = [train_events_bool[(i+input_seq_len):(i+input_seq_len+output_seq_len)] for i in start_x_idx] input_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(input_seq_x, input_seq_event_bool)] output_seq_y = [generate_y_values(x, event_bool) for x, event_bool in zip(output_seq_x, output_seq_event_bool)] #batch_x = np.array([[true_signal()]]) return np.array(input_seq_y), np.array(output_seq_y), np.array(input_seq_event_bool), np.array(output_seq_event_bool)

As always, we can sample a training data to see what it looks like:

input_seq, output_seq, input_seq_event_bool, output_seq_event_bool = generate_train_samples(batch_size=10) l1, = plt.plot(range(15), input_seq[0], 'go', label = 'input sequence for one sample') l2, = plt.plot(range(15, 35), output_seq[0], 'ro', label = 'output sequence for one sample') plt.legend(handles = [l1, l2], loc = 'lower left') plt.show()

Quite noisy to see the extreme points though. However, you can simply turn off the noise factor (set it to zero) to see clearly the extreme events. But I will skip this verification step here just to save the space.

### 3. If we train the same seq2seq as before …

Now if we use the simulated data to train our seq2seq model previously, we won’t get a very good result. The one outlier in the test data will not be captured by the model. (It’s easy to understand this as the extreme events are totally random, and model has no idea which days they will fall on for both training and test phases.)

So if you test the model, you should see something similar to this:

**The model has poor fit for test data.****The model has no awareness / differentiation on the outlier.**

### 4. Re-design the seq2seq

How do we redesign the model to deal with extreme events?

**Since we already have the knowledge of whether each day is a public holiday, we can pass that ‘knowledge’ to the decoder at each time step i, to let the decoder learn the differentiation between normal days and public holidays. **

In summary, here are the steps:

- Get the bool vector for output sequence (‘1’ for public holiday, ‘0’ for normal days)
- During decoding, at each time step t, concatenate the bool value to the original decoder input as new input
- Feed the new input to decoder to generate output at t
- Continue until the decoding phase is done

We won’t change the rest parts such as ‘guided’ training, and we also won’t touch the encoding phase. With these steps implemented, we should be able to see our improved results on test data:

Overall it fits well, and most importantly, the outlier point was perfectly captured!

Please refer to my github page for complete codes and notebook.

## Case study on Beijing PM2.5 pollution forecasting

### 1. The case introduction.

Finally, we will apply what we learnt on a real-world data set – the hourly data set that contains the PM2.5 data of US Embassy in Beijing.

If can download the data from here. The complete list of features included in the raw data set is:

**No**: row number**year**: year of data in this row**month**: month of data in this row**day**: day of data in this row**hour**: hour of data in this row**pm2.5**: PM2.5 concentration**DEWP**: Dew Point**TEMP**: Temperature**PRES**: Pressure**cbwd**: Combined wind direction**Iws**: Cumulated wind speed**Is**: Cumulated hours of snow**Ir**: Cumulated hours of rain

Our task here, however, is only to predict the pollution factor – **pm2.5** – for the next few hours, based on historical trends of other features. In short, it’s going to be a multi-variate case with **n-input-1-ouput**.

### 2. Visualisation and pre-processing of data

A quick df.head() function plus simple plots give us an intuitive understanding of the data set.

All features plotted above are numeric / continuous. However, there’s one more called **cbwd **which is categorical, because it is referring to the wind directions. We will do **one-hot encoding** for that. I will not be using data such as year, month, day or hour as my features. However, you are encouraged to add in those as features as well, to see if those might help your performance.

Further more, there are no NAs values for all features **except for** **pm2.5 – **there are **2067** out of **43824** are **NAs** in pm2.5. And we will simply fill NAs with 0.

Regarding **train and test splits** – I will use **last one month data for test**, and all data before that as my training sample. Feel free to split according to your preferences.

Finally, I will do a simply processing step to normalize all data using **z-score**.

The overall scripts are below:

import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns df = pd.read_csv('./PRSA_data_2010.1.1-2014.12.31.csv') print(df.head()) cols_to_plot = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"] i = 1 # plot each column plt.figure(figsize = (10,12)) for col in cols_to_plot: plt.subplot(len(cols_to_plot), 1, i) plt.plot(df[col]) plt.title(col, y=0.5, loc='left') i += 1 plt.show() ## Fill NA with 0 #print(df.isnull().sum()) df.fillna(0, inplace = True) ## One-hot encode 'cbwd' temp = pd.get_dummies(df['cbwd'], prefix='cbwd') df = pd.concat([df, temp], axis = 1) del df['cbwd'], temp ## Split into train and test - I used the last 1 month data as test, but it's up to you to decide the ratio df_train = df.iloc[:(-31*24), :].copy() df_test = df.iloc[-31*24:, :].copy() ## take out the useful columns for modeling - you may also keep 'hour', 'day' or 'month' and to see if that will improve your accuracy X_train = df_train.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy() X_test = df_test.loc[:, ['pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv']].values.copy() y_train = df_train['pm2.5'].values.copy().reshape(-1, 1) y_test = df_test['pm2.5'].values.copy().reshape(-1, 1) ## z-score transform x - not including those one-how columns! for i in range(X_train.shape[1]-4): temp_mean = X_train[:, i].mean() temp_std = X_train[:, i].std() X_train[:, i] = (X_train[:, i] - temp_mean) / temp_std X_test[:, i] = (X_test[:, i] - temp_mean) / temp_std ## z-score transform y y_mean = y_train.mean() y_std = y_train.std() y_train = (y_train - y_mean) / y_std y_test = (y_test - y_mean) / y_std

### 3. Transforming training and test data into 3-D formats.

Now we have X_train, y_train, X_test, y_test as our train and test data sets, we will then need to transform them into 3-D formats for time series –** (batch_size, time_steps, feature_dim)**

We will prepare two util functions to do so:

input_seq_len = 30 output_seq_len = 5 def generate_train_samples(x = X_train, y = y_train, batch_size = 10, input_seq_len = input_seq_len, output_seq_len = output_seq_len): total_start_points = len(x) - input_seq_len - output_seq_len start_x_idx = np.random.choice(range(total_start_points), batch_size, replace = False) input_batch_idxs = [list(range(i, i+input_seq_len)) for i in start_x_idx] input_seq = np.take(x, input_batch_idxs, axis = 0) output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in start_x_idx] output_seq = np.take(y, output_batch_idxs, axis = 0) return input_seq, output_seq # in shape: (batch_size, time_steps, feature_dim) def generate_test_samples(x = X_test, y = y_test, input_seq_len = input_seq_len, output_seq_len = output_seq_len): total_samples = x.shape[0] input_batch_idxs = [list(range(i, i+input_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))] input_seq = np.take(x, input_batch_idxs, axis = 0) output_batch_idxs = [list(range(i+input_seq_len, i+input_seq_len+output_seq_len)) for i in range((total_samples-input_seq_len-output_seq_len))] output_seq = np.take(y, output_batch_idxs, axis = 0) return input_seq, output_seq

So generate_train_samples will randomly sample batches from training, and generate_test_samples will simply convert all test data into 3-D formats in sequence. We can print out the shapes of the processed data:

x, y = generate_train_samples() print(x.shape, y.shape) test_x, test_y = generate_test_samples() print(test_x.shape, test_y.shape)

(10, 30, 11) (10, 5, 1)

(709, 30, 11) (709, 5, 1)

### 4. Train the model and visualize the results on test.

The model to use will be no different than the multi-variate one we used before. So I will not duplicate that part here.

Training part is also similar, we can set **feed_previous** to **False** for guided training.

After training is done, we will visualize the results over the last month to evaluate our prediction performance, as below:

Well, slightly off but not too bad :) The orange is predicted and blue is actual value for each hour. I believe with some tweaks on the hyperparameters one might get a better results!

## Any thoughts?

In this tutorial, we have demonstrated how to use sequence to sequence architecture to learn and predict Time Series signals. We learnt that seq2seq model really has the power to *sift through the noisy information*, *uncover the true patterns*, and eventually comes down to *discovering the hidden signals* that are actually governing the data generation mechanism.

We have seen that the model actually performs equally well on both univariate as well as multivariate cases, with the additional capability of encoding and decoding variable lengths of signals. We have also witnessed the flexibility of re-designing the seq2seq model to take care of the extreme events scenario, which is powerful predict the ‘unpredictability’ of the real world.

What other problems may be more challenging and useful to tackle in your business settings? Please kindly share your ideas or thoughts, and we will see how to design the seq2seq to fit your needs.

### References

**TensorFlow official repo –** https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py

**Another excellent blog about seq2seq for Time Series – **https://github.com/guillaume-chevalier/seq2seq-signal-prediction

**If you would like to try Keras – ** https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

Great post. I am a beginner in Data Science/Machine Learning but I can’t wait to deepen my knowledge in multi variate time-series forecasts. It appears to be a very promising field. I’m going to print your post and read it carefully multiple times until I’ll be ready to deploy my own neural network with tensorflow.

LikeLike

Glad to hear that! I am happy that this post might help you in some way and wish you great progress in ML!

LikeLike

Hi Weimin,

This is an excellent post, thank you!

Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?

This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.

Regards,

Michael J

LikeLike

Hi Weimin,

This is an excellent post, thank you!

Do you think it would be possible to feed in the next day’s weather forecast to the decoder similar to how you’ve fed in the public holiday Boolean value?

This would make the system very flexible – for example an expert could create a ‘best-guess’ time series for the next day and feed that into the decoder to help it out. There would be many other possible inputs to provide here.

Regards,

Michael J

LikeLike

Hi Michael,

Thanks for your comment. I would strongly encourage you to try this and see whether this can improve the forecast accuracy compared to model with no extra input, and share your results:). Yes indeed there are other possible inputs to provide in addition to boolean values like you said. And it will also depend on the data set.

Thanks,

Weimin

LikeLike

I think I’ll give it a shot!

I found a paper that uses the same technique, passing in the day, week and hour to the decoder:

https://arxiv.org/abs/1610.09460

It looks like they achieved good results.

LikeLike

Thanks for the sharing. I checked the paper and it was quite an interesting case.

I think I may also try this dataset to forecast weather polution – https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

If I can get some interesting results, I will add it in to the post too. Stay tuned!

LikeLike