Using TensorFlow to build an image-to-text application


The latest version of my code on GitHub implements beam search for inference. I have also left the greedy sampling approach in place, in case anyone wants to compare. The repo has separate scripts for beam search and for greedy sampling.


Image captioning, or image-to-text, is one of the most interesting areas in Artificial Intelligence, combining image recognition and natural language processing. An image captioning model is a deep learning system that automatically produces captions that accurately describe images. The examples below make it easier to see what it does (the titles are captions predicted by the model):

In this article, we will discuss in detail how to use TensorFlow to write an image captioning model that can generate interesting text for any given input image.

The data we will be using comes from the Microsoft COCO competition.

The state-of-the-art model for image captioning was developed by Google and published in 2016, and they have also open-sourced their entire solution in TensorFlow.

I was amazed at their approach but was unable to find a comprehensive tutorial online. So I wrote this post to show what I understood about their model, and to explain my TensorFlow implementation of a simplified version of ShowAndTell. I hope this tutorial gives you some idea of how to build an image captioning model from scratch.

Please see the complete scripts in my GitHub repo.

The Model

Our model architecture is similar to Google’s 2016 Show and Tell model, but I have simplified many details so that both implementation and training are easier. The overall model architecture is shown below:


[Figure: overall model architecture, from Google’s paper]

The details of the model and its training approach are discussed in the following sections. The image model we use is Inception V3, and the layer we use to extract image features is the ‘pool_3:0’ layer of Inception V3: the next-to-last layer, which gives a 2,048-dimensional float description of the image. The main difference from Google’s architecture is that we do not fine-tune the image model during training; instead we simply use it as an image encoder to extract features (that is, Inception V3 stays frozen while the LSTM is trained).

We first extract the features and save them offline to local disk as numpy arrays. As far as I can see, this substantially improves computational efficiency, because feature extraction is done only once. It also makes our training convenient and easy to manage, since the two phases can be run independently at different times.

From beginning to end – TensorFlow part

Some basic understanding of TensorFlow will be useful (e.g. building a graph, executing a graph, and the concepts of variables, placeholders, feed_dict, etc.). If you do not have it yet, there are plenty of good resources on YouTube and elsewhere online. The recently released Stanford course, CS 20SI: Tensorflow for Deep Learning Research, is one of them.

1) Data pre-processing

We first need to download all the images from the MSCOCO challenge (more than 20 GB in total, about 120K images across training and validation) as well as the caption files. Then we need to generate the following:

  1. The vocabulary list of your choice (e.g. you can choose the top 5000 words by frequency in the captions data set, plus 4 additional special tokens: one for unknown words, one for the start of a sentence, one for the end of a sentence and one for padding. That gives 5000 + 4 words in total if you choose 5000.)
  2. Two dictionaries, word_to_index and index_to_word, based on the vocab above.
  3. All captions indexed using this word_to_index dictionary, with each caption padded to a fixed-length vector (e.g. length 25): the start and end tokens go at the beginning and end, the unknown token replaces every word not in the vocab, and padding tokens fill all remaining empty positions.
  4. Image features extracted using the pretrained Inception V3 model. Extract features from all your training and validation images, and save them as numpy arrays to local disk.
  5. Finally, the train_image_index and val_image_index lists, which match each caption to the correct row index of the feature numpy arrays created above. (Basically, they match each caption to its image.)
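Items 1 to 3 above can be sketched in plain Python as follows. This is only a minimal illustration, not the preprocessing script from the repo: the special-token names (`<NULL>`, `<START>`, `<END>`, `<UNK>`), the helper names `build_vocab` and `encode_caption`, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical special-token names; the names used in the repo may differ.
PAD, START, END, UNK = "<NULL>", "<START>", "<END>", "<UNK>"


def build_vocab(captions, top_k):
    """Return (word_to_index, index_to_word) for the top_k most frequent words."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    words = [PAD, START, END, UNK] + [w for w, _ in counts.most_common(top_k)]
    word_to_index = {w: i for i, w in enumerate(words)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


def encode_caption(caption, word_to_index, padding_len):
    """Index a caption, wrap it in START/END and pad (or truncate) to padding_len."""
    body = [word_to_index.get(w, word_to_index[UNK]) for w in caption.lower().split()]
    body = body[: padding_len - 2]  # leave room for START and END
    indices = [word_to_index[START]] + body + [word_to_index[END]]
    return indices + [word_to_index[PAD]] * (padding_len - len(indices))


captions = ["a dog runs on the grass", "a cat sits on the mat"]
word_to_index, index_to_word = build_vocab(captions, top_k=5)

# "flies" is not in the vocabulary, so it is mapped to the unknown token.
encoded = encode_caption("a dog flies", word_to_index, padding_len=8)
print(encoded)
```

Putting the padding token at index 0 is a convenient choice because freshly allocated zero arrays are then already "all padding".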

Step 1:

Run the preprocessing script to generate coco2014_captions.h5, which will contain all the data we need for training later on.

python --file_dir /home/ubuntu/COCO/dataset/COCO_captioning/ --total_vocab 2000 --padding_len 25

Step 2:

Extract the pre-trained features. Run the feature extraction script to extract Inception V3 features and save them as train2014_v3_pool_3.npy and val2014_v3_pool_3.npy in your local directory.

python --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/train2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/train2014_v3_pool_3 
python --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/val2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/val2014_v3_pool_3 

Before you run these steps, make sure you have downloaded captions_train2014.json and captions_val2014.json from the MSCOCO website and saved them in a folder, e.g. /home/ubuntu/COCO/dataset/COCO_captioning/. Then download all the training and validation images and put them in separate folders, e.g. /home/ubuntu/COCO/dataset/train2014 and /home/ubuntu/COCO/dataset/val2014.

Now we have all the data we need in ../COCO_captioning.

Please refer to the two scripts in the repo for details.

2) Build the graph

Next we embed the previously extracted image features as well as the training captions, and feed them into our model, which is just a chain of unrolled LSTM cells.

2.1 Embed sequences and image features

We start by embedding our words and image features into fixed-length vectors. The TensorFlow function that creates the embedding map for words is:

embedding_map = tf.get_variable(
    name="map",  # variable name assumed for illustration; initializer argument omitted
    shape=[config.vocab_size, config.embedding_size])

and we use tf.nn.embedding_lookup to map each caption into the embedding matrix.

seq_embedding = tf.nn.embedding_lookup(embedding_map, input_seqs)

For example, if our input_seqs is a batch of captions, the output seq_embedding will have shape [batch_size, sequence_length, config.embedding_size].
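To make the shapes concrete, here is a small plain-Python sketch of what an embedding lookup does: every word index is replaced by its row in the embedding map. The toy sizes and the random initialization range are assumptions for the example and are unrelated to the values in the configuration file.

```python
import random

random.seed(0)
vocab_size, embedding_size = 6, 4  # toy sizes, not the values from the config

# One row of embedding_size floats per word index.
embedding_map = [[random.uniform(-0.1, 0.1) for _ in range(embedding_size)]
                 for _ in range(vocab_size)]

# A batch of 2 captions, each 3 word indices long.
input_seqs = [[1, 4, 2], [1, 5, 2]]

# Lookup: replace every index by its row in the embedding map.
seq_embedding = [[embedding_map[idx] for idx in caption] for caption in input_seqs]

shape = (len(seq_embedding), len(seq_embedding[0]), len(seq_embedding[0][0]))
print(shape)  # (2, 3, 4), i.e. [batch_size, sequence_length, embedding_size]
```

Both captions start with index 1, so they share the same first embedding vector.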

Similarly, we embed the image features as follows:

image_embeddings = tf.contrib.layers.fully_connected(
    inputs=image_feature, num_outputs=config.embedding_size, activation_fn=None)  # arguments assumed

2.2 LSTM model

Our model is based on the LSTM cell. First we define the LSTM, and we can also add dropout to it if we want:

lstm_cell = tf.nn.rnn_cell.LSTMCell(
    num_units=config.num_lstm_units, state_is_tuple=True)
lstm_cell = tf.nn.rnn_cell.DropoutWrapper(
    lstm_cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

keep_prob should be a placeholder so that we can set it differently for training and inference (e.g. it should be 1.0 for inference).

Then the initial state is created by using the image_embeddings:

_, initial_state = lstm_cell(image_embeddings, zero_state)

where zero_state is created using the cell’s zero_state function:

zero_state = lstm_cell.zero_state(batch_size=config.batch_size, dtype=tf.float32)

We then feed our sequence embedding to the model. During training, the LSTM predicts each word of the caption given the current state of the model and the previous word in the caption. During inference, however, the previous word is unknown, so the model uses the word it generated itself at the previous time step.

The outputs from the model are then mapped by a weight matrix W into a tensor of shape [-1, vocab_size]. This is our ‘logits’ tensor, from which we can calculate the log loss or sample the output words.

We also create a variable called global_step, which tracks the total number of steps the model has run so far. It is used by the learning rate decay function, which we will talk about in the training part.

Remember that we padded our captions in the data preparation stage, so each of them is now a fixed-length vector (e.g. 25 in length). This is useful when we batch them up during training. However, when we evaluate the log loss, we need to multiply the per-word losses by a mask that has 1s for all normal words and 0s for all padded positions, so that padding does not contribute to the loss. We then divide the summed log loss by the sum of the mask, and this average is the final total loss of the model.
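The masking arithmetic can be sketched without TensorFlow. The function name `masked_log_loss` and the nested-list layout are assumptions made for this illustration; in the real graph the same computation is done on tensors.

```python
import math


def masked_log_loss(probs, targets, mask):
    """Average negative log-likelihood over the unmasked positions only.

    probs:   [batch][time][vocab] predicted word probabilities
    targets: [batch][time] index of the correct word
    mask:    [batch][time] 1 for real words, 0 for padded positions
    """
    total, count = 0.0, 0
    for p_seq, t_seq, m_seq in zip(probs, targets, mask):
        for p, t, m in zip(p_seq, t_seq, m_seq):
            total += -math.log(p[t]) * m  # padded positions contribute 0
            count += m
    return total / count


# One caption, three time steps, vocabulary of three words; the last step is padding.
probs = [[[0.1, 0.8, 0.1], [0.25, 0.25, 0.5], [0.9, 0.05, 0.05]]]
targets = [[1, 2, 0]]
mask = [[1, 1, 0]]

loss = masked_log_loss(probs, targets, mask)
print(loss)
```

Only the first two steps count here, so the result is the mean of -log(0.8) and -log(0.5); the prediction at the padded step has no effect.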

The core function we use here is tf.nn.dynamic_rnn. This very useful function unrolls the input sequence during training (so that we can avoid writing a ‘for’ loop). However, its inputs and outputs can be quite confusing at first. It is worth looking at a few example references to get a better understanding of this function.

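lstm_outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=seq_embedding,
    sequence_length=sequence_length, initial_state=initial_state)  # sequence_length: assumed name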

To explain: we pass in the LSTM cell we defined earlier, the input caption embedding, the actual length of each caption, and the initial state of the LSTM. It returns the outputs for each time step, which we will later map to the vocabulary, and the final state of the LSTM cell.
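As a rough mental model of what the unrolling does with these four inputs, here is a plain-Python loop over a toy cell. It is a sketch, not TensorFlow's implementation: `dynamic_unroll` and `running_sum_cell` are hypothetical names, and scalars stand in for vectors. It mirrors the documented behaviour that, past a caption's actual length, outputs are zeroed and the state is carried through unchanged.

```python
def dynamic_unroll(cell, inputs, sequence_length, initial_state):
    """inputs: [batch][time]; returns (outputs [batch][time], final_states [batch])."""
    outputs, final_states = [], []
    for seq, length, state in zip(inputs, sequence_length, initial_state):
        seq_outputs = []
        for t, x in enumerate(seq):
            if t < length:
                output, state = cell(x, state)
            else:
                output = 0.0  # past the true length: zero output, state kept as is
            seq_outputs.append(output)
        outputs.append(seq_outputs)
        final_states.append(state)
    return outputs, final_states


def running_sum_cell(x, state):
    """Toy cell: the new state is the running sum, and the output equals the state."""
    new_state = state + x
    return new_state, new_state


outputs, final_states = dynamic_unroll(
    running_sum_cell,
    inputs=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    sequence_length=[3, 2],          # the second sequence has one padded step
    initial_state=[0.0, 10.0],
)
print(outputs)       # [[1.0, 3.0, 6.0], [14.0, 19.0, 0.0]]
print(final_states)  # [6.0, 19.0]
```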

2.3 The returned model

Eventually, our returned ‘model’ is just a dict containing all the variables and placeholders.

return dict(
 total_loss = total_loss,
 global_step = global_step,
 image_feature = image_feature,
 input_mask = input_mask,
 target_seqs = target_seqs,
 input_seqs = input_seqs,
 final_state = final_state,
 initial_state = initial_state,
 preds = preds,
 keep_prob = keep_prob,
 saver = tf.train.Saver()
)

Please refer to the code here for details of building the model.

3) Train the model

Step 1: Load the data. The function load_coco_data is provided in the repo. We run it by passing in the /COCO_captioning/ directory. Details of loading the data can be found in the same file.

Step 2: Build the model we described in part 2). Note that we can also pass pretrained word vectors, as a numpy array, to the build_model function. This array is used to initialize the embedding_map in the graph. You are free not to do so, in which case the embedding_map is initialized from a random uniform distribution. Additionally, we need to pass in the model_config parameter, which we will describe in the next part, and the mode parameter, which is simply ‘train’.

Step 3: Set up the learning rate decay. Here we use exactly the same exponential decay function as Google’s Show and Tell model. It is created using tf.train.exponential_decay, with its parameters defined in the configuration file, which we will talk about soon. In addition, we pass in the global_step variable we defined previously.
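The schedule computed by tf.train.exponential_decay is learning_rate * decay_rate ^ (global_step / decay_steps), with integer division when staircase is set. The plain-Python function below reproduces that formula so the effect is easy to inspect; the numeric values are made up for the example and are not the ones in the configuration file.

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps,
                      decay_rate, staircase=False):
    """Decayed learning rate at a given global step."""
    if staircase:
        exponent = global_step // decay_steps  # decay in discrete jumps
    else:
        exponent = global_step / decay_steps   # decay smoothly at every step
    return initial_learning_rate * decay_rate ** exponent


# Illustrative values only: halve the learning rate every 1000 steps.
for step in (0, 500, 1000, 1500, 2000):
    smooth = exponential_decay(2.0, step, decay_steps=1000, decay_rate=0.5)
    stepped = exponential_decay(2.0, step, decay_steps=1000, decay_rate=0.5,
                                staircase=True)
    print(step, smooth, stepped)
```

With staircase enabled the rate stays constant within each block of decay_steps steps, which is why global_step has to be passed in and incremented by the training op.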

Step 4: Prepare the training op. We set up the training op to minimize the log loss we defined in our graph. We also pass in the learning rate decay function we just created, as well as other parameters such as the initial learning rate, optimizer and clip_gradients, all of which are defined in the configuration file.

Step 5: Train the model! At this stage, we have defined our entire model and are ready to train it.

We first create a session in which to do the training, and initialize all the variables. The total number of training iterations is then calculated from the number of epochs defined in the configuration file.

Then, for each iteration, we call a function called _step(). It fetches a batch of training data, prepares the input, target and mask sequences, and uses feed_dict to feed all the numpy arrays into the graph. It then runs the training op, which triggers the batch update for all trainable variables defined in the graph, and returns the total loss of the model after this iteration of training.
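One common way to derive the three sequences from a batch of padded captions is to shift the caption by one position, so that the model sees word t and must predict word t+1. The sketch below assumes that scheme and a padding index of 0; `prepare_batch` is a hypothetical helper, and the repo's _step() may build its sequences differently.

```python
def prepare_batch(captions, pad_index=0):
    """Split padded captions into input, target and mask sequences."""
    input_seqs = [caption[:-1] for caption in captions]   # everything but the last word
    target_seqs = [caption[1:] for caption in captions]   # the same words, shifted left
    # 1 where the target is a real word, 0 where it is padding.
    input_mask = [[int(word != pad_index) for word in target] for target in target_seqs]
    return input_seqs, target_seqs, input_mask


# One padded caption: START=1, two words (4, 7), END=2, then padding (0).
batch = [[1, 4, 7, 2, 0, 0]]
input_seqs, target_seqs, input_mask = prepare_batch(batch)
print(input_seqs)   # [[1, 4, 7, 2, 0]]
print(target_seqs)  # [[4, 7, 2, 0, 0]]
print(input_mask)   # [[1, 1, 1, 0, 0]]
```

The mask keeps the END token as a valid target, so the model also learns when to stop.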

Saving the session: sessions can be saved as checkpoints periodically during training by setting the ‘--saveModel_every’ parameter to the number of iterations to run between saves. When training is done, the final model is saved in the specified folder. Please note, however, that what is actually saved are the variables, not the graph itself. So later on, when we want to restore the model from those checkpoints, we need to rebuild the graph and then load the saved variables back into it.

Step 6: Monitor the progress. We print out the loss from time to time. However, this is the training loss, and it would be too time-consuming to run on all 40,000 validation images to get the validation loss. You are free to implement the validation loss yourself, or to evaluate on a selected subset of validation images. Alternatively, you can explore the other evaluation metrics used in the COCO competition.

What I did instead was to create another function called _step_val(), which samples 32 images from the validation set after every certain number of iterations (say 5,000) and runs inference on them to predict their captions. This way, I can see some intermediate captioning results from time to time during training.

Here’s the detailed code for training.

4) The configuration file

Most of the hyper-parameters declared in the configuration file are self-explanatory given the comments. There are basically two groups of hyper-parameters: ModelConfig and TrainingConfig. ModelConfig contains the hyper-parameters needed when building the graph, such as vocabulary size, embedding size, image feature size, LSTM cell size and so on. TrainingConfig contains the parameters used during training, such as the optimizer, learning rate and decay, number of epochs and gradient clipping.
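For orientation, a configuration file of this kind might look roughly like the sketch below. Only vocab_size, embedding_size, num_lstm_units, batch_size and clip_gradients are names that appear in this post; every other attribute name and all of the numeric values are placeholders invented for the example, so check the actual file in the repo.

```python
class ModelConfig:
    """Hyper-parameters needed to build the graph (placeholder values)."""

    def __init__(self):
        self.vocab_size = 2004          # e.g. 2000 words + 4 special tokens
        self.embedding_size = 512
        self.image_feature_size = 2048  # size of the Inception V3 pool_3 output
        self.num_lstm_units = 512
        self.batch_size = 32


class TrainingConfig:
    """Hyper-parameters used only during training (placeholder values)."""

    def __init__(self):
        self.optimizer = "SGD"
        self.initial_learning_rate = 2.0
        self.learning_rate_decay_factor = 0.5
        self.num_epochs_per_decay = 8
        self.clip_gradients = 5.0
        self.total_num_epochs = 3


model_config = ModelConfig()
training_config = TrainingConfig()
print(model_config.vocab_size, training_config.initial_learning_rate)
```

Keeping the two groups separate means the same ModelConfig can be reused to rebuild the graph at inference time without dragging in training-only settings.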

Here’s the configuration file.

5) Inference

There are many approaches we can use to generate a sentence given an image. The method we use here is beam search: iteratively consider the set of the k best sentences up to time t as candidates to generate sentences of size t + 1, and keep only the best k of the results. (A much simpler method is greedy sampling: choose the word with the highest probability as the output at time t, use it as the input at time t + 1, and so on.)
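The greedy variant is short enough to sketch in full. In this illustration the trained LSTM is replaced by a hypothetical `step_fn(word, state)` that returns next-word probabilities and a new state; the token indices and the toy probability table are assumptions for the example.

```python
def greedy_decode(step_fn, initial_state, start_index, end_index, max_len=20):
    """At every step pick the single most probable next word and feed it back in."""
    state, word, sentence = initial_state, start_index, []
    for _ in range(max_len):
        probs, state = step_fn(word, state)
        word = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if word == end_index:
            break
        sentence.append(word)
    return sentence


# Toy stand-in for the LSTM: next-word probabilities depend only on the previous word.
# Indices: 0 = padding, 1 = START, 2 = END, 3..5 = ordinary words.
TABLE = {
    1: [0.0, 0.0, 0.0, 0.6, 0.4, 0.0],
    3: [0.0, 0.0, 0.4, 0.0, 0.35, 0.25],
    4: [0.0, 0.0, 0.9, 0.06, 0.0, 0.04],
    5: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
}


def toy_step(word, state):
    return TABLE[word], state


greedy_sentence = greedy_decode(toy_step, None, start_index=1, end_index=2)
print(greedy_sentence)  # [3]
```

On this table greedy decoding commits to word 3 and then ends, for a total probability of 0.6 * 0.4 = 0.24, even though starting with word 4 leads to a more probable sentence (0.4 * 0.9 = 0.36). That kind of early commitment is what beam search is meant to avoid.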

Concretely, our inference steps are as follows:

a) Extract features from test images. Feature extraction is very similar to what we did previously in the feature extraction script. We load the Inception V3 model and feed in images from the test folder. What we get in the end is a numpy array of shape [num_of_images, 2048].

b) Feed the extracted features to the LSTM model. Once we have the V3 features, we feed them into the LSTM to get the initial state. This is done by the code below:

feed_dict = {model['image_feature']: features,
             model['keep_prob']: keep_prob}
state = sess.run(model['initial_state'], feed_dict=feed_dict)

c) Predict the whole captions using beam search.

1. Start with one caption object, which has the initial state from the previous step, a sentence consisting of the START token only, and a score of zero.

2. Iteratively generate k (the beam size, e.g. 3) new caption objects for each existing caption object. Each new caption object has one more word in its sentence than the existing caption, and holds the new state generated using the previous word. We also update each caption’s score by adding the log probability of the new word. After each iteration, we delete all the old captions and push all the newly generated caption objects into the pool.

3. Push a caption object into a TopN heap once its sentence reaches END. The TopN heap keeps only the top k completed captions by score.

4. Return the sentence of the caption with the highest score.
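The four steps above can be sketched in plain Python as follows. This is an illustration under assumptions rather than the caption generator from the repo: tuples stand in for caption objects, `heapq.nlargest` stands in for the TopN heap, the pool of partial captions is trimmed to the best k after every iteration, and `step_fn(word, state)` is a hypothetical stand-in for one LSTM step that returns next-word probabilities and the new state.

```python
import heapq
import math


def beam_search(step_fn, initial_state, start_index, end_index, beam_size=3, max_len=20):
    """Return the highest-scoring sentence (list of word indices, START and END included)."""
    # Each caption is (score, sentence, state); score is the summed log probability.
    partial = [(0.0, [start_index], initial_state)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, sentence, state in partial:
            probs, new_state = step_fn(sentence[-1], state)
            # Expand this caption with its beam_size most probable next words.
            for word in heapq.nlargest(beam_size, range(len(probs)), key=probs.__getitem__):
                if probs[word] <= 0.0:
                    continue  # avoid log(0)
                caption = (score + math.log(probs[word]), sentence + [word], new_state)
                (complete if word == end_index else candidates).append(caption)
        if not candidates:
            break
        # Drop the old captions and keep only the best beam_size new ones.
        partial = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    # Keep the top beam_size completed captions, then return the best of them.
    complete = heapq.nlargest(beam_size, complete, key=lambda c: c[0])
    return max(complete or partial, key=lambda c: c[0])[1]


# Toy stand-in for the LSTM: next-word probabilities depend only on the previous word.
# Indices: 0 = padding, 1 = START, 2 = END, 3..5 = ordinary words.
TABLE = {
    1: [0.0, 0.0, 0.0, 0.6, 0.4, 0.0],
    3: [0.0, 0.0, 0.4, 0.0, 0.35, 0.25],
    4: [0.0, 0.0, 0.9, 0.06, 0.0, 0.04],
    5: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
}

best = beam_search(lambda word, state: (TABLE[word], state), None,
                   start_index=1, end_index=2, beam_size=2, max_len=5)
print(best)  # [1, 4, 2]
```

On this table, always taking the single most likely word would start with word 3 and end there (0.6 * 0.4 = 0.24), whereas the beam also keeps word 4 alive and finds the more probable sentence (0.4 * 0.9 = 0.36). Raw summed log probabilities favour short sentences, so some implementations normalize the score by sentence length; that is not shown here.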

The code for caption generator can be found here.

d) Decode the captions and write them on the images. We use the function decode_captions to translate the predicted indices into words up to the end token, ignoring all padding tokens. We then write each caption on its image as the title and save the images in the specified local folder so that we can view them later. A sample saved image looks like this:
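The decoding step amounts to a dictionary lookup with two special cases. The sketch below is a simplified single-caption version written for illustration; the function name `decode_caption_sketch`, the token names and the choice to also skip the start token are assumptions, and the repo's decode_captions may differ.

```python
def decode_caption_sketch(indices, index_to_word,
                          end_token="<END>", skip_tokens=("<NULL>", "<START>")):
    """Turn a list of word indices into a sentence string."""
    words = []
    for idx in indices:
        word = index_to_word[idx]
        if word == end_token:
            break              # stop at the end token
        if word in skip_tokens:
            continue           # ignore padding (and the start token)
        words.append(word)
    return " ".join(words)


# Hypothetical vocabulary for the example.
index_to_word = {0: "<NULL>", 1: "<START>", 2: "<END>", 3: "<UNK>",
                 4: "a", 5: "dog", 6: "runs"}

caption = decode_caption_sketch([1, 4, 5, 6, 2, 0, 0], index_to_word)
print(caption)  # a dog runs
```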


The inference code can be found here.

Conclusion and future improvements

We have implemented the entire framework to build and train an image captioning model from scratch, and used the trained model to run inference on a new folder of test images.

During training, after around 3-4 epochs, I was able to see some meaningful results for the validation images, and the training loss had levelled off. The whole process (including data preprocessing and training) took me only a few hours on an Amazon AWS p2 instance with a single GPU.

However, the results are far from perfect, and there are many places where the model can be improved. I also tried using GloVe word vectors to initialize the word embedding for the vocabulary, but it did not help performance much.

Some possible areas for improvement are listed below:

  1. Fine-tuning the image model (Inception V3) while training the LSTM. According to Google, however, this should only be done after training the LSTM for a certain number of steps, to stabilize the gradients. It will also significantly increase the training time, as fine-tuning the image model is expensive.
  2. A partially guided training approach that sometimes lets the model use the words it generated itself in the previous steps, rather than always providing it with the previous words in the caption.
  3. Ensembling different models.
  4. Using a deeper LSTM.


  1. Convolutional Neural Networks for Visual Recognition
  2. Google Show And Tell Research Blog Post
  3. Microsoft COCO Competition
  4. Recurrent Neural Networks in TensorFlow II
