Introductory Tutorial to TensorFlow Serving

Building and training a TensorFlow model is relatively easy, or at least you can find many great starting scripts to help you begin. Serving a trained model, however, is not as straightforward: you need a separate library called TensorFlow Serving, and there aren’t many examples out there (most of them focus on image models). This post is a basic tutorial to get things started.

This post is divided into the following parts:

  1. Train and save a TF model in protobuf format. 
  2. Download the TensorFlow Serving Docker image with all relevant dependencies, and set up the serving environment. 
  3. Serve the model locally!
  4. What’s next?
  5. References

The focus here is to demo serving a simple model (linear regression), which can then be used as a starting point for your own serving projects. If you are interested in serving an image model, where your input is a .jpeg image or similar, there are many great tutorials out there.

You can see the complete code for the model-export script as well as the client script in my GitHub repo.

1. Train and export a linear regression model.

The training part is very simple: you have an input x; w and b, the weight and bias variables you want to learn; and y and y_, which are the model output and the regression target, respectively. We let the model learn a fixed linear equation like the one below:

y = x1 * 1 + x2 * 2 + x3 * 3

So here’s the code:

  • Input argument declarations (which are self-explanatory):

tf.app.flags.DEFINE_integer('training_iteration', 300,
                            'number of training iterations.')
tf.app.flags.DEFINE_integer('model_version', 1, 'version number of the model.')
tf.app.flags.DEFINE_string('work_dir', '', 'Working directory.')
FLAGS = tf.app.flags.FLAGS
  • Training:
import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()

x = tf.placeholder('float', shape=[None, 3])
y_ = tf.placeholder('float', shape=[None, 1])
w = tf.get_variable('w', shape=[3, 1], initializer=tf.truncated_normal_initializer)
b = tf.get_variable('b', shape=[1], initializer=tf.zeros_initializer)

y = tf.matmul(x, w) + b

ms_loss = tf.reduce_mean((y - y_) ** 2)

train_step = tf.train.GradientDescentOptimizer(0.005).minimize(ms_loss)

train_x = np.random.randn(1000, 3)
# let the model learn the equation of y = x1 * 1 + x2 * 2 + x3 * 3
train_y = np.sum(train_x * np.array([1, 2, 3]) + np.random.randn(1000, 3) / 100,
                 axis=1).reshape(-1, 1)


for _ in range(FLAGS.training_iteration):
    loss, _ =[ms_loss, train_step], feed_dict={x: train_x, y_: train_y})
print('Training error %g' % loss)

print('Done training!')
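As a sanity check independent of TensorFlow, you can verify that the synthetic data generated above really encodes the coefficients [1, 2, 3]: plain least squares recovers them almost exactly, since the injected noise is tiny. This is just a sketch reusing the same data-generation code as in the training script:

```python
import numpy as np

np.random.seed(0)  # for reproducibility of this sketch

train_x = np.random.randn(1000, 3)
# same target construction as in the training script above
train_y = np.sum(train_x * np.array([1, 2, 3]) + np.random.randn(1000, 3) / 100,
                 axis=1).reshape(-1, 1)

# append a column of ones so lstsq also fits the bias term
X = np.hstack([train_x, np.ones((1000, 1))])
coef, _, _, _ = np.linalg.lstsq(X, train_y, rcond=None)

print(coef.ravel())  # approximately [1. 2. 3. 0.]
```

If gradient descent converges, w and b should end up at essentially these values.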

After training, you want to export the whole graph as well as the trained variables to disk.

You will need to create a tf.saved_model.builder.SavedModelBuilder object and pass it the directory where your model will be saved. You will use it to save your model later.

Before that, you will need to build the ‘prediction_signature’, which basically declares the ‘input’ and ‘output’ tensors of your graph.

The code for doing so is below:

import os

export_path_base = FLAGS.work_dir
export_path = os.path.join(
    tf.compat.as_bytes(export_path_base),
    tf.compat.as_bytes(str(FLAGS.model_version)))
print('Exporting trained model to', export_path)
builder = tf.saved_model.builder.SavedModelBuilder(export_path)

tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y)

prediction_signature = (
        inputs={'input': tensor_info_x},
        outputs={'output': tensor_info_y},

legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')

    sess, [tf.saved_model.tag_constants.SERVING],
    signature_def_map={'prediction': prediction_signature},

print('Done exporting!')

Once done, you should be able to see a folder named ‘1’ inside your working directory. When you cd into the ‘1’ folder, you should see the two files/folders below:

saved_model.pb variables

where the first is your serialized model in protobuf format, which includes the graph definition of the model as well as model metadata such as signatures, and the second contains the serialized variables of the model.

By now you have trained and saved your model!

2. Set up the docker environment.

Why Docker – you could install everything, including dependencies, and build it all on your own machine, then serve the model in your local environment. However, I would rather go the ‘clean’ way: download a Docker image with all dependencies installed, and then serve my model inside the container.

This makes sure everything needed to serve the model lives inside a minimal ‘VM’ created solely for the purpose of serving. Also, Google has prepared a Docker image file with every setup step, so all you need to do is ‘pull’ the image, and everything will be set up and configured for you!

To install Docker on your machine, follow the official Docker installation guide.


Below is a sequential list of steps you can follow along to set up Docker once it is installed.

1) Clone the serving repo to your local machine.

git clone --recursive
cd serving

2) Create the docker container with all dependencies.

docker build --pull -t $USER/tensorflow-serving-devel -f tensorflow_serving/tools/docker/Dockerfile.devel .

3) Run and get inside the docker container

docker run --name=tensorflow_container -it $USER/tensorflow-serving-devel

4) Clone the serving repo again, this time inside Docker (run this inside your docker terminal!), and run TensorFlow’s configure script. When asked, simply go with the default values for the first few questions, and answer ‘No’ to the rest for a faster build.

git clone --recursive
cd serving/tensorflow

5) Build TensorFlow Serving inside Docker. (This might take around 20-30 mins.)

cd .. 
bazel build -c opt tensorflow_serving/...

6) Once the build is done, you can sanity-check it by running the model server binary (bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server); it should print a usage message listing its available flags.
3. Serve the model

The steps are: launch your gRPC server, write the client script, and make an inference.

But first, you will need to copy your previously trained model – the files inside the ‘1’ folder – into your Docker container. To do so:

docker cp YOUR_DIR_OF_TRAINED_MODEL/1 YOUR_CONTAINER_ID:/serving/my_model/1

and make sure the paths are correct.

a) Launch the server

Inside your /serving directory, run below command:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=example_model --model_base_path=/serving/my_model &> my_log &

Take note:

  1. You can use any name for ‘--model_name’, but it must be consistent with the name used later in your client script.
  2. ‘--model_base_path’ must be an absolute path to your model’s directory.

If successful, you can ‘cat’ your my_log file and see, as the last line, something like:

Running ModelServer at ...

Then your server is up and running!

b) Prepare the client script

The next step is to write the client script, which I will call

Since our model is a simple linear regression, I will prepare some random test data and simply let the model predict the outputs. All the ingredients are wrapped inside one function called do_inference, as below:

from grpc.beta import implementations
import numpy
import tensorflow as tf

from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

tf.app.flags.DEFINE_string('server', 'localhost:9000', 'PredictionService host:port')
FLAGS = tf.app.flags.FLAGS

def do_inference(hostport):
  """Tests PredictionService with a batch of random requests.

    hostport: Host:port address of the PredictionService.

    pred values, ground truth label
  # create connection
  host, port = hostport.split(':')
  channel = implementations.insecure_channel(host, int(port))
  stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

  # initialize a request
  request = predict_pb2.PredictRequest() = 'example_model'
  request.model_spec.signature_name = 'prediction'

  # randomly generate some test data
  temp_data = numpy.random.randn(10, 3).astype(numpy.float32)
  data, label = temp_data, numpy.sum(
      temp_data * numpy.array([1, 2, 3]).astype(numpy.float32), 1)
      tf.contrib.util.make_tensor_proto(data, shape=data.shape))

  # predict
  result = stub.Predict(request, 5.0)  # 5 seconds timeout
  return result, label

def main(_):
  if not FLAGS.server:
    print('please specify server host:port')

  result, label = do_inference(FLAGS.server)
  print('Result is: ', result)
  print('Actual label is: ', label)

if __name__ == '__main__':

Basically, it creates a connection to your server, prepares the request object, generates some test data for the request, and then sends the request to the server for prediction.

Please note that the two lines below specify the name of the model we just launched (it must be the same name), as well as the ‘prediction’ signature we built into our model earlier. = 'example_model'
request.model_spec.signature_name = 'prediction'

c) Run for inference!

Finally, we come to the stage of making an inference against the server. For this step, I choose to run with ‘python’ on the command line. You will then need to pip install both tensorflow-serving-api and tensorflow for your Python, as they are dependencies of

pip install --upgrade pip
pip install tensorflow
pip install tensorflow-serving-api

However, if you want to run it with bazel instead of the python command, you can explore that route as well. Basically, you will need to write a ‘BUILD’ file and put it together with, then use bazel to build the client. Once built, you can run it without worrying about those dependencies. Both approaches work, but I will go the ‘python’ way in this tutorial, which is easier. 
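If you do go the bazel route, a hypothetical BUILD file might look roughly like the client targets shipped with TF Serving (the dependency labels below are assumptions modeled on the repo’s example BUILD files, so check tensorflow_serving/example/BUILD for the exact names):

```python
# Hypothetical BUILD file placed next to;
# modeled loosely on tensorflow_serving/example/BUILD in the serving repo.
    name = "client",
    srcs = [""],
    deps = [

You would then build it with something like `bazel build //your/package:client` and run the resulting binary from bazel-bin.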

Run your for prediction!

python --server localhost:9000

You will see outputs similar to below:

root@dec55442cba1:/serving# python --server localhost:9000
Result is: outputs {
  key: "output"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 10
      dim {
        size: 1
    float_val: -3.33661770821
    float_val: -2.63279175758
    float_val: -1.54009318352
    float_val: -7.66162586212
    float_val: 1.74936723709
    float_val: 6.54203081131
    float_val: 2.55403280258
    float_val: 4.98167848587
    float_val: 2.72653889656
    float_val: 2.32017993927

Actual label is: [-3.46463823 -2.7302022 -1.66026235 -7.96920013 1.81441534 6.79461861
 2.65168643 5.21349049 2.85487795 2.43829441]

Our outputs and labels are close, which suggests the model has learned the target equation!
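To quantify “close”, we can compare the float_val predictions and the labels from the transcript above; the mean absolute error comes out around 0.15, which is reasonable for a model trained for only 300 gradient steps:

```python
import numpy as np

# predictions (float_val) and labels copied from the transcript above
preds = np.array([-3.33661770821, -2.63279175758, -1.54009318352, -7.66162586212,
                  1.74936723709, 6.54203081131, 2.55403280258, 4.98167848587,
                  2.72653889656, 2.32017993927])
labels = np.array([-3.46463823, -2.7302022, -1.66026235, -7.96920013, 1.81441534,
                   6.79461861, 2.65168643, 5.21349049, 2.85487795, 2.43829441])

mae = np.mean(np.abs(preds - labels))
print('mean absolute error: %.3f' % mae)  # prints 0.155
```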

d) You can also use a Flask server!

You can also wrap the client inside a Flask server in Python, so that you can flexibly send HTTP requests to it with prediction data, instead of the one-shot call above. This additional Flask layer only forwards the data to the TensorFlow server and returns the results it receives from the server.

Based on our load test, the Flask + TF Serving architecture has relatively low latency.

The Flask script can be found in my repo here.

The basics are the same. Once it runs, you will see output like below:

root@8445d7955418:/serving/weimin_model# Initialization done.
 * Running on (Press CTRL+C to quit)

And you could send POST requests to it such as:

curl -X POST YOUR_FLASK_SERVER_URL \
 -H 'cache-control: no-cache' \
 -H 'content-type: application/json' \
 -H 'postman-token: 1b4663d0-fc47-007a-673d-721ebad9985e' \
 -d '[1,2,3]'

4. What’s next?

In this post, we have demonstrated how to use TensorFlow to train and export a simple linear regression model to disk, set up the Model Serving environment in Docker, and serve the model locally.

Naturally, the next step would be to deploy the model in production, and eventually to automate the process of training, deployment, and management, and to be able to scale the service up as the number of requests increases.

I will possibly save that for the next post, and share more later on about how to deploy models using tools like Google’s Kubernetes, a highly scalable container engine.

5. References


Possible Errors:

ImportError: No module named autograd

The Docker image lacks the ‘autograd’ dependency, so simply do a ‘pip install autograd’.


9 thoughts on “Introductory Tutorial to TensorFlow Serving”

  1. Hi there, I followed your steps, but I’m getting `Connect Failed` when sending a POST request to the Flask client. What I did was simply run the Flask script with python and send the POST request from this article. Do you know what the problem is?


    1. Hmmm, that’s strange… Did you successfully launch the TF server beforehand? Did you use the same port in the TF server as in your Flask client?

      Which line of code gives that error? Was it from Flask itself, or from the connection to the TF server? Alternatively, you may want to Google that error, because I haven’t seen it before…


      1. Never mind, the problem was that I hadn’t mapped and exposed the port from Docker to the external environment. But thank you for your reply.


  2. I actually made this architecture work on Kubernetes, with servers and clients served in different Docker containers.


    1. Nice. It would be good if you could share it, so we can all learn from it.


      1. Yes, I am going to write it down when I have time; I would like you to review it.


  3. Hi, I worked with the Flask file. It says it’s running, but when I accessed that port in a web browser it said the site can’t be reached. How do I see the output on the web screen?


    1. my container is running on and my VM is running on


  4. Thanks for your tutorial. Could you please let me know how to implement SSL/TLS in TensorFlow Serving?

