Though AWS Lambda was an elegant solution, it places strict limits on code package size and memory. Because the Keras and TensorFlow libraries take up so much space on their own, there is not much room left for a model. As we trained new and better models, they grew in size, and we eventually outgrew AWS Lambda as a complete solution.

Searching for a New Home

We considered several options for serving our model to the developing world, as we had originally hoped to do.

  • I naively thought we could have our Lambda function simply perform actions on our compute server via SSH. It turns out this is more complicated than it seems and does not make for a robust serving architecture.
  • We explored keras.js and TensorFlow.js, both interesting solutions. However, our model was too large for either of them to be feasible. We will look into them eventually, though: we would love for the app to run on a phone without an internet or cell connection, even though that would require drastic changes to the model. This was one of my favorite demos I encountered.
  • I also considered freezing the Keras model as a TensorFlow graph (here is an excellent tutorial) in order to use TensorFlow Serving; a rough sketch of the freezing step appears after this list. This is still a strong candidate for the future, as it is already production-ready. You can find more information here.
  • Ultimately, I decided to go with Flask, a lightweight Python web framework that makes it easy to expose Python code as a REST API. It keeps developing and testing new code simple, so it suits our purposes well for now.
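
As a rough illustration of the freezing option above, here is a minimal sketch assuming a TensorFlow 1.x setup with standalone Keras; the model path and output file name are placeholders, not our actual artifacts:

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

# Placeholder path to a trained Keras model
model = load_model('models/model.h5')

# Grab the session Keras is using and bake its variables into constants
sess = K.get_session()
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,
    sess.graph.as_graph_def(),
    [out.op.name for out in model.outputs])

# Write the frozen graph to disk for a serving system to load
tf.train.write_graph(frozen_graph_def, 'models', 'frozen_model.pb', as_text=False)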

Building a Flask API

A basic Flask app looks like this:

from flask import Flask
app = Flask(__name__)
 
@app.route("/")
def hello():
    return "Hello World!"
 
if __name__ == "__main__":
    app.run()

(taken from here) Try it! Running it from the command line starts a development server listening on port 5000 by default:

$ python hello.py
 * Running on http://localhost:5000/

After installing uwsgi using these instructions, we can host a production-worthy server with the following command:

uwsgi --master --http 0.0.0.0:5000 --wsgi-file flaskify.py --callable app --processes 2 --threads 2 --virtualenv flaskenv --logto serverlog.log &

The above command specifies the process manager (--master), the app source code (--wsgi-file flaskify.py --callable app), the computing resources (--processes 2 --threads 2), the virtual environment directory (--virtualenv flaskenv), and the log file (--logto serverlog.log); the trailing & runs the server in the background.
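
For context, the --callable app flag tells uWSGI to look for a module-level Flask object named app inside flaskify.py, the same pattern as the hello-world example above. A bare-bones sketch (the route and return value here are placeholders; the full model-serving version appears below):

from flask import Flask

# uWSGI imports this module (--wsgi-file flaskify.py) and serves the
# object named by --callable, i.e. "app"
app = Flask(__name__)

@app.route("/<path:variables>")
def predict(variables):
    # Placeholder response; the real app calls the model here
    return "prediction for {}".format(variables)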

Our model could then be called with the following simple Python code:

import requests

# server_url, port, and variables are placeholders for the server's address,
# port, and request path (for example, the image path to classify)
url = 'http://{}:{}/{}'.format(server_url, port, variables)
response = requests.get(url)
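
Assuming the route returns its prediction as plain text, reading the result on the client side is straightforward:

# Hypothetical response handling, continuing from the request above; the
# exact format depends on what the Flask route returns (plain text assumed)
if response.status_code == 200:
    prediction = response.text
    print(prediction)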

Keeping the Model in Memory

The last hurdle was to enable the app to hold our model in memory, so it could return a prediction very quickly instead of loading the model from disk each time the API was called. This is trickier than it sounds, due to TensorFlow sessions (which are outside the scope of this post); it can be simpler to use a frozen graph (see this excellent post).

We finally got the model to remain in memory by storing the model and its TensorFlow graph on the Flask app object. The implementation looks like this:

from flask import Flask
from keras.models import load_model
import tensorflow as tf

app = Flask(__name__)

# Perform these actions when the first
# request is made
@app.before_first_request
def load_model_to_app():
    # Load the model
    app.model = load_model('models/model.h5')
    
    # Save the graph to the app framework.
    app.graph = tf.get_default_graph()

@app.route("/<image_path>")
def classify(image_path):
    model = app.model
    graph = app.graph
    with graph.as_default():
        return our_function(image_path, model)

This is how we deliver a prediction so quickly.



Cameron Bell
Ultrarunner, backcountry skier, dad, and data scientist.