After our localization algorithm was running, we had a choice to make. We could abandon the original VGG model (which requires cropped images) and rely on the RetinaNet model alone (which detects objects automatically). Or we could evaluate RetinaNet’s proposed regions with the VGG model, classifying each as one parasite egg or another. Although the VGG model is much faster and more accurate (97% vs. 95%), we knew that object detection would dramatically improve the usefulness of our tool. So we asked: how can we use RetinaNet’s object detection without diluting the high accuracy we had achieved with VGG?

      We use both models to check each other’s work. Here’s how we do it:

  • First, the entire uncropped image is resized to a 224x224-pixel square and fed into the VGG model. If VGG classifies the image as a parasite with high confidence (currently set at 95%), we return that classification. This path is just as fast and accurate as before, because it is the VGG model on its own.

  • Image detected by VGG

  • However, if the VGG model fails to confidently identify a parasite, it passes the work on to RetinaNet. RetinaNet scans the image and returns all proposed regions, even those with very low confidence, because we didn’t want to miss anything during this step. If a window cannot be expanded to a perfect square while remaining within the image, we discard it (this misses some objects near the edge, but ensures that we can use VGG later).

  • Region proposals from RetinaNet

  • Overlapping windows are resolved. We use the metric "intersection over minimum area" (IoM), which calculates what percentage of the smaller window intersects with the larger window. The red portion in the images below depicts the intersection; in this case IoM = red/(red + yellow). If IoM is over a certain amount (currently 70%), we keep the window with the higher confidence score. Doing this for every pair in the set of overlapping windows leaves us with the one that RetinaNet most confidently predicts.

  • Intersection over minimum area

  • After the overlapping windows are resolved, we are usually left with only a few windows. These are fed into VGG to obtain a prediction. If VGG returns a high confidence score, we return it to the user.

  • If VGG disagrees with RetinaNet’s prediction or its confidence falls below the threshold, we tell the user that we cannot make a confident prediction for that object. This is a form of ensemble method, where multiple models are used to evaluate a single image.

  • Finding and classifying images
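The window-handling steps above (squaring each proposal, then resolving overlaps with IoM) can be sketched roughly as follows. This is a minimal illustration and not our production code: the function names, the `(x_min, y_min, x_max, y_max)` box format, and the choice to grow each window around its centre are all assumptions made for the example. The overlap pass is written as a greedy, highest-score-first filter, which is one common way to apply a pairwise "keep the more confident window" rule.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def expand_to_square(box: Box, img_w: float, img_h: float) -> Optional[Box]:
    """Grow a window around its centre into a square.

    Returns None when the square would leave the image, which
    corresponds to discarding the window. (Centre-based growth is an
    assumption; shifting the square to fit is not attempted here.)
    """
    x1, y1, x2, y2 = box
    side = max(x2 - x1, y2 - y1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sq = (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)
    if sq[0] < 0 or sq[1] < 0 or sq[2] > img_w or sq[3] > img_h:
        return None
    return sq


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_over_min(a: Box, b: Box) -> float:
    """Fraction of the smaller window that is covered by the other one."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min(area(a), area(b))
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def resolve_overlaps(
    detections: List[Tuple[Box, float]], threshold: float = 0.7
) -> List[Tuple[Box, float]]:
    """Keep the higher-scoring window of every pair whose IoM exceeds threshold.

    Windows are visited from most to least confident; a window is kept
    only if it does not overlap too much with one that was already kept.
    """
    kept: List[Tuple[Box, float]] = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(intersection_over_min(box, k[0]) <= threshold for k in kept):
            kept.append((box, score))
    return kept
```

For example, a small window lying entirely inside a larger one has an IoM of 1.0 no matter how large the outer window is, so the less confident of the two is dropped. Ordinary intersection over union would score that same pair much lower, which is why IoM suits nested proposals.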

      Using the two models to check each other hedges against high-consequence mistakes. We would much rather tell the user we are unsure, even though we may be right, because in all likelihood the next picture they upload will be clear enough to recognize.
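To make the hand-off between the two models concrete, here is a hedged sketch of the decision logic. The `vgg`, `retinanet`, and `crop` callables are hypothetical stand-ins, not real library APIs: `vgg` is assumed to resize its input to 224x224 and return a `(label, confidence)` pair, and `retinanet` is assumed to return windows that have already been squared and overlap-resolved. Reusing the 95% threshold for the per-window check is also an assumption.

```python
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]

# Hypothetical model interfaces, for illustration only.
Classifier = Callable[[Any], Tuple[str, float]]            # image -> (label, confidence)
Detector = Callable[[Any], List[Tuple[Box, str, float]]]   # image -> [(box, label, score)]
Cropper = Callable[[Any, Box], Any]                        # (image, box) -> cropped image


def classify_image(
    image: Any,
    vgg: Classifier,
    retinanet: Detector,
    crop: Cropper,
    threshold: float = 0.95,
) -> List[Tuple[Optional[Box], Optional[str]]]:
    """Return (box, label) pairs; a label of None means "not confident"."""
    # Step 1: try VGG alone on the whole image.
    label, conf = vgg(image)
    if conf >= threshold:
        return [(None, label)]  # box is None: the whole image was classified

    # Step 2: fall back to RetinaNet's windows and let VGG check each one.
    results: List[Tuple[Optional[Box], Optional[str]]] = []
    for box, det_label, _score in retinanet(image):
        vgg_label, vgg_conf = vgg(crop(image, box))
        if vgg_label == det_label and vgg_conf >= threshold:
            results.append((box, vgg_label))
        else:
            # The models disagree or VGG is unsure: report no prediction.
            results.append((box, None))
    return results
```

Passing the models in as plain callables keeps the sketch independent of any particular framework; in practice each would wrap whatever inference call the deployed model exposes.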

Cameron Bell
Ultrarunner, backcountry skier, dad, and data scientist.