Computer vision has three main tasks:
- Object detection assesses whether or not there is an object in an image.
- Object recognition classifies that object as one of multiple classes.
- Object localization tells us where the objects are in the image.
We began with object localization. After deciding it would be quickest to draw bounding boxes manually, we developed a neural network to classify cropped, square images as one of 9 classes of helminth eggs.
We were so excited when the model’s accuracy jumped to over 99% after we implemented transfer learning and curated our dataset, but the web application was not very user-friendly. We knew that the tool would need to perform object detection in order to be practical. We couldn’t expect a volunteer at a rural health outpost to perfectly crop each image to the right parasite - we wanted the tool to handle that step automatically.
Object localization has made great strides over the years. Here is a brief overview of its evolution:
- The “sliding window” technique, in which you slide a window across the image at different sizes, classifying as you go.
- Using selective search to propose regions in the image, then classifying them as object or background with a neural network (R-CNN)
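- Running the neural network once over the whole image and classifying each selective-search region from the shared feature map (Fast R-CNN)
- Replacing selective search with a learned region proposal network, combining region proposal and classification into one neural network (Faster R-CNN)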
- Dividing the image into a grid and computing object probabilities for each cell (YOLO, or “You Only Look Once”)
Image from YOLO
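To make the first item in the list concrete, here is a minimal sketch of how sliding-window candidates can be generated. It is not our production code: the function name `sliding_windows`, the window sizes, and the 50% stride are illustrative assumptions. Every box it yields would have to be cropped and passed through the classifier, which is why the approach gets slow quickly.

```python
from typing import Iterator, Sequence, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def sliding_windows(
    image_w: int,
    image_h: int,
    window_sizes: Sequence[int],
    stride_fraction: float = 0.5,  # assumed overlap; not a value from our project
) -> Iterator[Box]:
    """Yield square candidate boxes at several sizes across the image."""
    for size in window_sizes:
        if size > image_w or size > image_h:
            continue  # window does not fit inside the image
        stride = max(1, int(size * stride_fraction))
        for top in range(0, image_h - size + 1, stride):
            for left in range(0, image_w - size + 1, stride):
                yield (left, top, left + size, top + size)


# Hypothetical 640x480 image scanned at three window sizes.
windows = list(sliding_windows(640, 480, window_sizes=(64, 128, 256)))
print(len(windows), "crops to classify")
```

Even this small example produces hundreds of crops for a single image, and each crop costs one forward pass through the classifier.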
We tested a sliding window approach, but it was too slow (over a minute for one image!), which led us to YOLO. While YOLO is super fast, it sacrifices accuracy. We knew that accuracy was most important (given a reasonable execution time) because of the consequences of our model’s predictions, so we looked to Faster R-CNN to increase accuracy. That search led us to “Focal Loss for Dense Object Detection”1 and the amazing keras-retinanet package that implements it. The paper’s detector, called “RetinaNet”, is a single-stage network rather than a Faster R-CNN variant, but it matches or exceeds the accuracy of two-stage models like Faster R-CNN by using a loss function called Focal Loss. In a nutshell, object detection algorithms tend to struggle because there are so many background examples and so few positive object examples (in our case, stool background vs. helminth eggs). Focal Loss focuses training on the positive objects and difficult background artifacts by down-weighting easy background examples, which counteracts the imbalance.
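For readers who want to see the down-weighting in action, the sketch below implements the per-example binary focal loss from the paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. It is an illustration only, not the keras-retinanet implementation: the function names are our own, and alpha = 0.25 and gamma = 2 are the defaults reported in the paper, not necessarily the values we trained with.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Standard binary cross-entropy for one example.

    p is the predicted probability of the positive class; y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-7), 1.0 - 1e-7)  # clip to avoid log(0)
    return -math.log(p_t)


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one example (paper defaults assumed)."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-7), 1.0 - 1e-7)       # clip to avoid log(0)
    # (1 - p_t) ** gamma shrinks toward 0 when the example is already easy
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background patch (model is 99% sure it is background)
# versus a hard one (model is 90% sure it is an egg, but it is not).
easy_ce, easy_fl = cross_entropy(0.01, 0), focal_loss(0.01, 0)
hard_ce, hard_fl = cross_entropy(0.90, 0), focal_loss(0.90, 0)
print(f"easy background: CE={easy_ce:.6f}  FL={easy_fl:.8f}")
print(f"hard background: CE={hard_ce:.4f}  FL={hard_fl:.4f}")
```

The easy background example contributes thousands of times less to the focal loss than to plain cross-entropy, while the hard example keeps most of its weight, so the many easy background patches no longer drown out the rare eggs.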
Soon enough (after provisioning a new server - our main SoftLayer instance was overworked with so many models training!) we had a workable, customized implementation of the RetinaNet model trained on our dataset. It is built on the resnet50 backbone (the ResNet architecture won the 2015 ImageNet Challenge), with our own weights on the output layers, and it reaches 95.4% classification accuracy (at the time of this article). Read more in a subsequent blog post about how we use it.
Banner image from YOLO