Because our dataset is the first of its kind, the images needed to be processed into a clean format for modeling. We started with a folder structure within an Amazon S3 bucket that identified the source of each image (e.g., different research labs or pictures taken by Kiersten), its label (type of helminth egg), and the type of microscope used. We then flattened the folder structure, converted all images to JPEG format, and renamed them sequentially: ‘00000001.jpg’, ‘00000002.jpg’, and so on. Along with the images, the S3 bucket contained a CSV file recording each image’s file name, source, label, and type of microscope.
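As a rough illustration of the flatten-and-rename step, here is a minimal sketch that works on a local copy of the images rather than directly on S3. The `source/label/microscope/` folder layout, the `flatten_dataset` name, the `labels.csv` file name, and the column names are all assumptions made for the example, not the project's actual code. It also assumes the files are already JPEGs, since converting formats would need an imaging library.

```python
import csv
import shutil
from pathlib import Path

# Column names are hypothetical; the post only says the CSV held these four facts.
FIELDS = ["file_name", "source", "label", "microscope"]


def flatten_dataset(src_root, dst_dir, manifest_name="labels.csv"):
    """Copy src_root/<source>/<label>/<microscope>/<image> into one flat folder.

    Returns the manifest rows that were also written to dst_dir/manifest_name.
    """
    src_root, dst_dir = Path(src_root), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    # Sort the paths so the numbering is the same on every run.
    for path in sorted(p for p in src_root.rglob("*") if p.is_file()):
        parts = path.relative_to(src_root).parts
        if len(parts) != 4:
            continue  # skip anything outside the assumed three-level layout
        source, label, microscope, _ = parts
        new_name = f"{len(rows) + 1:08d}.jpg"  # 00000001.jpg, 00000002.jpg, ...
        shutil.copyfile(path, dst_dir / new_name)
        rows.append({
            "file_name": new_name,
            "source": source,
            "label": label,
            "microscope": microscope,
        })

    with open(dst_dir / manifest_name, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping the metadata in a single CSV means the folder names no longer have to carry it, so the images can live side by side under simple numeric names.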

This flat layout was convenient for modeling, but we still needed to standardize the format of each image before feeding it into a neural network. In particular, we needed square images that contained only the parasite. Though we explored automated methods of extracting the parasite from the image, we ultimately decided it would be quicker to draw bounding boxes by hand using a free program called FIJI. All ~2,000 positive images (images containing a parasite) were bounded with a few hours’ contribution from each team member, and the coordinates of each square bounding box went into the CSV file.

FIJI in action

A few of the images were closely cropped and rectangular, so we had to add pixels until they were square:

Adding space to make the image square
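The padding idea can be sketched in a few lines. This version works on a plain row-major grid of pixel values so that it runs without any imaging library. Centering the original image and filling with a constant value are assumptions, since the post does not say where the extra pixels were added or what color they were, and `pad_to_square` is a hypothetical helper name.

```python
def pad_to_square(pixels, fill=0):
    """Return a square copy of a row-major pixel grid.

    The shorter axis is padded with `fill` so the original content ends up
    centered (one pixel off-center when the difference is odd).
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    side = max(height, width)

    top = (side - height) // 2   # rows of padding above the image
    left = (side - width) // 2   # columns of padding to the left of the image

    padded = [[fill] * side for _ in range(side)]
    for y, row in enumerate(pixels):
        padded[top + y][left:left + width] = row
    return padded
```

For example, a 2x4 grid becomes a 4x4 grid with one row of fill above and one below, while an image that is already square comes back unchanged.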

Now it was simple to download the dataset, crop the images, resize them to a standard size, and feed them into a neural network to train.
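To make the crop-and-resize step concrete, here is a small sketch on the same kind of plain pixel grid. The `(left, top, side)` box format and the `crop_and_resize` name are assumptions for illustration, since the post does not say how the coordinates were stored. Nearest-neighbor sampling stands in for whatever resampling an imaging library would apply in a real pipeline.

```python
def crop_and_resize(pixels, box, size):
    """Crop a square box from a row-major pixel grid and resize it.

    box  -- (left, top, side) of the square bounding box, in pixels
    size -- width and height of the output, in pixels
    Resizing uses nearest-neighbor sampling: each output pixel copies the
    source pixel at the proportionally scaled position.
    """
    left, top, side = box
    cropped = [row[left:left + side] for row in pixels[top:top + side]]
    return [
        [cropped[y * side // size][x * side // size] for x in range(size)]
        for y in range(size)
    ]
```

Because every bounding box is square, a single `size` parameter is enough and the parasite keeps its aspect ratio after resizing.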

Cameron Bell
Ultrarunner, backcountry skier, dad, and data scientist.