Save precious time with Image Augmentation in Object Detection Tasks

Alex Vaith
7 min read · Oct 1, 2021
annotated image with boxes around faces. source: roboflow

Computer Vision is mainly tackled as a supervised learning task. To create a powerful classification or detection model, we need a lot of labeled data. Luckily there are many resources to start with, which offer huge pre-labelled datasets for a variety of computer vision tasks, like human pose estimation, medical imaging, autonomous driving, etc. But what if there is no labelled data for the problem at hand? In this case, we will have to create our own representative dataset! One of the key aspects of creating a custom dataset is the augmentation step.

In this article I will share my workflow to enlarge any dataset for object detection using image augmentation. Image augmentation plays a major role in computer vision to reduce the labelling time significantly while generating training data that will make your model generalize better!

What is Augmentation and why does it work so well?

Augmentation is the process of changing the input data while preserving its context. In the case of images, this means flipping the image, adding noise to it, rotating it, changing the color temperature, and so on. By doing this, we simulate different real-world scenarios of the same object. For example, we can ensure that our model will be robust against changes in camera orientation, lighting conditions and much more. Our advantage is that we can do all of this with a single image multiple times by randomly mixing different augmentation techniques, such as flipping and color changing, or rotating and adding noise. This works because each of those augmented images will look different to an untrained model, thus helping it to generalize better. Of course, we should not overuse the power of augmentation. For example, it will not work well to label only a single image, augment it a few hundred times and use the result as a stand-alone dataset. But a factor of up to 10 is reasonable if the differences between the augmented images are large enough.

Augmentation becomes even more important if we want to detect objects in an image. Drawing bounding boxes is a very time-consuming and, to be honest, also a boring task. This is why I will save you precious time and show you how to augment images with their existing bounding boxes after the labeling process.

What do we need for Augmentation?

Before I show you how the magic works, I will outline what you need to follow the tutorial.

Our dataset will follow the PASCAL VOC format, which is commonly used by models available in the model zoo of the TensorFlow Object Detection framework.

  1. images → in .jpg format
  2. annotations → a .csv file that stores all bounding boxes for all images of interest
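
For reference, the combined annotations .csv could look like this. The column layout shown here follows the convention used in common TensorFlow object detection tutorials; the filenames and values are purely illustrative:

```csv
filename,width,height,class,xmin,ymin,xmax,ymax
image_01.jpg,640,480,face,104,23,189,112
image_01.jpg,640,480,face,310,45,402,140
image_02.jpg,640,480,face,58,210,131,295
```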

There are a lot of great tools that help to speed up and streamline the process of image data labeling. From my own experience I can recommend the following two:

  1. LabelImg is a Python-based labeling tool that has a user-friendly UI, offers all the tools needed for object detection, and can read from and write to your local project folders, which is a nice bonus. Setup is straightforward, as explained on the GitHub page. https://github.com/tzutalin/labelImg
  2. MakeSense.ai is a web-based labeling tool that not only allows you to drag bounding boxes around objects but also to label key points as well as segmentations on the image. If you are doing the project for educational purposes and only want to try out manual labeling of well-known objects yourself, you can also ask the AI of MakeSense to label the data for you. It will make recommendations on each image that you can accept or deny, which is pretty neat. https://www.makesense.ai/.

I would recommend exporting the annotations from MakeSense.ai directly as a .csv file. If you have used LabelImg, you will need to convert the .xml annotations into a single .csv file. A conversion script could look similar to this one:
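
The following is a minimal sketch of such a conversion script. It assumes the .xml files sit together in one folder and that you want the csv column layout described above; the function name and folder paths are my own choices, not fixed by the tutorial:

```python
# Collect all LabelImg / PASCAL VOC .xml files from a folder and
# write one combined annotations DataFrame / .csv file.
import glob
import os
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_csv(xml_dir: str) -> pd.DataFrame:
    """Parse every PASCAL VOC .xml file in xml_dir into one DataFrame."""
    rows = []
    for xml_file in glob.glob(os.path.join(xml_dir, "*.xml")):
        root = ET.parse(xml_file).getroot()
        filename = root.find("filename").text
        width = int(root.find("size/width").text)
        height = int(root.find("size/height").text)
        # one row per bounding box, not per image
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append({
                "filename": filename,
                "width": width,
                "height": height,
                "class": obj.find("name").text,
                "xmin": int(box.find("xmin").text),
                "ymin": int(box.find("ymin").text),
                "xmax": int(box.find("xmax").text),
                "ymax": int(box.find("ymax").text),
            })
    return pd.DataFrame(rows)
```

Calling `xml_to_csv("annotations").to_csv("annotations.csv", index=False)` then produces the single .csv file we need.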

Let’s augment images and bounding boxes!

The main package we need for the augmentation task is imgaug. This awesome package allows you to apply all kinds of augmentation techniques to any image. As a bonus, it will also adjust the coordinates of the bounding boxes accordingly, which makes it the perfect solution for me. So now let’s have a look at the steps we need to take:

First, we will define a sequential object that defines what kind of augmentation techniques will be applied to each image of our dataset. You can have a look at all supported augmentation techniques here. The ones I have added here do the following:

  • Fliplr → flip the image from left to right with a probability of 20%
  • Flipud → flip the image upside down with a probability of 20%
  • Crop → crop the image with a factor between 0 and 0.3 (30% of total size)
  • Sometimes → this augmenter holds a list of augmenters that are all applied with the same probability.
  • GaussianBlur → blur the image with a predefined value range for sigma
  • LinearContrast → change the contrast to adapt to different lighting conditions or camera presets
  • AdditiveGaussianNoise → apply noise that simulates different camera quality levels or that occurs under challenging lighting conditions or when using digital zoom.
  • Multiply → make images brighter or darker; be careful with the settings. Since it changes the color temperature per channel, it is hard to find settings that result in realistic images.
  • Affine → this allows us to move the image in all possible directions. Be careful with rotations: studies have shown that you should apply at most 3° of rotation, otherwise the bounding boxes will no longer be positioned correctly.

Next, we will define the function that is augmenting a single image with the predefined Sequential object.

  1. load the image
  2. extract all available bounding boxes, given as coordinates x1, y1, x2, y2, from the table.
  3. merge all bounding boxes into a single object called BoundingBoxesOnImage
  4. create a list of placeholders for the number of augmentations we want to apply to the same image.
  5. use the sequential object to augment the images together with the bounding boxes.

Now we need a function that saves the augmented images and stores the new bounding box coordinates in a new table. We can also resize the image and the bounding boxes to the specific shape requested by the object detection model. This saves GPU memory, so you can use bigger batch sizes and train faster.

  1. resize image to new shape
  2. remove bounding boxes that are mostly outside the augmented image, which can happen with cropping. We can check, for example, whether at least 20% of the bounding box’s area is still on the image.
  3. resize the bounding boxes as well to the new shape
  4. store the coordinates, class label, height, width as well as the image name of each bounding box in the new table.
  5. if the image has at least one bounding box remaining after augmentation, we will also store the image.

Putting it all together we have the following script:
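
A possible driver for the whole pipeline, assuming helper functions named `augment_image` and `save_augmentations` that implement the two steps described above, plus the Sequential object `seq` (all three names are placeholders, as are the file and folder names):

```python
import os

import pandas as pd


def main():
    """Augment every labelled image and write a new annotations csv."""
    image_dir = "images"
    out_dir = "augmented"
    os.makedirs(out_dir, exist_ok=True)

    annotations = pd.read_csv("annotations.csv")
    new_rows = []
    # process all boxes of one image together so they stay consistent
    for image_name, group in annotations.groupby("filename"):
        aug_images, aug_bbs = augment_image(
            os.path.join(image_dir, image_name), group, seq,
            n_augmentations=5,
        )
        new_rows += save_augmentations(aug_images, aug_bbs,
                                       image_name, out_dir)

    pd.DataFrame(new_rows).to_csv("annotations_augmented.csv", index=False)
```

Calling `main()` from your project folder then produces the augmented images plus a fresh annotations file.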

We are now able to enlarge any dataset with given bounding boxes to our needs, with maximum freedom to tweak the images as we want. The object detection pipeline of TensorFlow also does augmentation on the fly, but there you can only add basic transformations like flipping. Moreover, we cannot visualize those augmentations and check whether they are realistic or not. From my personal experience, resizing the images as well is the icing on the cake, as it reduces the size of the dataset and thus speeds up the whole pipeline during training. So let’s have a look at a possible outcome of our augmentation pipeline using the facial mask dataset from roboflow.

Pretty cool, huh? We see a combination of cropping, flipping and noise addition, which could realistically simulate a digital zoom from a different perspective.

Full code is available in this Github Project.

Take away

Image augmentation for object detection tasks reduces labelling time, improves the generalization of our model and, if we resize the images as well, speeds up training thanks to the reduced dataset size.


Alex Vaith

Machine Learning Engineer / Data Scientist who likes to learn new stuff about AI every day.