Object Detection is the ability to detect multiple classes of objects and identify them within the best possible concise bounding boxes in an image or video or live camera feed. There are various algorithms to achieve object detection like
For eg: let’s classify vehicles and persons in this image. Generally, the classification and location of multiple objects are known as Object detection.
If this article could be summed up in an expression:
(Image Classification + Image Localization) = Object Detection
Fundamentally the solution is to generate outputs that indicate probabilities of the detected object to be a particular class and the location of the bounding box containing that object.
Now here we use a convention to denote target locations of a detected object based on the presumed top-left coordinate
(0,0) in the image and locations as
Y coordinate for the centre of the bounding box.
To clarify, in the image:
The first array of values
[0.10, 0.85, 0.05] represent the generated probabilities of the detected object classes in a range
1. So the detected object was classified as a
5%. Hence the maximum probability of
Bus was selected.
Similarly, the second array of values
[0.3, 0.5, 0.5, 0.25] represent the width and height of the bounding box relative to the size of the overall image. So the centre of the
bus is around
0.3 from the left of the image,
0.5 from the top,
0.5 as wide as image and
0.25 as tall as the image.
Now to understand the process of object detection, we need to understand how to locate and classify multiple objects in an image. Out of various approaches, we take a look at some of the standard approaches.
Applying a classification model to multiple regions of the same image. And this could be achieved by using the
Sliding Windows approach. Here we create usually a square-shaped window and use it to place it in a specific region of the image; mostly starting at the top left part like in the figure below.
Then feed this into a convolutional neural network to classify this patch of the image. And we slide the window along to the neighbouring regions and continue the process to classify all of the window size regions in the image. And we can repeat the same process with different sized windows or on the rescaled versions of the image size so that we can detect objects of varying sizes. Now the only problem with this is that it is computationally impractical for a convolutional neural network to apply classifiers for every position of the window.
Thank you for reading. Please leave comments if you have any. Good day!