1. Introduction
Correct localisation of fruit is a necessary step, especially when trying to locate small oranges in an orchard image (Bargoti et al., 2017). Typically, prior work utilises hand-engineered features to encode visual attributes that discriminate fruit from non-fruit regions. Although these approaches are well suited to the datasets they are designed for, the feature encoding is generally unique to a specific fruit and to the conditions under which the data were captured. More recently, advances in the computer vision community have translated to computer vision in agriculture (Lottes et al., 2018), achieving state-of-the-art results with deep neural networks for object detection and semantic image segmentation. These networks avoid the need for hand-engineered features by automatically learning feature representations that discriminatively capture the data distribution. Deep-neural-network-based detectors have been demonstrated to be effective for fruit detection (Berenstein et al., 2018).
Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch adds only a small computational overhead, enabling a fast system and rapid experimentation. In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction.
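The parallel-head structure described above can be sketched at the shape level as follows. The weights here are random placeholders and all sizes (number of RoIs, channels, classes, mask resolution) are illustrative assumptions, not the paper's exact configuration; the point is how the class, box, and mask branches sit side by side on the same pooled RoI features.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, S = 3, 256, 14          # RoIs, feature channels, pooled RoI size (assumed)
K, M = 80, 28                 # object classes, mask resolution (assumed)

roi_feats = rng.standard_normal((N, C, S, S))    # pooled features, one per RoI

# Classification / box-regression branch: flatten, then linear projections
# (random stand-ins for the trained fully connected layers).
flat = roi_feats.reshape(N, -1)
cls_logits = flat @ rng.standard_normal((C * S * S, K))       # (N, K)
box_deltas = flat @ rng.standard_normal((C * S * S, K * 4))   # (N, K*4)

# Mask branch: in parallel, a small FCN maps (C, S, S) -> (K, M, M);
# a random tensor stands in for that FCN's output here.
mask_logits = rng.standard_normal((N, K, M, M))

# One binary mask per class: select the mask for the predicted class and
# apply a per-pixel sigmoid, so classes never compete via a softmax over masks.
pred_cls = cls_logits.argmax(axis=1)                                 # (N,)
masks = 1.0 / (1.0 + np.exp(-mask_logits[np.arange(N), pred_cls]))  # (N, M, M)
```

Because the mask head runs only on the pooled RoI features, it adds little cost relative to the backbone, which is what enables the "small computational overhead" noted above.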
To fix the misalignment, a simple quantization-free layer called RoIAlign is proposed that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, it was found essential to decouple mask and class prediction: a binary mask is predicted for each class independently, without competition among classes, relying on the network's RoI classification branch to predict the category.
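The key idea of RoIAlign is to sample feature values at exact, continuous bin locations via bilinear interpolation instead of rounding coordinates to the feature grid, as RoIPool does. A minimal single-channel sketch, assuming one sample point per output bin (the paper averages several; function names here are illustrative):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Read feat at continuous coords (y, x) via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx)
            + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx)
            + feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_size):
    """Pool an RoI (y1, x1, y2, x2, continuous feature-map coords) to an
    out_size x out_size grid, sampling at each bin centre with NO rounding."""
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feat,
                                        y1 + (i + 0.5) * bh,
                                        x1 + (j + 0.5) * bw)
    return out

# On a feature map that varies linearly (feat[y, x] = y + x), bilinear
# sampling is exact, so fractional box coordinates are preserved faithfully.
feat = np.add.outer(np.arange(6.0), np.arange(6.0))
pooled = roi_align(feat, (1.2, 0.4, 3.2, 2.4), out_size=2)
```

Because no coordinate is ever snapped to the grid, gradients flow smoothly through the sampling, which is what restores the pixel-to-pixel alignment the mask branch needs.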
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task, including the heavily engineered entries from the 2016 competition winner. As a by-product, the method also excels on the COCO object detection task. In ablation experiments, multiple basic instantiations were evaluated, which demonstrates the framework's robustness and allows the effects of core factors to be analyzed. The model can run at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. It is believed the fast training and testing speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation. Finally, the generality of the framework is showcased via the task of human pose estimation on the COCO keypoint dataset. By viewing each keypoint as a one-hot binary mask, Mask R-CNN can be applied with minimal modification to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN can therefore be seen more broadly as a flexible framework for instance-level recognition that can be readily extended to more complex tasks.
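The "keypoint as a one-hot binary mask" encoding mentioned above is simple to state concretely: each keypoint becomes a spatial grid with a single foreground pixel, and the keypoint head is then trained like a mask head over those targets. A minimal sketch (the helper name and grid size are assumptions for illustration):

```python
import numpy as np

def keypoint_to_onehot_mask(y, x, size):
    """Encode one keypoint as a size x size one-hot binary mask:
    exactly one pixel is 1, at the keypoint's location."""
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[y, x] = 1
    return mask

# e.g. a "left shoulder" keypoint at row 12, column 40 on a 56 x 56 grid
target = keypoint_to_onehot_mask(12, 40, 56)
```

Because exactly one location is foreground per keypoint, training can use a softmax over the grid locations rather than the independent per-pixel sigmoids used for segmentation masks, which is the minimal modification the text refers to.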
This work was motivated by the problem of detecting small fruits (Das et al., 2015) under occlusion and against overlapping backgrounds for yield estimation, but it also applies to the detection of other small objects under the same conditions. Existing approaches to fruit detection struggle with small fruits that are occluded by leaves or that overlap one another, and overall detection accuracy suffers as a result.