Introduction
The integration and reuse of information have been prominent research topics in recent decades. In this context, we propose a method for measuring 3D points. With the rapid progress of computer vision techniques, vision-based methods are now commonly used for many kinds of tasks across a wide range of applications. The main reason is that effective algorithms and low-cost devices, such as small-scale cameras, have emerged in the last several years and made such tasks easy to accomplish. One of the most important and frequent uses of vision-based systems is 3D point localization, which is also a fundamental technique for many modern engineering problems. Stereo vision is one of the most popular approaches to localizing 3D points: a scene is imaged from two different perspectives. This method typically measures binocular disparity by extracting corresponding feature points between the two images, and triangulation is then used to recover the scene in the 3D world. However, the feature-correspondence step is challenging and tends to fail in textureless image regions. Stereo vision is also limited by the baseline distance between the two cameras or images. Apart from stereo methods, newer depth-estimation methods exploit monocular cues from a single image, for instance using an RGB-D camera (Endres, Hess, Sturm, Cremers, & Burgard, 2017), a camera with ultrasound, or a camera with a time-of-flight (ToF) sensor (Dashpute, Anand, & Sarkar, 2018; Ermert et al., 2000; Francis, Anavatti, & Garratt, 2011; Galarza, Martin, & Adjouadi, 2018; Lee, Song, Choi, & Ho, 2011; Song & Ho, 2017; Sugimoto, Kanie, Nakamura, & Hashizume, 2012), which estimates depth by measuring the time of flight between the camera sensor and the object point.
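The disparity-to-depth triangulation just described can be sketched as follows for the idealized case of a rectified stereo pair, where depth Z relates to disparity d through Z = f·B/d (f: focal length in pixels, B: baseline). This is a minimal illustration of the standard pinhole model, not the paper's own pipeline, and all parameter values below are hypothetical.

```python
# Depth from binocular disparity for an ideal rectified stereo pair.
# Assumed pinhole model: Z = f * B / d,
#   X = (x - cx) * Z / f,  Y = (y - cy) * Z / f.
# All camera parameters below are hypothetical example values.

def triangulate_point(x_left, y_left, disparity, f, baseline, cx, cy):
    """Recover the 3D point (X, Y, Z) from a pixel in the left image
    and its disparity, under the ideal rectified pinhole model."""
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    z = f * baseline / disparity      # depth from disparity
    x = (x_left - cx) * z / f         # back-project to camera X
    y = (y_left - cy) * z / f         # back-project to camera Y
    return x, y, z

# Usage with hypothetical parameters: f = 800 px, baseline = 0.1 m,
# principal point at (320, 240), measured disparity of 20 px.
X, Y, Z = triangulate_point(400, 240, 20.0,
                            f=800.0, baseline=0.1, cx=320.0, cy=240.0)
print(X, Y, Z)  # Z = 800 * 0.1 / 20 = 4.0 m
```

The sketch also makes the baseline limitation visible: for a fixed disparity resolution, a shorter baseline B yields a coarser depth quantization.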
These methods require additional instruments or complicated cameras, which increase the cost to some degree. Methods that predict depth from semantic labels or from supervised learning exploit cues in a single image, such as texture variations, gradients, colors, and other details, to predict scene depth. However, these learning-based methods need large training datasets for every practical environment (Saxena, Sun, & Ng, 2009), so they lack flexibility compared with model-based methods. In contrast with the above approaches, the defocus method uses defocused images to estimate distance directly; it can recover real depth with no instrument other than an ordinary camera. The mechanism behind this method is a definite relationship, grounded in the properties of the camera lens, between the degree of defocus and the scene depth. With a well-calibrated lens system, the defocus method can estimate depth with relatively good accuracy, as discussed in (Lai, Fu, & Chang, 1992; Pasinetti, Bodini, Lancini, Docchio, & Sansoni, 2017; Pentland, Darrell, Turk, & Huang, 1989). This property of camera lenses motivates measuring the positions of multiple 3D points from a single defocused image taken by an ordinary camera. However, depth estimated from defocused images is generally coarse, since the defocus model of (Pentland, Darrell, Turk, & Huang, 1989) is not perfect. Moreover, the accuracy of this model degrades as the distance becomes very small or very large in some scenarios, as will be discussed below. To overcome this issue and localize multiple 3D points, an intuitive idea is to incorporate dense depth reconstruction from multiple images (Stuhmer, Gumhold, & Cremers, 2010) into depth estimation from defocus.
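The lens relationship underlying depth from defocus can be sketched with the thin-lens model commonly used in this literature: for a point at distance u, the blur-circle diameter on the sensor is b = A·v·|1/f − 1/v − 1/u|, and for an object beyond the in-focus plane this can be inverted to recover u from a measured blur. This is a simplified sketch of the general principle, not the paper's calibrated model, and all numeric values are hypothetical.

```python
# Thin-lens depth-from-defocus sketch.
# Assumed model: blur-circle diameter b = A * v * |1/f - 1/v - 1/u|,
# where f = focal length, A = aperture diameter, v = lens-to-sensor
# distance, u = object distance. The inversion assumes the object lies
# beyond the in-focus plane. All values are hypothetical (meters).

def blur_diameter(u, f, A, v):
    """Blur-circle diameter on the sensor for a point at distance u."""
    return A * v * abs(1.0 / f - 1.0 / v - 1.0 / u)

def depth_from_blur(b, f, A, v):
    """Invert the blur model for an object beyond the in-focus plane."""
    return 1.0 / (1.0 / f - 1.0 / v - b / (A * v))

f, A, v = 0.05, 0.02, 0.052   # 50 mm lens, 20 mm aperture, sensor 52 mm away
u_true = 3.0                  # object 3 m away, beyond the ~1.3 m focus plane
b = blur_diameter(u_true, f, A, v)
u_est = depth_from_blur(b, f, A, v)
print(u_est)  # recovers 3.0 m up to floating-point error
```

The inversion also hints at the accuracy issue noted above: as u grows, 1/u changes very slowly, so a small error in the measured blur b maps to a large error in the recovered depth.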
The two approaches complement each other well, so that the depth estimate can be improved and the scene's structure can be recovered at real scale.