 Research Article
 Open Access
Distance estimation with 2.5D anchors and its application to robot navigation
 Hirotaka Hachiya^{1},
 Yuki Saito^{2},
 Kazuma Iteya^{1},
 Masaya Nomura^{1} and
 Takayuki Nakamura^{1}
 Received: 30 January 2018
 Accepted: 28 August 2018
 Published: 10 September 2018
Abstract
Estimating the distance of a target object from a single image is a challenging task, since large variation in object appearance makes direct regression of the distance difficult. In this paper, to tackle this challenge, we propose 2.5D anchors, which provide candidate distances based on a perspective camera model. These candidates are expected to ease the regression problem, since only the residual from the candidate distance needs to be learned. We show the effectiveness of regression with the proposed anchors by comparing with ordinary regression methods and state-of-the-art 3D object detection methods through experiments on the Pascal 3D+ TV monitor and KITTI car datasets. In addition, we show an example of a practical use of the proposed method in a real-time system, robot navigation, by integrating it with ROS-based simultaneous localization and mapping.
Keywords
 Deep learning
 Monocular camera image
 Distance estimation
 Navigation
Introduction
Detecting a target object in an image is an important task, and recent deep-learning-based methods such as Faster R-CNN [1] and YOLO [2] have enormously advanced its accuracy and speed. However, the location of an object on the image plane provided by object detection methods is often not enough for a real application. A standard approach to measuring distance with a monocular camera is triangulation over a pair of images captured while the camera moves along with a navigation robot [3]. This approach is cost-effective compared with a stereo camera; however, the movement needed to create the disparity is not time-effective. That is, when tracking a target object, a robot may be required to detour in order to create the disparity for measuring the distance, which delays the tracking.
In this paper, to tackle this challenging task, we propose 2.5D anchors, which provide candidate distances using a perspective camera model. These candidates are expected to ease the training of the regression model, since only the small residual between the ground-truth (GT) distance and the candidate distance has to be learned. Using the proposed 2.5D anchor, called the perspective anchor, we extend one of the state-of-the-art object detection methods, Faster R-CNN, and show its performance improvement over ordinary regression methods through experiments with the Pascal 3D+ TV monitor dataset. In addition, we show that the performance of our proposed method is comparable with state-of-the-art 3D object detection methods [5, 6] on the KITTI car dataset. Finally, we show an example of a practical use of the proposed method in a real-time system, robot navigation, by integrating it with simultaneous localization and mapping (SLAM).
Related works
Related to distance measurement from a monocular image, 3D object detection methods have been actively studied recently [5–8]. There are mainly two types of approaches to 3D object detection: model-based and model-free. The model-based approaches [7, 8] prepare a variety of 3D CAD models, fit them to target objects on the image plane, and infer the 3D position and pose. These approaches provide high performance given appropriate 3D CAD models, but are limited to rigid objects, e.g., cars and TV monitors; non-rigid objects such as humans cannot be detected. Meanwhile, model-free approaches [5, 6] directly perform regression of the dimension and orientation of a 3D box, using a good initial deployment of candidate 3D boxes through subcategories [6] or MultiBin [5]. These methods do not need CAD models and thus can be applied more flexibly to a variety of objects, including humans and animals. In these 3D box detection approaches, both the 2D BB and the 3D box are detected, and the projection matrix from the 3D box to the 2D BB is estimated to obtain the 3D position of the object. However, there are as many as 8 target variables: 4 for the 2D BB and 4 for the 3D box dimension (height, width, and length) and orientation about the Y-axis in camera coordinates. Annotating these 8 target variables could be expensive, since visual inspection by a human is necessary; in particular, annotating 3D boxes on a 2D image is difficult due to unseen parts of the object.
Therefore, in this paper, we propose a direct distance estimation method by extending a 2D object detection method, Faster R-CNN [1], to 2.5D object detection, i.e., 2D BB and distance.
Faster RCNN
2D anchors
A noteworthy mechanism of Faster R-CNN is its use of anchors as candidate BBs; these anchors can cover a variety of BBs for multiple types of objects with various sizes and aspect ratios. With such anchors, the regression problem for the BB is simplified to the selection of the most fitting 2D anchor and the regression of the residual between that anchor and the GT BB.
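The anchor mechanism described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it uses the standard Faster R-CNN box parameterization \((t_x, t_y, t_w, t_h)\) and the 3-shape, 3-scale anchor configuration used later in the experiments.

```python
import numpy as np

def make_2d_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate 3-shape x 3-scale anchors centred at the origin,
    as (x_min, y_min, x_max, y_max) boxes."""
    anchors = []
    for r in ratios:
        for s in scales:
            area = (base_size * s) ** 2
            w = np.sqrt(area / r)   # width for aspect ratio r = h / w
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def encode_bb_residual(anchor, gt):
    """Residual targets (t_x, t_y, t_w, t_h) between an anchor and a GT box,
    following the usual Faster R-CNN parameterization."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + aw / 2, anchor[1] + ah / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + gw / 2, gt[1] + gh / 2
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])
```

When an anchor already coincides with the GT box, all four residuals are zero, which is what makes the regression easier than predicting the box from scratch.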
2.5D bounding box estimation
From this observation, one advantage of estimating 2.5D BBs over the 3D box approaches [5, 6], in terms of distance estimation, is the feasibility of data annotation; i.e., the 3D box approaches need 8 different annotations for the 2D BB and the 3D box dimension and orientation. Let us also clarify the difference in target applications between the 3D approaches and our 2.5D BB approach. The target applications of the 3D approaches are to localize a variety of target objects, e.g., cars, bicycles, and pedestrians, in an arbitrary 3D space and to produce a 3D visualization of the objects, such as a bird's-eye view rendered with computer graphics (e.g., Fig. 4 in [6]). Meanwhile, our target application is a simple distance measurement of specific target objects, such as cars on roads or TV monitors in rooms. That is, our approach assumes that the target objects have relatively small variance in size; for example, a set of target objects ranging from miniature cars to real cars would be out of scope.
To perform such 2.5D BB estimation in Faster R-CNN, we extend the 2D anchor \({\mathbf {a}}_\text {2D}^j\) to the 2.5D anchor \({\mathbf {a}}_\text {2.5D}^j=({\mathbf {a}}_\text {2D}^{j\top },z^j)^\top\), which additionally contains a candidate distance \(z^j\). As in the original Faster R-CNN, the key to success lies in a good design of the anchors. To this end, we propose the perspective anchor, based on the perspective camera model.
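As a rough sketch of how a candidate distance \(z^j\) might be attached to each 2D anchor: the exact construction is given in the "Perspective anchor" section below, so the formula here is an assumption. It only uses the perspective-model property that apparent size is inversely proportional to distance, together with a hypothetical reference height.

```python
import numpy as np

def make_perspective_anchors(anchors_2d, base_distance):
    """Attach a candidate distance z^j to each 2D anchor a_2D^j.
    Assumption: under a perspective camera, apparent height scales as 1/z,
    so an anchor twice as tall corresponds to half the base distance.
    The reference height (smallest anchor) is an illustrative choice."""
    heights = anchors_2d[:, 3] - anchors_2d[:, 1]
    ref_h = heights.min()                       # hypothetical reference scale
    z = base_distance * ref_h / heights         # z^j inversely prop. to size
    return np.hstack([anchors_2d, z[:, None]])  # rows are (a_2D^j, z^j)
```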
Perspective camera model
Base distance
Perspective anchor
Setting of base distance
Training with perspective anchors
Evaluation
In this section, we evaluate the performance of the proposed Faster R-CNN with 2.5D anchors using the PASCAL 3D+ TV monitor [9] and KITTI car [4] datasets. We implement the proposed method by extending the py-faster-rcnn code available on GitHub [10]. Our code will also be made available on GitHub at https://github.com/hirotakahachiya.
Evaluation metric
Evaluation on Pascal 3D+ TV monitor

Distance regression given 2D region proposals (ordinary regression): the regression of the distance is performed in a fully connected (FC) network for each 2D BB selected by the region proposal (RP) network^{2} (see the network architecture in Fig. 4).

Distance regression with 2D region proposals and MultiBin: following Eq. 5 of [5], the regression of residuals from the mean distance computed over the training data \(\mathcal {D}\) is performed in an FC network for each 2D BB selected by the pretrained RP network.

Regression using 2.5D anchors with a fixed base distance b set to each of \(\{1,2,3,4,5,6,7,8,9,10\}\): as 2D anchors, 3-shape and 3-scale (8, 16, and 32) anchors are used [1].

Regression using 2.5D anchors with the base distance b set to each of the \(\{3,5,10\}\)-percentiles of the GT distances in the training data \(\mathcal {D}\) (see Eq. 12): as 2D anchors, 3-shape and 3-scale (8, 16, and 32) anchors are used [1].
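The percentile-based base distance in the last variant can be computed directly from the training distances; Eq. 12 may normalize differently, so this one-liner is only a sketch of the idea.

```python
import numpy as np

def base_distance_from_percentile(gt_distances, alpha):
    """Set the base distance b as the alpha-percentile of the GT distances
    in the training data (cf. Eq. 12; alpha in {3, 5, 10} in the experiments)."""
    return np.percentile(gt_distances, alpha)

# Example: GT distances 1..100, 10-percentile base distance
b = base_distance_from_percentile(np.arange(1.0, 101.0), 10)  # -> 10.9
```

A low percentile places the base distance near the closest training objects, so the perspective anchors span the near field densely while farther candidates come from smaller anchors.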
Average precision (AP) over foreground thresholds \(\{0, 0.1, 0.2,\ldots ,0.9, 1\}\) [16], precision of distance (PD), and total average precision (AP \(\times\) PD) for TV monitor in Pascal 3D+
Regression method  AP  PD  AP \(\times\) PD 

With region proposals  0.77  0.47  0.36 
With region proposals and MultiBin  0.78  0.88  0.69 
With perspective anchors \(\alpha =3\)  0.78  0.91  0.71 
With perspective anchors \(\alpha =5\)  0.77  0.95  0.73 
With perspective anchors \(\alpha =10\)  0.79  0.95  0.75 
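One plausible reading of the PD metric in the tables (its exact definition belongs to the "Evaluation metric" section, so the tolerance below is an assumption) is the fraction of matched detections whose relative distance error falls below a threshold; the total score then multiplies detection and distance quality.

```python
import numpy as np

def precision_of_distance(z_est, z_gt, tol=0.1):
    """Assumed PD: fraction of matched detections whose relative distance
    error |z_est - z_gt| / z_gt is below a tolerance (tol is illustrative)."""
    z_est, z_gt = np.asarray(z_est), np.asarray(z_gt)
    err = np.abs(z_est - z_gt) / z_gt
    return float(np.mean(err < tol))

# Total score as reported in the tables: total = AP * PD
```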
Overlap level of 2.5D anchors
Average IoU, precision of distance (PD), and relative distance error of the closest 2.5D anchors to the GTs (BB and distance) in the Pascal 3D+ TV dataset
2.5D anchor  Avg. \(\mathbf {IoU}\)  Avg. \(\mathbf {E}_\text {dist}\)  PD (without regression) 

Fixed depth, \(z = 0\)  0.45  1.0  0.0 
Fixed depth, \(z = \bar{z}\)  0.45  0.39  0.29 
Perspective anchor, \(\alpha =3\)  0.45  0.26  0.50 
Perspective anchor, \(\alpha =5\)  0.45  0.27  0.49 
Perspective anchor, \(\alpha =10\)  0.45  0.33  0.35 
Figures 10, 11 and 12 depict examples of object detection and distance measurement of TV monitors in the case of the 3-percentile b. They show that small, medium, and large TV monitors can be detected and their distances estimated accurately (GT and estimation are depicted in green and red, respectively). Note that although the distance values in Pascal 3D+ are not absolute, this is not problematic for evaluation purposes.
Evaluation on KITTI car

The 2D BB and the dimension and orientation of the 3D box of the object are detected by regression with subcategories or MultiBin.

Using the estimated corners of the 2D and 3D boxes, the projection matrix (including translation and rotation) between the 2D and 3D boxes is estimated by solving an optimization problem.
In more detail, in the 3D box detection approaches [5, 6], since both the 2D BB and the 3D box are detected, there are 8 annotated target variables in total: 4 for the 2D BB and 4 for the 3D box dimension and orientation (height, width, length, and orientation about the Y-axis in camera coordinates). Meanwhile, our proposed method estimates only the 2D BB and distance, so there are only 5 target variables, i.e., \((x_\text {min}^i, y_\text {min}^i, x_\text {max}^i, y_\text {max}^i, z^i)\).
This difference is a great advantage of our method in real applications, since annotating 3D boxes by hand is prohibitively expensive. That is, for each target object in many images, a 2D BB and a 3D box need to be annotated manually, with careful consideration of the correct orientation and dimension of the object [4]. Meanwhile, in our method, if a laser sensor calibrated with the camera is available in the data collection phase, as in the KITTI dataset, the distance annotation can be performed systematically given a human-annotated 2D BB, e.g., by taking the average of the corresponding distances measured by the laser sensor. We note that if a laser sensor is not available in the data collection phase, the distance can be annotated later from a single image using a CAD model of the target object, as shown in the “Application to navigation” section.
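The laser-based annotation procedure sketched above could look like the following; the function name and the inside-the-box averaging rule are illustrative assumptions, and the laser points are assumed to be already projected into pixel coordinates via the camera calibration.

```python
import numpy as np

def annotate_distance(bb, points_uv, depths):
    """Systematic distance annotation given a human-annotated 2D BB:
    average the depths of calibrated laser points projecting inside the box.
    bb: (x_min, y_min, x_max, y_max); points_uv: (N, 2) pixel coordinates;
    depths: (N,) laser ranges. Returns NaN if no point falls in the box."""
    x0, y0, x1, y1 = bb
    inside = ((points_uv[:, 0] >= x0) & (points_uv[:, 0] <= x1) &
              (points_uv[:, 1] >= y0) & (points_uv[:, 1] <= y1))
    return float(depths[inside].mean()) if inside.any() else float("nan")
```

In practice one might prefer a robust statistic (e.g., the median) over the mean, since points near the box border can belong to the background.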
Overall, the experimental results with the Pascal 3D+ TV monitor and KITTI car datasets show that the proposed method, 2.5D anchors, is a promising approach for distance regression and even for 3D localization from a single camera image.
Application to navigation
Conclusion
In this paper, we have proposed 2.5D anchors (called perspective anchors), designed based on the perspective camera model, which are suitable for both bounding box and distance estimation in Faster R-CNN. Through experiments with the Pascal 3D+ TV monitor and KITTI car datasets, we have shown the effectiveness of the proposed method in distance estimation and even in 3D localization. In addition, we have demonstrated an example of a practical use of the proposed method in a real-time system, robot navigation, with ROS-based simultaneous localization and mapping (SLAM).
In this paper, we considered estimating the distance of a specific target-object category, i.e., TV monitors, cars, or humans. However, in a real application such as an autonomous driving system, multiple target objects, e.g., pedestrians and cars, need to be treated at the same time. Thus, multiple 3D object distance measurement will be our future work. Although we believe that our method can be flexibly extended to such a case by assigning multiple base distances to each 2D anchor, further research is needed to investigate an efficient way to handle the multi-class distance regression problem.
In addition, in this paper, we extended one of the state-of-the-art object detection methods, Faster R-CNN [1]. Recently, more advanced object detection methods have appeared, e.g., YOLOv2 [2], which provide better and faster performance. Thus, extending such advanced methods with the concept of our proposed 2.5D anchor will also be future work.
The RP network is pretrained with the 2D BBs \(\{b_\text {2D}^{*i}\}_{i=1}^{N_\text {train}}\) in the training data \(\mathcal {D}\) and fixed when training the distance regression.
Declarations
Authors’ contributions
HH and YS conceived of the presented idea, designed the presented algorithm, and carried out the implementation of the algorithm. KI and MN implemented the system of presented robot navigation. HH, KI and MN carried out experiments. HH developed the theoretical formalism, carried out the analysis of the experimental results, and wrote the manuscript with the support of YS and TN. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP17H06871. We appreciate Revast Co., Ltd. and Dr. Yuta Kanuki for providing us with the mobile robot Mercury and images captured in the Tsukuba Challenge.
Competing interests
The authors declare that they have no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems (NIPS)
 Redmon J, Farhadi A (2016) YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242
 Mur-Artal R, Tardós JD (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans Robot 31(5):1147–1163
 Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proceedings of the 2012 IEEE conference on computer vision and pattern recognition (CVPR2012)
 Mousavian A, Anguelov D, Flynn J (2017) 3D bounding box estimation using deep learning and geometry. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition (CVPR2017)
 Xiang Y, Choi W, Lin Y, Savarese S (2017) Subcategory-aware convolutional neural networks for object proposals and detection. In: Proceedings of the 2017 IEEE winter conference on applications of computer vision (WACV2017)
 Chabot F, Chaouch M, Rabarisoa J, Teuliere C, Chateau T (2017) Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition (CVPR2017)
 Xiang Y, Choi W, Lin Y, Savarese S (2015) Data-driven 3D voxel patterns for object category recognition. In: Proceedings of the 2015 IEEE conference on computer vision and pattern recognition (CVPR2015), pp 1903–1911
 Xiang Y, Mottaghi R, Savarese S (2014) Beyond PASCAL: a benchmark for 3D object detection in the wild. In: Proceedings of the 2014 IEEE winter conference on applications of computer vision (WACV2014)
 Girshick R. Faster R-CNN (Python implementation). https://github.com/rbgirshick/pyfasterrcnn
 Website of Tsukuba Challenge (2017) http://www.tsukubachallenge.jp
 Website of ARGO CORPORATION. https://www.argocorp.com/cam/usb2/tis/DxK22xUC03.html
 Website of TAMRON. http://www.tamron.biz/data/ipcctv/cctv_ir/13fm28ir.html
 Xiang Y. Pose\_Dataset. https://github.com/yuxng/Pose_Dataset
 Website of Revast. http://revast.co.jp
 Everingham M, Gool LV, Williams CKI, Winn J, Zisserman A (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vision (IJCV) 88:303–338